* Re: filecache/swapcache questions
From: Kanoj Sarcar @ 1999-06-21 5:29 UTC
To: sct; +Cc: linux-mm

Okay, let's see if I am being stupid again ...

Imagine a process exiting, executing exit_mmap. exit_mmap
cleans out the vma list from the mm, ie sets mm->mmap = 0.
Then, it invokes vm_ops->unmap, say on a MAP_SHARED file
vma, which starts file io, that puts the process to sleep.

Now, a sys_swapoff comes in ... this will not be able to
retrieve the swap handles from the former process (since
the vma's are invisible), so it may end up deleting the
device with a warning message about non 0 swap_map count.

The exiting process then invokes a bunch of swap_free()s
via zap_page_range, whereas the swap id might already have
been reassigned.

If there's no protection against this, a possible fix would
be for exit_mmap not to clean the vma list, rather delete a
vma at a time from the list.

So, what is the call to swap_free doing in filemap_sync_pte?
When will this call ever be executed?

Thanks.

Kanoj
kanoj@engr.sgi.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
* Re: filecache/swapcache questions
From: Stephen C. Tweedie @ 1999-06-21 11:25 UTC
To: Kanoj Sarcar; +Cc: sct, linux-mm

Hi,

On Sun, 20 Jun 1999 22:29:14 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> Imagine a process exiting, executing exit_mmap. exit_mmap
> cleans out the vma list from the mm, ie sets mm->mmap = 0.
> Then, it invokes vm_ops->unmap, say on a MAP_SHARED file
> vma, which starts file io, that puts the process to sleep.

> Now, a sys_swapoff comes in ... this will not be able to
> retrieve the swap handles from the former process (since
> the vma's are invisible), so it may end up deleting the
> device with a warning message about non 0 swap_map count.

> The exiting process then invokes a bunch of swap_free()s
> via zap_page_range, whereas the swap id might already have
> been reassigned.

Agreed.

> If there's no protection against this, a possible fix would
> be for exit_mmap not to clean the vma list, rather delete a
> vma at a time from the list.

Looking at this, we have other problems: the forced swapin caused by
sys_swapoff() doesn't down() the mmap semaphore. That is very bad
indeed. We need to fix it. If we fix it, then we can fix exit_mmap()
at the same time by taking the mmap semaphore while we do the
unmap/close operations.

--Stephen
* Re: filecache/swapcache questions
From: Kanoj Sarcar @ 1999-06-21 16:46 UTC
To: Stephen C. Tweedie; +Cc: linux-mm

>
> Hi,
>
> On Sun, 20 Jun 1999 22:29:14 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
>
> > Imagine a process exiting, executing exit_mmap. exit_mmap
> > cleans out the vma list from the mm, ie sets mm->mmap = 0.
> > Then, it invokes vm_ops->unmap, say on a MAP_SHARED file
> > vma, which starts file io, that puts the process to sleep.
>
> > Now, a sys_swapoff comes in ... this will not be able to
> > retrieve the swap handles from the former process (since
> > the vma's are invisible), so it may end up deleting the
> > device with a warning message about non 0 swap_map count.
>
> > The exiting process then invokes a bunch of swap_free()s
> > via zap_page_range, whereas the swap id might already have
> > been reassigned.
>
> Agreed.
>
> > If there's no protection against this, a possible fix would
> > be for exit_mmap not to clean the vma list, rather delete a
> > vma at a time from the list.
>
> Looking at this, we have other problems: the forced swapin caused by
> sys_swapoff() doesn't down() the mmap semaphore. That is very bad
> indeed. We need to fix it. If we fix it, then we can fix exit_mmap()
> at the same time by taking the mmap semaphore while we do the
> unmap/close operations.
>
> --Stephen
>

I don't agree with you about swapoff needing the mmap_sem. In my
thinking, mmap_sem is needed to preserve the vma list, *if* you
go to sleep while scanning the list. Updates to the vma fields/chain
are protected by kernel_lock and mmap_sem. If you are scanning the
vma list, and are guaranteed not to sleep, why would you need to grab
mmap_sem, if you already have the kernel_lock, like swapoff does?

Yes, but I agree we can play it safe and grab the lock ... that might
make it easier to synchronize with exit_mmap. Let me think about this
and post a possible patch.

Thanks.

Kanoj
kanoj@engr.sgi.com
* Re: filecache/swapcache questions
From: Stephen C. Tweedie @ 1999-06-21 16:57 UTC
To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, linux-mm

Hi,

On Mon, 21 Jun 1999 09:46:19 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> I don't agree with you about swapoff needing the mmap_sem. In my
> thinking, mmap_sem is needed to preserve the vma list, *if* you
> go to sleep while scanning the list. Updates to the vma fields/
> chain are protected by kernel_lock and mmap_sem.

No. mmap_sem protects both the vma list and the page tables. Page
faults hold the mmap semaphore both to protect the vma list and to
protect against concurrent pageins to the same page.

The swapper is currently exempt from the mmap_sem, so the paging code
needs to check whether the current pte has disappeared if it ever
blocks, but it assumes that we never have concurrent pagein occurring
(think threads). swapoff currently breaks that assumption.

--Stephen
* Re: filecache/swapcache questions
From: Kanoj Sarcar @ 1999-06-21 17:36 UTC
To: Stephen C. Tweedie; +Cc: linux-mm

>
> Hi,
>
> On Mon, 21 Jun 1999 09:46:19 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
>
> > I don't agree with you about swapoff needing the mmap_sem. In my
> > thinking, mmap_sem is needed to preserve the vma list, *if* you
> > go to sleep while scanning the list. Updates to the vma fields/
> > chain are protected by kernel_lock and mmap_sem.
>
> No. mmap_sem protects both the vma list and the page tables. Page
> faults hold the mmap semaphore both to protect the vma list and to
> protect against concurrent pageins to the same page.
>
> The swapper is currently exempt from the mmap_sem, so the paging code
> needs to check whether the current pte has disappeared if it ever
> blocks, but it assumes that we never have concurrent pagein occurring
> (think threads). swapoff currently breaks that assumption.
>

But doesn't my previous logic work in this case too? Namely
that kernel_lock is held when any code looks at or changes
a pte, so if swapoff holds the kernel_lock and never goes to
sleep, things should work?

Maybe if you can jot down a quick scenario where a problem occurs when
swapoff does not take mmap_sem, it would be easier for me to spot
which concurrency issue I am missing ...

Thanks.

Kanoj
kanoj@engr.sgi.com
* Re: filecache/swapcache questions
From: Stephen C. Tweedie @ 1999-06-21 17:49 UTC
To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, linux-mm

Hi,

On Mon, 21 Jun 1999 10:36:37 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> But doesn't my previous logic work in this case too? Namely
> that kernel_lock is held when any code looks at or changes
> a pte, so if swapoff holds the kernel_lock and never goes to
> sleep, things should work?

No, because the swapoff could still take place while a normal swapin is
already in progress.

> Maybe if you can jot down a quick scenario where a problem occurs when
> swapoff does not take mmap_sem, it would be easier for me to spot
> which concurrency issue I am missing ...

Look no further than swap_in(), which knows that there is no pte (so
swapout concurrency is not a problem) and it holds the mmap lock (so
there are no concurrent swap_ins on the page). It reads in the page and
unconditionally sets up the pte to point to it, assuming that nobody
else can conceivably set the pte while we do the swap ourselves.

--Stephen
* Re: filecache/swapcache questions
From: Kanoj Sarcar @ 1999-06-21 18:46 UTC
To: Stephen C. Tweedie; +Cc: linux-mm

>
> Hi,
>
> On Mon, 21 Jun 1999 10:36:37 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
>
> > But doesn't my previous logic work in this case too? Namely
> > that kernel_lock is held when any code looks at or changes
> > a pte, so if swapoff holds the kernel_lock and never goes to
> > sleep, things should work?
>
> No, because the swapoff could still take place while a normal swapin is
> already in progress.
>
> > Maybe if you can jot down a quick scenario where a problem occurs when
> > swapoff does not take mmap_sem, it would be easier for me to spot
> > which concurrency issue I am missing ...
>
> Look no further than swap_in(), which knows that there is no pte (so
> swapout concurrency is not a problem) and it holds the mmap lock (so
> there are no concurrent swap_ins on the page). It reads in the page and
> unconditionally sets up the pte to point to it, assuming that nobody
> else can conceivably set the pte while we do the swap ourselves.
>
> --Stephen
>

Hmm, am I being fooled by the comment in swap_in?

/*
 * The tests may look silly, but it essentially makes sure that
 * no other process did a swap-in on us just as we were waiting.
 *

Also, swap_in seems to be revalidating the pte if it goes to sleep:

	if (pte_val(*page_table) != entry) {
		if (page_map)
			free_page_and_swap_cache(page_address(page_map));
		return;
	}

All this while holding kernel_lock ...

So, I am still mystified about why swapoff would need the mmap_sem.

Kanoj
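
Editor's sketch (not part of the thread): the check quoted from swap_in() above is a revalidate-after-sleep pattern. A minimal userspace model follows, assuming a page-table slot can be represented as a plain unsigned long. The names `pte_model_t` and `install_if_unchanged` are hypothetical, not kernel APIs, and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page-table entry: it holds either a
 * swap entry value or the value describing a mapped page. */
typedef unsigned long pte_model_t;

/*
 * Install new_val only if the slot still holds the swap entry that
 * was observed before the (possibly sleeping) read. Returns true when
 * the mapping was installed, and false when the slot changed in the
 * meantime, in which case the caller should discard its page.
 */
bool install_if_unchanged(pte_model_t *slot, pte_model_t entry,
                          pte_model_t new_val)
{
    assert(slot != NULL);
    if (*slot != entry)
        return false;   /* somebody else already paged it in */
    *slot = new_val;
    return true;
}
```

Whether this check is sufficient without mmap_sem is exactly what the thread goes on to debate; the sketch only shows the check itself.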
* Re: filecache/swapcache questions
From: Kanoj Sarcar @ 1999-06-21 23:44 UTC
To: Kanoj Sarcar; +Cc: sct, linux-mm

And continuing on with the problems with swapoff ...

While forking, we copy swap handles from the parent into the child
in copy_page_range. There are of course sleep points in dup_mmap
(kmem_cache_alloc would be one, vm_ops->open could be another).
A swapoff coming in at this point might scan the process list, not
find the nascent child, and just delete the device, leaving the
child referencing the old swap handles.

Regardless of our current discussions about why the mmap_sem is
needed in swapoff to protect ptes, it seems that grabbing it in
swapoff could trivially solve this fork race ... and some code
changes in exit_mmap could also fix the exit race ...

Kanoj
kanoj@engr.sgi.com
* Re: filecache/swapcache questions
From: Andrea Arcangeli @ 1999-06-24 22:23 UTC
To: Kanoj Sarcar; +Cc: sct, linux-mm

On Mon, 21 Jun 1999, Kanoj Sarcar wrote:

>And continuing on with the problems with swapoff ...

I have not yet thought about the races you are talking about in this
thread.

But I think I have seen another potential problem related to swapoff in
the last few days. Consider running swapoff -a while a program is
faulting in a swapin exception. The process is sleeping in
read_swap_cache_async() after having increased the swap count (this is
the only problem). While the task is sleeping, swapoff will swap in the
page and map the swapped-in page into the pte of the process. Then
swapoff continues and sees that the swap count is still > 0 (1 in this
example) even though the page has been swapped in for all tasks in the
system. Swapoff gets confused and sets the swap count to 0 by hand (and
in doing so slightly corrupts the state of the VM). I think I
reproduced the above scenario while stress testing 2.3.8 + my VM
changes (finally "stable" except for the buffer beyond end of the
device problem), but if the problem I have seen is real then it applies
to 2.2.x as well.

Andrea Arcangeli
* Re: filecache/swapcache questions
From: Kanoj Sarcar @ 1999-06-24 23:55 UTC
To: Andrea Arcangeli; +Cc: torvalds, sct, linux-mm

>
> On Mon, 21 Jun 1999, Kanoj Sarcar wrote:
>
> >And continuing on with the problems with swapoff ...
>
> I have not yet thought about the races you are talking about in this
> thread.
>
> But I think I have seen another potential problem related to swapoff in
> the last few days. Consider running swapoff -a while a program is
> faulting in a swapin exception. The process is sleeping in
> read_swap_cache_async() after having increased the swap count (this is
> the only problem). While the task is sleeping, swapoff will swap in the
> page and map the swapped-in page into the pte of the process. Then
> swapoff continues and sees that the swap count is still > 0 (1 in this
> example) even though the page has been swapped in for all tasks in the
> system. Swapoff gets confused and sets the swap count to 0 by hand (and
> in doing so slightly corrupts the state of the VM). I think I
> reproduced the above scenario while stress testing 2.3.8 + my VM
> changes (finally "stable" except for the buffer beyond end of the
> device problem), but if the problem I have seen is real then it applies
> to 2.2.x as well.
>
> Andrea Arcangeli
>

Andrea,

The scenario that you lay out is not possible, as both Stephen and I
pointed out earlier in this thread. swapoff uses read_swap_cache,
so if a process has started a swapin, swapoff will wait for that io
to complete. Note that swapoff can not proceed until the read-in is
complete (at which point the swapcount is decremented by
PG_swap_unlock_after logic). So, it is not possible for swapoff
to see swap count > 0. At least in theory ...

As to why you might be seeing the problem, this might be due to
fork/exit races with swapoff (which I pointed out in this thread),
which I hope to have a fix for sometime soon (although it looks
ugly). Also, see below.

Linus,

The swap lockmap deletion in 2.3.8 is not complete. I hope you will
be taking in Andrea's "shm pages in swapcache" changes (although I
haven't reviewed it, so I can't attest to its goodness). One problem
in 2.3.8 is that a shm page could be getting swapped out, and a
swapoff could actually read the contents of the swaphandle into a
new page, *before* the swapout completed (this was prevented in
2.3.7 in rw_swap_page_base() by swap lockmap checking), since shm
pages are not in the swap cache (thus swapoff would have no way of
synchronizing with the swapout completing). This could lead to shm
data getting corrupted. It could also lead to swapoff manually setting
swapcount to 0, with shm swapout termination also decrementing
swapcount.

Or maybe I am just confused ....

Kanoj
kanoj@engr.sgi.com
* Re: filecache/swapcache questions
From: Andrea Arcangeli @ 1999-06-25 0:26 UTC
To: Kanoj Sarcar; +Cc: torvalds, sct, linux-mm

On Thu, 24 Jun 1999, Kanoj Sarcar wrote:

>The scenario that you lay out is not possible, as both Stephen and I
>pointed out earlier in this thread. swapoff uses read_swap_cache,
>so if a process has started a swapin, swapoff will wait for that io
>to complete. Note that swapoff can not proceed until the read-in is

Sorry, I forgot to specify where the faulting task was sleeping. I
wasn't talking about the case where the faulting task is sleeping on
I/O with the swap-cache page already allocated and hashed into the page
cache. If the task is sleeping waiting for I/O, then I completely agree
with you that swapoff will block in lookup_swap_cache, because it will
see the swap-cache page locked by the faulting task.

In my case the faulting task was sleeping in GFP (maybe swapping out
some stuff in sync mode). If you look at read_swap_cache_async you'll
notice that the task can go to sleep in GFP while holding an additional
reference on the swap space (see swap_duplicate). While the task was
sleeping, swapoff was allowed to allocate a new page in the meantime,
add that new page to the swap cache, start I/O on it, and finally remap
the pte to the new page. Then swapoff continued and noticed that there
was an additional reference in the swap cache even though nobody was
mapping that swapped-out page anymore (the additional reference
belonged to the program sleeping in GFP).

>The swap lockmap deletion in 2.3.8 is not complete. I hope you will
>be taking in Andrea's "shm pages in swapcache" changes (although I

I'll send the shm patch to Linus in the next few days (but I bet nobody
will trigger the race in the meantime, also considering that database
people keep their shm memory unswappable).

Andrea Arcangeli
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
From: Kanoj Sarcar @ 1999-06-28 1:48 UTC
To: Andrea Arcangeli; +Cc: torvalds, sct, linux-mm, Kanoj Sarcar

Linus/Andrea/Stephen,

This is the patch that tries to cure the swapoff races with processes
forking, exiting, and (readahead) swapping by faulting.

Basically, all these operations are synchronized by the process
mmap_sem. Unfortunately, swapoff has to visit all processes, during
which it must hold tasklist_lock, a spinlock. Hence, it can not take
the mmap_sem, a sleeping mutex. So, the patch links up all active
mm's in a list that swapoff can visit (with minor restructuring,
kswapd can also use this, although it can not hold mmap_sem).
Addition/deletions to the list are protected by a sleeping mutex,
hence swapoff can grab the individual mmap_sems, while preventing
changes to the list. Effectively, process creation and destruction
are locked out if swapoff is running.

To do this, the lock ordering is mm_sem -> mmap_sem. To prevent
deadlocks, care must be taken that a process invoking
delete/insert_mmlist does not have its own mmap_sem held. For this,
the do_fork path needs to change so as not to acquire mmap_sem early,
rather only when it is really needed. This does not open up a
resource-ordering problem between kernel_lock and mmap_sem, since
the kernel_lock is a monitor lock that is released at schedule time,
so no deadlocks are possible.

I have just done basic sanity testing on this, I am hoping Andrea
can run his swapoff stress tests to see whether this patch helps
cure the problem he was seeing.

Thanks.

Kanoj
kanoj@engr.sgi.com

--- /usr/tmp/p_rdiff_a009HP/exec.c	Sun Jun 27 16:51:58 1999
+++ fs/exec.c	Sun Jun 27 15:14:43 1999
@@ -399,6 +399,7 @@
 	up(&mm->mmap_sem);
 	mm_release();
 	mmput(old_mm);
+	insert_mmlist(mm);
 	return 0;
 
 	/*
--- /usr/tmp/p_rdiff_a009HP/sched.h	Sun Jun 27 16:52:01 1999
+++ include/linux/sched.h	Fri Jun 25 17:22:56 1999
@@ -170,6 +170,8 @@
 	atomic_t count;
 	int map_count;				/* number of VMAs */
 	struct semaphore mmap_sem;
+	struct mm_struct *prev;			/* list of allocated mms */
+	struct mm_struct *next;			/* list of allocated mms */
 	unsigned long context;
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
@@ -191,7 +193,7 @@
 		swapper_pg_dir,				\
 		ATOMIC_INIT(1), 1,			\
 		__MUTEX_INITIALIZER(name.mmap_sem),	\
-		0,					\
+		&init_mm, &init_mm, 0,			\
 		0, 0, 0, 0,				\
 		0, 0, 0,				\
 		0, 0, 0, 0,				\
@@ -611,6 +613,7 @@
 /*
  * Routines for handling mm_structs
  */
+extern struct semaphore mm_sem;
 extern struct mm_struct * mm_alloc(void);
 static inline void mmget(struct mm_struct * mm)
 {
@@ -619,6 +622,22 @@
 extern void mmput(struct mm_struct *);
 /* Remove the current tasks stale references to the old mm_struct */
 extern void mm_release(void);
+static inline void insert_mmlist(struct mm_struct * mm)
+{
+	down(&mm_sem);
+	mm->prev = &init_mm;
+	mm->next = init_mm.next;
+	init_mm.next->prev = mm;
+	init_mm.next = mm;
+	up(&mm_sem);
+}
+static inline void delete_mmlist(struct mm_struct * mm)
+{
+	down(&mm_sem);
+	mm->next->prev = mm->prev;
+	mm->prev->next = mm->next;
+	up(&mm_sem);
+}
 
 extern int copy_thread(int, unsigned long, unsigned long, struct task_struct *, struct pt_regs *);
 extern void flush_thread(void);
--- /usr/tmp/p_rdiff_a009HP/fork.c	Sun Jun 27 16:52:04 1999
+++ kernel/fork.c	Sun Jun 27 15:28:34 1999
@@ -351,6 +351,7 @@
 		release_segments(mm);
 		exit_mmap(mm);
 		free_page_tables(mm);
+		delete_mmlist(mm);
 		kmem_cache_free(mm_cachep, mm);
 	}
 }
@@ -383,7 +384,11 @@
 	retval = new_page_tables(tsk);
 	if (retval)
 		goto free_mm;
+	insert_mmlist(mm);
+
+	down(&current->mm->mmap_sem);
 	retval = dup_mmap(mm);
+	up(&current->mm->mmap_sem);
 	if (retval)
 		goto free_pt;
 	up(&mm->mmap_sem);
@@ -549,7 +554,6 @@
 
 	*p = *current;
 
-	down(&current->mm->mmap_sem);
 	lock_kernel();
 
 	retval = -EAGAIN;
@@ -676,7 +680,6 @@
 	++total_forks;
 bad_fork:
 	unlock_kernel();
-	up(&current->mm->mmap_sem);
 fork_out:
 	if ((clone_flags & CLONE_VFORK) && (retval > 0))
 		down(&sem);
--- /usr/tmp/p_rdiff_a009HP/mmap.c	Sun Jun 27 16:52:07 1999
+++ mm/mmap.c	Sun Jun 27 15:20:08 1999
@@ -39,6 +39,8 @@
 /* SLAB cache for vm_area_struct's. */
 kmem_cache_t *vm_area_cachep;
 
+struct semaphore mm_sem;
+
 int sysctl_overcommit_memory;
 
 /* Check that a process has enough memory to allocate a
@@ -812,6 +814,7 @@
 {
 	struct vm_area_struct * mpnt;
 
+	down(&mm->mmap_sem);
 	mpnt = mm->mmap;
 	mm->mmap = mm->mmap_avl = mm->mmap_cache = NULL;
 	mm->rss = 0;
@@ -843,6 +846,7 @@
 		printk("exit_mmap: map count is %d\n", mm->map_count);
 
 	clear_page_tables(mm, 0, USER_PTRS_PER_PGD);
+	up(&mm->mmap_sem);
 }
 
 /* Insert vm structure into process list sorted by address
@@ -957,6 +961,7 @@
 
 void __init vma_init(void)
 {
+	init_MUTEX(&mm_sem);
 	vm_area_cachep = kmem_cache_create("vm_area_struct",
 					   sizeof(struct vm_area_struct),
 					   0, SLAB_HWCACHE_ALIGN,
--- /usr/tmp/p_rdiff_a009HP/page_alloc.c	Sun Jun 27 16:52:09 1999
+++ mm/page_alloc.c	Sun Jun 27 15:39:58 1999
@@ -385,10 +385,9 @@
 }
 
 /*
- * The tests may look silly, but it essentially makes sure that
- * no other process did a swap-in on us just as we were waiting.
+ * Concurrent swap-in via swapoff is interlocked out.
  *
- * Also, don't bother to add to the swap cache if this page-in
+ * Don't bother to add to the swap cache if this page-in
  * was due to a write access.
  */
 void swap_in(struct task_struct * tsk, struct vm_area_struct * vma,
@@ -400,11 +399,6 @@
 	if (!page_map) {
 		swapin_readahead(entry);
 		page_map = read_swap_cache(entry);
-	}
-	if (pte_val(*page_table) != entry) {
-		if (page_map)
-			free_page_and_swap_cache(page_address(page_map));
-		return;
 	}
 	if (!page_map) {
 		set_pte(page_table, BAD_PAGE);
--- /usr/tmp/p_rdiff_a009HP/swapfile.c	Sun Jun 27 16:52:12 1999
+++ mm/swapfile.c	Sun Jun 27 15:27:49 1999
@@ -259,20 +259,20 @@
 	}
 }
 
-static void unuse_process(struct mm_struct * mm, unsigned long entry,
+static void unuse_mm(struct mm_struct * mm, unsigned long entry,
 			unsigned long page)
 {
 	struct vm_area_struct* vma;
 
 	/*
-	 * Go through process' page directory.
+	 * Go through address space page directory.
 	 */
-	if (!mm || mm == &init_mm)
-		return;
+	down(&mm->mmap_sem);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		pgd_t * pgd = pgd_offset(mm, vma->vm_start);
 		unuse_vma(vma, pgd, entry, page);
 	}
+	up(&mm->mmap_sem);
 	return;
 }
 
@@ -283,8 +283,8 @@
  */
 static int try_to_unuse(unsigned int type)
 {
+	struct mm_struct * mm;
 	struct swap_info_struct * si = &swap_info[type];
-	struct task_struct *p;
 	struct page *page_map;
 	unsigned long entry, page;
 	int i;
@@ -316,10 +316,12 @@
 			return -ENOMEM;
 		}
 		page = page_address(page_map);
-		read_lock(&tasklist_lock);
-		for_each_task(p)
-			unuse_process(p->mm, entry, page);
-		read_unlock(&tasklist_lock);
+		down(&mm_sem);
+		mm = init_mm.next;
+		while (mm != &init_mm) {
+			unuse_mm(mm, entry, page);
+		}
+		up(&mm_sem);
 		shm_unuse(entry, page);
 		/* Now get rid of the extra reference to the temporary
                    page we've been using. */
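
Editor's sketch (not part of the thread): the mm list that the patch threads through init_mm is a circular doubly linked list with a sentinel head. The userspace model below mirrors insert_mmlist()/delete_mmlist() without any locking; the caller is assumed to hold whatever plays the role of mm_sem. All names (`mm_model`, `mm_head`, `mmlist_*`) are hypothetical. Note that the while loop in try_to_unuse() as posted above has no statement advancing `mm`; the traversal here assumes the intent was to step to `mm->next` on each iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct mm_struct. */
struct mm_model {
    struct mm_model *prev, *next;
    int visited;                /* how many times a scan touched us */
};

/* Sentinel head, playing the role init_mm plays in the patch. */
struct mm_model mm_head = { &mm_head, &mm_head, 0 };

/* Link mm in right after the head, as insert_mmlist() does. */
void mmlist_insert(struct mm_model *mm)
{
    assert(mm != NULL);
    mm->prev = &mm_head;
    mm->next = mm_head.next;
    mm_head.next->prev = mm;
    mm_head.next = mm;
}

/* Unlink mm from its neighbours, as delete_mmlist() does. */
void mmlist_delete(struct mm_model *mm)
{
    assert(mm != NULL);
    mm->next->prev = mm->prev;
    mm->prev->next = mm->next;
    mm->next = mm->prev = mm;   /* leave the node self-linked */
}

/* Walk every entry once; the loop advances mm on each pass and stops
 * when it comes back around to the sentinel. Returns the entry count. */
int mmlist_visit_all(void)
{
    int n = 0;
    struct mm_model *mm;

    for (mm = mm_head.next; mm != &mm_head; mm = mm->next) {
        mm->visited++;
        n++;
    }
    return n;
}
```

Using the head as a sentinel means insertion and deletion need no special case for an empty list, which is presumably why the patch initializes init_mm's prev/next to point at itself.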
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
From: Andrea Arcangeli @ 1999-06-28 10:35 UTC
To: Kanoj Sarcar; +Cc: torvalds, sct, linux-mm

On Sun, 27 Jun 1999, Kanoj Sarcar wrote:

>This is the patch that tries to cure the swapoff races with processes
>forking, exiting, and (readahead) swapping by faulting.

For the record: at least the read_swap_cache_async race I pointed out can
be fixed without grabbing the mmap semaphore. I agree that grabbing the
semaphore would fix the race though.

Here is the alternate fix:

Index: mm/swap_state.c
===================================================================
RCS file: /var/cvs/linux/mm/swap_state.c,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 swap_state.c
--- mm/swap_state.c	1999/06/14 15:30:09	1.1.1.3
+++ mm/swap_state.c	1999/06/28 10:15:15
@@ -125,7 +125,7 @@
 		"swap_duplicate: entry %08lx, offset exceeds max\n", entry);
 	goto out;
 bad_unused:
-	printk(KERN_ERR
+	printk(KERN_WARNING
 		"swap_duplicate at %8p: entry %08lx, unused page\n",
 	       __builtin_return_address(0), entry);
 	goto out;
@@ -291,20 +291,15 @@
 	       entry, wait ? ", wait" : "");
 #endif
 	/*
-	 * Make sure the swap entry is still in use.
-	 */
-	if (!swap_duplicate(entry))	/* Account for the swap cache */
-		goto out;
-	/*
 	 * Look for the page in the swap cache.
 	 */
 	found_page = lookup_swap_cache(entry);
 	if (found_page)
-		goto out_free_swap;
+		goto out;
 
 	new_page_addr = __get_free_page(GFP_USER);
 	if (!new_page_addr)
-		goto out_free_swap;	/* Out of memory */
+		goto out;	/* Out of memory */
 	new_page = mem_map + MAP_NR(new_page_addr);
 
 	/*
@@ -313,6 +308,11 @@
 	found_page = lookup_swap_cache(entry);
 	if (found_page)
 		goto out_free_page;
+	/*
+	 * Make sure the swap entry is still in use.
+	 */
+	if (!swap_duplicate(entry))	/* Account for the swap cache */
+		goto out_free_page;
 	/*
 	 * Add it to the swap cache and read its contents.
 	 */
@@ -330,8 +330,6 @@
 
 out_free_page:
 	__free_page(new_page);
-out_free_swap:
-	swap_free(entry);
 out:
 	return found_page;
 }

NOTE: this will cause swap_duplicate to generate some warning messages, but
everything will still work fine, precisely because the swapin code already
checks whether the pte has changed (swapped in by swapoff) before looking at
whether read_swap_cache returned a NULL pointer. (The shm.c swap-cache code
also checks whether the pte has changed before going oom.)

But probably the right thing to do is to grab the mm semaphore in swapoff
as you did, since we don't risk a deadlock there :).

Comments?

Andrea
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
From: Kanoj Sarcar @ 1999-06-28 17:11 UTC
To: Andrea Arcangeli; +Cc: torvalds, sct, linux-mm

> Here is the alternate fix:
>
> Index: mm/swap_state.c
> ===================================================================
> RCS file: /var/cvs/linux/mm/swap_state.c,v
> retrieving revision 1.1.1.3
> diff -u -r1.1.1.3 swap_state.c
> --- mm/swap_state.c	1999/06/14 15:30:09	1.1.1.3
> +++ mm/swap_state.c	1999/06/28 10:15:15
> @@ -125,7 +125,7 @@
>  		"swap_duplicate: entry %08lx, offset exceeds max\n", entry);
>  	goto out;
>  bad_unused:
> -	printk(KERN_ERR
> +	printk(KERN_WARNING
>  		"swap_duplicate at %8p: entry %08lx, unused page\n",
>  	       __builtin_return_address(0), entry);
>  	goto out;
> @@ -291,20 +291,15 @@
>  	       entry, wait ? ", wait" : "");
>  #endif
>  	/*
> -	 * Make sure the swap entry is still in use.
> -	 */
> -	if (!swap_duplicate(entry))	/* Account for the swap cache */
> -		goto out;
> -	/*
>  	 * Look for the page in the swap cache.
>  	 */
>  	found_page = lookup_swap_cache(entry);
>  	if (found_page)
> -		goto out_free_swap;
> +		goto out;
>
>  	new_page_addr = __get_free_page(GFP_USER);
>  	if (!new_page_addr)
> -		goto out_free_swap;	/* Out of memory */
> +		goto out;	/* Out of memory */
>  	new_page = mem_map + MAP_NR(new_page_addr);
>
>  	/*
> @@ -313,6 +308,11 @@
>  	found_page = lookup_swap_cache(entry);
>  	if (found_page)
>  		goto out_free_page;
> +	/*
> +	 * Make sure the swap entry is still in use.
> +	 */
> +	if (!swap_duplicate(entry))	/* Account for the swap cache */
> +		goto out_free_page;
>  	/*
>  	 * Add it to the swap cache and read its contents.
>  	 */
> @@ -330,8 +330,6 @@
>
>  out_free_page:
>  	__free_page(new_page);
> -out_free_swap:
> -	swap_free(entry);
>  out:
>  	return found_page;
>  }
>
>
>
> NOTE: this will cause swap_duplicate to generate some warning messages, but

Or not, depending on whether the swap id has already been allocated
to a newly added swap device. In which case, the worst we will do
is read-ahead in some unneeded swap pages. Not too bad ...

I thought about this solution, and figured that a better idea is
probably to give up on read-ahead in swapin_readahead() if the
faulting pte had already been swapped in.

Anyway, it seems that to prevent fork/exit races, grabbing mmap_sem is
needed in the swapoff path. Best to use that synchronization, since
the fault path also grabs mmap_sem.

Thanks.

Kanoj

> everything will still work fine, precisely because the swapin code already
> checks whether the pte has changed (swapped in by swapoff) before looking at
> whether read_swap_cache returned a NULL pointer. (The shm.c swap-cache code
> also checks whether the pte has changed before going oom.)
>
> But probably the right thing to do is to grab the mm semaphore in swapoff
> as you did, since we don't risk a deadlock there :).
>
> Comments?
>
> Andrea
>
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28  1:48 ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Kanoj Sarcar
  1999-06-28 10:35   ` Andrea Arcangeli
@ 1999-06-28 16:32   ` Stephen C. Tweedie
  1999-06-28 17:25     ` Kanoj Sarcar
  1999-06-28 19:39   ` Chuck Lever
  2 siblings, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-28 16:32 UTC
  To: Kanoj Sarcar; +Cc: Andrea Arcangeli, torvalds, sct, linux-mm

Hi,

On Sun, 27 Jun 1999 18:48:47 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> Linus/Andrea/Stephen,
> This is the patch that tries to cure the swapoff races with processes
> forking, exiting, and (readahead) swapping by faulting.

> Basically, all these operations are synchronized by the process
> mmap_sem. Unfortunately, swapoff has to visit all processes, during
> which it must hold tasklist_lock, a spinlock. Hence, it can not take
> the mmap_sem, a sleeping mutex.

But it can atomic_inc(&mm->count) to pin the mm, drop the task lock and
take the mm semaphore, and mmput() once it has finished.

> So, the patch links up all active mm's in a list that swapoff can
> visit

There shouldn't be need for a new data structure. A bit of extra work
in swapoff should be all that is needed, and that avoids adding any
extra code at all on the hot paths.

Adding extra locks is the sort of thing other unixes do to solve
problems like this: we don't want to fall into that trap on Linux. :)

--Stephen
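[Editorial sketch, not part of the original message.] The ordering
Stephen describes (pin the mm, drop the task lock, take the mm
semaphore, unpin) can be modelled single-threaded in userspace as
below. Everything here is an assumption made for illustration: the
`model_*` types and helpers are hypothetical, the "locks" are plain
flags that only assert correct nesting, and the forced swap-in is
reduced to clearing a counter. It shows the order of operations, not a
working kernel change.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins; none of these are the kernel's real types. */
struct model_lock { int held; };

struct model_mm {
    atomic_int count;               /* reference count, like mm->count */
    struct model_lock mmap_sem;     /* a sleeping lock in the real kernel */
    int swapped_pages;              /* pages still out on the swap device */
};

static struct model_lock tasklist_lock;     /* a spinlock in the real kernel */

static void model_acquire(struct model_lock *l) { assert(!l->held); l->held = 1; }
static void model_release(struct model_lock *l) { assert(l->held);  l->held = 0; }

/* Drop one reference; returns 1 if that was the last one. */
static int model_mmput(struct model_mm *mm)
{
    return atomic_fetch_sub(&mm->count, 1) == 1;
}

/* Called with tasklist_lock held; returns with it held again. */
static void swapoff_visit_mm(struct model_mm *mm)
{
    atomic_fetch_add(&mm->count, 1);    /* pin: mm cannot be freed under us */
    model_release(&tasklist_lock);      /* from here on, sleeping is legal */

    model_acquire(&mm->mmap_sem);       /* excludes faults on this mm */
    mm->swapped_pages = 0;              /* stands in for the forced swap-in */
    model_release(&mm->mmap_sem);

    model_mmput(mm);                    /* unpin */
    model_acquire(&tasklist_lock);      /* caller must re-find its position */
}
```

The last comment is the catch raised in the follow-ups: once the task
lock has been dropped, the walker's position in the task list can no
longer be trusted.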
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 16:32 ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie
@ 1999-06-28 17:25   ` Kanoj Sarcar
  1999-06-28 20:40     ` Stephen C. Tweedie
  0 siblings, 1 reply; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 17:25 UTC
  To: Stephen C. Tweedie; +Cc: andrea, torvalds, linux-mm

>
> Hi,
>
> On Sun, 27 Jun 1999 18:48:47 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
>
> > Linus/Andrea/Stephen,
> > This is the patch that tries to cure the swapoff races with processes
> > forking, exiting, and (readahead) swapping by faulting.
>
> > Basically, all these operations are synchronized by the process
> > mmap_sem. Unfortunately, swapoff has to visit all processes, during
> > which it must hold tasklist_lock, a spinlock. Hence, it can not take
> > the mmap_sem, a sleeping mutex.
>
> But it can atomic_inc(&mm->count) to pin the mm, drop the task lock and
> take the mm semaphore, and mmput() once it has finished.
>

Hmm, hadn't thought about that one. Of course, as soon as you drop
the task_lock, in theory, you have to resume your search from the
beginning of the task list, since the list might have changed while
you dropped the task_lock (assume for a moment that the vm code does
not know how the task list is managed). That prevents any forward
progress by swapoff.

I did think of other ways to maintain a hold on the process, preventing
it from forking or exiting, but my judgement was they were going to
be more heavyweight than my current solution.

> > So, the patch links up all active mm's in a list that swapoff can
> > visit
>
> There shouldn't be need for a new data structure. A bit of extra work
> in swapoff should be all that is needed, and that avoids adding any
> extra code at all on the hot paths.
>
> Adding extra locks is the sort of thing other unixes do to solve
> problems like this: we don't want to fall into that trap on Linux. :)
>

Agreed ... if you can come up with a reasonably simple and lightweight
solution without using locks.

Thanks.

Kanoj
kanoj@engr.sgi.com

> --Stephen
>
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 17:25 ` Kanoj Sarcar
@ 1999-06-28 20:40   ` Stephen C. Tweedie
  1999-06-28 21:11     ` Kanoj Sarcar
  0 siblings, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-28 20:40 UTC
  To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, andrea, torvalds, linux-mm

Hi,

On Mon, 28 Jun 1999 10:25:45 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

>> But it can atomic_inc(&mm->count) to pin the mm, drop the task lock and
>> take the mm semaphore, and mmput() once it has finished.

> Hmm, hadn't thought about that one. Of course, as soon as you drop
> the task_lock, in theory, you have to resume your search from the
> beginning of the task list, since the list might have changed while
> you dropped the task_lock (assume for a moment that the vm code does
> not know how the task list is managed). That prevents any forward
> progress by swapoff.

Then keep a fencepost of the highest pid you have completed so far,
and with the lock held, look for the lowest pid greater than that
one. If you don't make any progress on the mm, bump up the fencepost
pid by one.

It will work. It's a little extra overhead, but it confines all of
the cost to the swapoff path. The pid scan isn't going to be nearly
as expensive as the rest of the vm scanning we are already forced to
do in swapoff.

--Stephen
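[Editorial sketch, not part of the original message.] A minimal
userspace model of the fencepost idea, under loud assumptions: the flat
`model_task` table, both function names and the "clear a counter"
stand-in for swap-in are hypothetical, and pids are assumed to be
non-negative. Advancing the fence to the pid just handled also covers
the "bump the fencepost if no progress was made" case, since that pid
is never looked at again within the sweep.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flat task table standing in for the kernel's task list. */
struct model_task { int pid; int swapped_pages; };

/*
 * Lowest pid strictly greater than `fence`, or -1 if there is none.
 * In the kernel this walk would run with the task list lock held.
 */
static int next_pid_after(const struct model_task *tasks, size_t n, int fence)
{
    int best = -1;

    for (size_t i = 0; i < n; i++)
        if (tasks[i].pid > fence && (best < 0 || tasks[i].pid < best))
            best = tasks[i].pid;
    return best;
}

/* One sweep over pid space; returns how many distinct pids were handled. */
static int swapoff_sweep(struct model_task *tasks, size_t n)
{
    int fence = -1, visited = 0, pid;

    while ((pid = next_pid_after(tasks, n, fence)) >= 0) {
        /* The real code would drop the lock here; the table may change. */
        for (size_t i = 0; i < n; i++)
            if (tasks[i].pid == pid)
                tasks[i].swapped_pages = 0;   /* forced swap-in */
        assert(pid > fence);    /* the fence only ever moves forward */
        fence = pid;
        visited++;
    }
    return visited;
}
```

A single sweep can still miss a task created with a pid at or below the
fence, which is the objection raised in the reply that follows.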
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 20:40 ` Stephen C. Tweedie
@ 1999-06-28 21:11   ` Kanoj Sarcar
  1999-06-28 22:12     ` Stephen C. Tweedie
  0 siblings, 1 reply; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 21:11 UTC
  To: Stephen C. Tweedie; +Cc: andrea, torvalds, linux-mm

>
> Hi,
>
> On Mon, 28 Jun 1999 10:25:45 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
>
> >> But it can atomic_inc(&mm->count) to pin the mm, drop the task lock and
> >> take the mm semaphore, and mmput() once it has finished.
>
> > Hmm, hadn't thought about that one. Of course, as soon as you drop
> > the task_lock, in theory, you have to resume your search from the
> > beginning of the task list, since the list might have changed while
> > you dropped the task_lock (assume for a moment that the vm code does
> > not know how the task list is managed). That prevents any forward
> > progress by swapoff.
>
> Then keep a fencepost of the highest pid you have completed so far,
> and with the lock held, look for the lowest pid greater than that
> one. If you don't make any progress on the mm, bump up the fencepost
> pid by one.

If I understand right, here is an example. Let's say I believe I
have scanned up to pid 10. You are suggesting, after having scanned
pid 10, hold on to task_lock, and look for the min pid > 10. Say
that is pid 12. Problem is, while I was scanning pid 10, maybe
pid 5 got reallocated, and pid 5 is a new process (probably a
child of pid 20).

Note that, as I mentioned, it is good design for the vm code not to
assume how the task list is managed or pids allocated (yes, I have
thought of having a swapoff generation number stored in each task
structure too ...)

>
> It will work. It's a little extra overhead, but it confines all of
> the cost to the swapoff path. The pid scan isn't going to be nearly
> as expensive as the rest of the vm scanning we are already forced to
> do in swapoff.

I would love to confine the complexity in the swapoff path, except
I can't come up with a solution. In any case, I think I was not
clear about what the cost is in my fix. It is adding 2 chain fields
in the mm structure, adding and deleting to this chain at mm alloc/free
time, and the up/down cost on the mutex. Note that the up/down cost
is minimal (one atomic inc/dec) when no swapoff is going on, since
the kernel_lock also protects the chain. The mutex only becomes
contended when there is a swapoff in progress.

Thanks.

Kanoj
kanoj@engr.sgi.com

Ps - All this discussion does not seem to be making it on to the
linux-mm web page ...

>
> --Stephen
>
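[Editorial sketch, not part of the original message.] The "2 chain
fields" cost Kanoj describes amounts to ordinary doubly linked list
maintenance at mm alloc/free time. The sketch below is an assumption
about its shape, not the actual kanoj-mm12 patch: `struct model_mm` is
hypothetical, the insert_mmlist/delete_mmlist names are borrowed from
the patch description quoted elsewhere in the thread, and the mutex
that would guard the list is deliberately omitted.

```c
#include <assert.h>
#include <stddef.h>

struct model_mm {
    struct model_mm *next, *prev;   /* the two chain fields */
    int id;
};

/* Circular list head; in the proposal a sleeping mutex would guard it. */
static struct model_mm mmlist = { &mmlist, &mmlist, -1 };

/* Link a new mm in right after the head (done at mm allocation time). */
static void insert_mmlist(struct model_mm *mm)
{
    mm->next = mmlist.next;
    mm->prev = &mmlist;
    mmlist.next->prev = mm;
    mmlist.next = mm;
}

/* Unlink an mm (done when the mm is freed). */
static void delete_mmlist(struct model_mm *mm)
{
    assert(mm != &mmlist);          /* the head itself is never removed */
    mm->prev->next = mm->next;
    mm->next->prev = mm->prev;
    mm->next = mm->prev = mm;
}

/* What swapoff would do with the list held stable: visit every mm. */
static int count_mmlist(void)
{
    int n = 0;

    for (struct model_mm *p = mmlist.next; p != &mmlist; p = p->next)
        n++;
    return n;
}
```

The pointer updates themselves are cheap; the disagreement in the
thread is about the extra lock traffic on every fork and exit.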
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 21:11 ` Kanoj Sarcar
@ 1999-06-28 22:12   ` Stephen C. Tweedie
  1999-06-28 23:43     ` Kanoj Sarcar
  0 siblings, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-28 22:12 UTC
  To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, andrea, torvalds, linux-mm

Hi,

On Mon, 28 Jun 1999 14:11:18 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> If I understand right, here is an example. Let's say I believe I
> have scanned up to pid 10. You are suggesting, after having scanned
> pid 10, hold on to task_lock, and look for the min pid > 10. Say
> that is pid 12. Problem is, while I was scanning pid 10, maybe
> pid 5 got reallocated, and pid 5 is a new process (probably a
> child of pid 20).

Fine --- repeat the whole thing until we have no swap entries left. We
can still guarantee to make progress without extra locking for normal
swapping.

>> It will work. It's a little extra overhead, but it confines all of
>> the cost to the swapoff path. The pid scan isn't going to be nearly
>> as expensive as the rest of the vm scanning we are already forced to
>> do in swapoff.

> I would love to confine the complexity in the swapoff path, except
> I can't come up with a solution. In any case, I think I was not
> clear about what the cost is in my fix. It is adding 2 chain fields
> in the mm structure, adding and deleting to this chain at mm alloc/free
> time, and the up/down cost on the mutex.

But it's not necessary. Other OSes may add a lock here, a lock there
every time it happens to make a non-performance-critical path easier,
but in the long term that sort of thinking just bloats the fast paths.

> Note that the up/down cost is minimal (one atomic inc/dec) when no
> swapoff is going on

On SMP, the cache traffic produced by such locks is not minimal. You
can measure the performance hit of every single cache miss that results.

--Stephen
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 22:12 ` Stephen C. Tweedie
@ 1999-06-28 23:43   ` Kanoj Sarcar
  1999-06-29 11:44     ` Stephen C. Tweedie
  0 siblings, 1 reply; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 23:43 UTC
  To: Stephen C. Tweedie; +Cc: andrea, torvalds, linux-mm

>
> Hi,
>
> On Mon, 28 Jun 1999 14:11:18 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
>
> > If I understand right, here is an example. Let's say I believe I
> > have scanned up to pid 10. You are suggesting, after having scanned
> > pid 10, hold on to task_lock, and look for the min pid > 10. Say
> > that is pid 12. Problem is, while I was scanning pid 10, maybe
> > pid 5 got reallocated, and pid 5 is a new process (probably a
> > child of pid 20).
>
> Fine --- repeat the whole thing until we have no swap entries left. We
> can still guarantee to make progress without extra locking for normal
> swapping.
>

This will almost always work, except theoretically, you still can
not guarantee forward progress, unless you can stop forks() from
happening. That is, given a high enough rate of forking, swapoff
is never going to terminate.

Kanoj
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 23:43 ` Kanoj Sarcar
@ 1999-06-29 11:44   ` Stephen C. Tweedie
  1999-06-29 22:01     ` Kanoj Sarcar
  0 siblings, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-29 11:44 UTC
  To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, andrea, torvalds, linux-mm

Hi,

On Mon, 28 Jun 1999 16:43:59 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> This will almost always work, except theoretically, you still can
> not guarantee forward progress, unless you can stop forks() from
> happening. That is, given a high enough rate of forking, swapoff
> is never going to terminate.

Then repeat until it converges, ie. until you have no swap entries left.
No big deal. Unless the swapoff sweep and the fork are running over pid
space at exactly the same rate forever (which we do not have to worry
about!), you will make progress.

--Stephen
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-29 11:44 ` Stephen C. Tweedie
@ 1999-06-29 22:01   ` Kanoj Sarcar
  1999-06-30 17:28     ` Stephen C. Tweedie
  0 siblings, 1 reply; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-29 22:01 UTC
  To: Stephen C. Tweedie; +Cc: andrea, torvalds, linux-mm

>
> Hi,
>
> On Mon, 28 Jun 1999 16:43:59 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
>
> > This will almost always work, except theoretically, you still can
> > not guarantee forward progress, unless you can stop forks() from
> > happening. That is, given a high enough rate of forking, swapoff
> > is never going to terminate.
>
> Then repeat until it converges, ie. until you have no swap entries left.
> No big deal. Unless the swapoff sweep and the fork are running over pid
> space at exactly the same rate forever (which we do not have to worry
> about!), you will make progress.
>

Stephen,

Seeing that both of us devoted so much time to discussing this, I
felt compelled to look at what is involved in doing what you are
suggesting.

To know whether there are any more references left to be eliminated
on a swap page, we can not tolerate a SWAP_MAP_MAX concept; else
we can never determine whether there are processes still referencing
the swap page. Removing SWAP_MAP_MAX is a good thing in itself. The
swap_map[] array needs to be declared as an array of elements of the
same size as the page->count field, ie an atomic_t (since there can be
no more references to the swap page than there can be on the physical
page).

Also, I am not sure why you say that fork can not keep ahead of
the swapoff sweep forever. Are you saying it is okay not to guarantee
forward progress of swapoff while a program that keeps on forking
(and the children exit almost immediately) is running?

Then there's the complexity of clone(CLONE_PID), which creates
task structures with the same pid, so the pid fencepost algorithm
would need to handle that too ...

Let me know what you think of these two issues, then I can try
to create a patch that does this ...

Kanoj
kanoj@engr.sgi.com
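[Editorial sketch, not part of the original message.] Why a saturating
SWAP_MAP_MAX gets in the way of "repeat until no swap entries are
left" can be shown with a toy counter. This is an assumption about the
general saturating scheme, not a copy of the kernel's swap_duplicate()
and swap_free(): the `model_*` names are hypothetical and the maximum
is set to a deliberately tiny value so that saturation is easy to hit.

```c
#include <assert.h>

#define MODEL_SWAP_MAP_MAX 3    /* hypothetical; tiny so it saturates fast */

/* Take one more reference on a swap_map[] slot, saturating at the max. */
static void model_swap_duplicate(unsigned short *slot)
{
    if (*slot < MODEL_SWAP_MAP_MAX)
        (*slot)++;
    /* else: pinned at the maximum, the extra reference is not recorded */
}

/* Drop one reference; returns 1 when the slot becomes free. */
static int model_swap_free(unsigned short *slot)
{
    assert(*slot > 0);          /* freeing an unused slot would be a bug */
    if (*slot < MODEL_SWAP_MAP_MAX) {
        (*slot)--;
        return *slot == 0;
    }
    return 0;   /* saturated: the true count is unknown, never decrement */
}
```

Once a slot reaches the maximum it stays there no matter how many
references are later dropped, so a sweep that waits for every count to
reach zero would never see that slot become free.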
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-29 22:01 ` Kanoj Sarcar
@ 1999-06-30 17:28   ` Stephen C. Tweedie
  1999-06-30 18:05     ` Kanoj Sarcar
  0 siblings, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-30 17:28 UTC
  To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, andrea, torvalds, linux-mm

Hi,

On Tue, 29 Jun 1999 15:01:24 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> To know whether there are any more references left to be eliminated
> on a swap page, we can not tolerate a SWAP_MAP_MAX concept; else we
> can never determine whether there are processes still referencing the
> swap page. Removing SWAP_MAP_MAX is a good thing in itself. The
> swap_map[] array needs to be declared as an array of elements of the
> same size as the page->count field, ie an atomic_t (since there can be
> no more references to the swap page than there can be on the physical
> page).

Yes there can...

> Also, I am not sure why you say that fork can not keep ahead of
> the swapoff sweep forever.

Hmm, maybe..

> Are you saying it is okay not to guarantee forward progress of swapoff
> while a program that keeps on forking (and the children exit almost
> immediately) is running?

There are a lot of things which don't make forward progress in such a
situation already. Put a lock on dup_mm() if it worries you that much.

> Then there's the complexity of clone(CLONE_PID), which creates task
> structures with the same pid, so the pid fencepost algorithm would
> need to handle that too ...

Sure. I never said that I had a complete solution: I just don't believe
that a new mm lock on all the faulting paths is necessary for a complete
solution.

--Stephen
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-30 17:28 ` Stephen C. Tweedie
@ 1999-06-30 18:05   ` Kanoj Sarcar
  0 siblings, 0 replies; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-30 18:05 UTC
  To: Stephen C. Tweedie; +Cc: andrea, torvalds, linux-mm

>
> Hi,
>
> On Tue, 29 Jun 1999 15:01:24 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
>
> > To know whether there are any more references left to be eliminated
> > on a swap page, we can not tolerate a SWAP_MAP_MAX concept; else we
> > can never determine whether there are processes still referencing the
> > swap page. Removing SWAP_MAP_MAX is a good thing in itself. The
> > swap_map[] array needs to be declared as an array of elements of the
> > same size as the page->count field, ie an atomic_t (since there can be
> > no more references to the swap page than there can be on the physical
> > page).
>
> Yes there can...

I don't know how, but if this is true, and we do not have a theoretical
upper bound on the swap_count, then we will have to preserve
SWAP_MAP_MAX ... which will render your proposal unachievable ...

>
> > Also, I am not sure why you say that fork can not keep ahead of
> > the swapoff sweep forever.
>
> Hmm, maybe..
>
> > Are you saying it is okay not to guarantee forward progress of swapoff
> > while a program that keeps on forking (and the children exit almost
> > immediately) is running?
>
> There are a lot of things which don't make forward progress in such a
> situation already. Put a lock on dup_mm() if it worries you that much.

That's basically what my solution does ... adds in a lock point in
copy_mm.

>
> > Then there's the complexity of clone(CLONE_PID), which creates task
> > structures with the same pid, so the pid fencepost algorithm would
> > need to handle that too ...
>
> Sure. I never said that I had a complete solution: I just don't believe
> that a new mm lock on all the faulting paths is necessary for a complete
> solution.

Hmmm, did you look at my solution in detail ... no locks are taken on
the page fault paths, other than mmap_sem, which the current code
already takes ...

Thanks.

Kanoj
kanoj@engr.sgi.com

>
> --Stephen
>
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28  1:48 ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Kanoj Sarcar
  1999-06-28 10:35   ` Andrea Arcangeli
  1999-06-28 16:32   ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie
@ 1999-06-28 19:39   ` Chuck Lever
  1999-06-28 19:55     ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar
  1999-06-28 20:45     ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie
  2 siblings, 2 replies; 60+ messages in thread
From: Chuck Lever @ 1999-06-28 19:39 UTC
  To: Kanoj Sarcar; +Cc: Andrea Arcangeli, torvalds, sct, linux-mm

On Sun, 27 Jun 1999, Kanoj Sarcar wrote:
> Basically, all these operations are synchronized by the process
> mmap_sem. Unfortunately, swapoff has to visit all processes, during
> which it must hold tasklist_lock, a spinlock. Hence, it can not take
> the mmap_sem, a sleeping mutex. So, the patch links up all active
> mm's in a list that swapoff can visit (with minor restructuring,
> kswapd can also use this, although it can not hold mmap_sem).
> Addition/deletions to the list are protected by a sleeping
> mutex, hence swapoff can grab the individual mmap_sems, while
> preventing changes to the list. Effectively, process creation
> and destruction are locked out if swapoff is running.
>
> To do this, the lock ordering is mm_sem -> mmap_sem. To
> prevent deadlocks, care must be taken that a process invoking
> delete/insert_mmlist does not have its own mmap_sem held. For
> this, the do_fork path needs to change so as not to acquire
> mmap_sem early, rather only when it is really needed. This does
> not open up a resource-ordering problem between kernel_lock and
> mmap_sem, since the kernel_lock is a monitor lock that is released
> at schedule time, so no deadlocks are possible.

i'm already working on a patch that will allow kswapd to grab the mmap_sem
for the task that is about to be swapped. this takes a slightly different
approach, since i'm focusing on kswapd and not on swapoff. essentially
the patch does two things:

1) it separates the logic of try_to_free_pages() and kswapd. kswapd now
does the swapping, while try_to_free_pages() only does the shrink_mmap()
phase.

2) after kswapd has chosen a process to swap, it drops the kernel lock
and grabs the mmap_sem for the thing it's about to swap. it picks up the
kernel lock at appropriate points lower in the code.

i think it simplifies things a lot; there is no longer a concern about a
process deadlocking when re-acquiring its own semaphore. and, swapping
and page-fault handling for a given object can be serialized via the
object's mmap_sem.

        - Chuck Lever
--
corporate:      <chuckl@netscape.com>
personal:       <chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
        http://www.citi.umich.edu/projects/linux-scalability/
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 19:39 ` Chuck Lever
@ 1999-06-28 19:55   ` Kanoj Sarcar
  1999-06-28 20:33     ` Chuck Lever
  1999-06-28 22:09     ` Stephen C. Tweedie
  1999-06-28 20:45   ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie
  1 sibling, 2 replies; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 19:55 UTC
  To: Chuck Lever; +Cc: andrea, torvalds, sct, linux-mm

>
> i'm already working on a patch that will allow kswapd to grab the mmap_sem
> for the task that is about to be swapped. this takes a slightly different
> approach, since i'm focusing on kswapd and not on swapoff. essentially
> the patch does two things:

So, I would think some (if not mine) swapoff fix is still needed ...

>
> 1) it separates the logic of try_to_free_pages() and kswapd. kswapd now
> does the swapping, while try_to_free_pages() only does the shrink_mmap()
> phase.
>
> 2) after kswapd has chosen a process to swap, it drops the kernel lock
> and grabs the mmap_sem for the thing it's about to swap. it picks up the
> kernel lock at appropriate points lower in the code.
>

Agreed this would be a nice thing to be able to do ...

Other than the deadlock problem, there's another issue involved, I
think. Processes can go to sleep (inside drivers/fs for example while
mmaping/munmaping/faulting) holding their mmap_sem, so any solution
should be able to guarantee that (at least one of) the memory free'ers
do not go to sleep indefinitely (or for some time that is up to driver/fs
code to determine).

Kanoj
kanoj@engr.sgi.com
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 19:55 ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar
@ 1999-06-28 20:33   ` Chuck Lever
  1999-06-28 20:51     ` Kanoj Sarcar
  1999-06-28 22:09   ` Stephen C. Tweedie
  1 sibling, 1 reply; 60+ messages in thread
From: Chuck Lever @ 1999-06-28 20:33 UTC
  To: Kanoj Sarcar; +Cc: andrea, torvalds, sct, linux-mm

On Mon, 28 Jun 1999, Kanoj Sarcar wrote:
> > i'm already working on a patch that will allow kswapd to grab the mmap_sem
> > for the task that is about to be swapped. this takes a slightly different
> > approach, since i'm focusing on kswapd and not on swapoff. essentially
> > the patch does two things:
>
> So, I would think some (if not mine) swapoff fix is still needed ...

oh absolutely! i was thinking that my patch might help make your work
simpler, that's all. once i've tested it a little more, i'll post it to
the list.

> Other than the deadlock problem, there's another issue involved, I
> think. Processes can go to sleep (inside drivers/fs for example while
> mmaping/munmaping/faulting) holding their mmap_sem, so any solution
> should be able to guarantee that (at least one of) the memory free'ers
> do not go to sleep indefinitely (or for some time that is up to driver/fs
> code to determine).

or perhaps the kernel could start more than one kswapd (one per swap
partition?). with my patch, regular processes never wait for swap out
I/O, only kswapd does.

if you're concerned about bounding the latency of VM operations in order
to provide some RT guarantees, then i'd imagine, based on what i've read
on this list, that Linus might want to keep things simple more than he'd
want to clutter the memory freeing logic... but if there's a simple way to
"guarantee" a low latency then it would be worth the trouble.

        - Chuck Lever
--
corporate:      <chuckl@netscape.com>
personal:       <chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
        http://www.citi.umich.edu/projects/linux-scalability/
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 20:33 ` Chuck Lever
@ 1999-06-28 20:51   ` Kanoj Sarcar
  1999-06-28 21:32     ` Chuck Lever
  1999-06-28 22:08     ` Stephen C. Tweedie
  0 siblings, 2 replies; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 20:51 UTC
  To: Chuck Lever; +Cc: andrea, torvalds, sct, linux-mm

>
> > Other than the deadlock problem, there's another issue involved, I
> > think. Processes can go to sleep (inside drivers/fs for example while
> > mmaping/munmaping/faulting) holding their mmap_sem, so any solution
> > should be able to guarantee that (at least one of) the memory free'ers
> > do not go to sleep indefinitely (or for some time that is up to driver/fs
> > code to determine).
>
> or perhaps the kernel could start more than one kswapd (one per swap
> partition?). with my patch, regular processes never wait for swap out
> I/O, only kswapd does.
>
> if you're concerned about bounding the latency of VM operations in order
> to provide some RT guarantees, then i'd imagine, based on what i've read
> on this list, that Linus might want to keep things simple more than he'd
> want to clutter the memory freeing logic... but if there's a simple way to
> "guarantee" a low latency then it would be worth the trouble.

Oh no, I was not talking about exotic stuff like RT ... I was
simply pointing out that to prevent deadlocks, and guarantee forward
progress, you have to show that despite what underlying fs/driver
code does, at least one memory freer is free to do its job. Else,
under low memory conditions, no memory freer can free up memory, so
the system is effectively hung. If you have to wait for mmap_sem,
you can not easily do that (unless you are willing to do a trylock
for mmap_sem, ie give up on a process and continue scanning for others).
This is partly why after thinking about it, I did not attempt to do
this myself.

Note that while Stephen's 2.2 kpiod work was probably aimed at fixing
fs deadlocks, I think it also gave the nice property that the chances
that the "swapout" method goes to sleep were reduced. Not to 0, since
make_pio_request() itself requests memory ... Things are probably much
better in 2.3, I am not up to date with .7 and .8.

Kanoj
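[Editorial sketch, not part of the original message.] The trylock
variant mentioned in passing ("give up on a process and continue
scanning for others") could look roughly like this in a userspace
model. All of it is assumption: `struct model_mm`, scan_with_trylock()
and the use of a C11 atomic_flag in place of the kernel's semaphore are
hypothetical, and the thread does not show anyone adopting this
approach.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct model_mm {
    atomic_flag mmap_sem;   /* set = held; stands in for the real semaphore */
    int freed;              /* passes that managed to work on this mm */
};

/* Returns the number of mms actually scanned; busy ones are skipped. */
static int scan_with_trylock(struct model_mm *mms, size_t n)
{
    int scanned = 0;

    assert(mms != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* test_and_set returns the old value: true means already held. */
        if (atomic_flag_test_and_set(&mms[i].mmap_sem))
            continue;                   /* do not sleep; try the next mm */
        mms[i].freed++;                 /* stands in for swapping pages out */
        atomic_flag_clear(&mms[i].mmap_sem);
        scanned++;
    }
    return scanned;
}
```

The scanner never blocks, so it cannot be stalled by a process asleep
in driver or filesystem code with its mmap_sem held, but it also gives
no guarantee that any particular mm is ever scanned.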
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 20:51 ` Kanoj Sarcar
@ 1999-06-28 21:32   ` Chuck Lever
  1999-06-28 21:38     ` Kanoj Sarcar
  1999-06-28 22:21     ` Stephen C. Tweedie
  1999-06-28 22:08   ` Stephen C. Tweedie
  1 sibling, 2 replies; 60+ messages in thread
From: Chuck Lever @ 1999-06-28 21:32 UTC
  To: Kanoj Sarcar; +Cc: andrea, torvalds, sct, linux-mm

On Mon, 28 Jun 1999, Kanoj Sarcar wrote:
> > or perhaps the kernel could start more than one kswapd (one per swap
> > partition?). with my patch, regular processes never wait for swap out
> > I/O, only kswapd does.
> >
> > if you're concerned about bounding the latency of VM operations in order
> > to provide some RT guarantees, then i'd imagine, based on what i've read
> > on this list, that Linus might want to keep things simple more than he'd
> > want to clutter the memory freeing logic... but if there's a simple way to
> > "guarantee" a low latency then it would be worth the trouble.
>
> Oh no, I was not talking about exotic stuff like RT ... I was
> simply pointing out that to prevent deadlocks, and guarantee forward
> progress, you have to show that despite what underlying fs/driver
> code does, at least one memory freer is free to do its job. Else,
> under low memory conditions, no memory freer can free up memory, so
> the system is effectively hung. If you have to wait for mmap_sem,
> you can not easily do that (unless you are willing to do a trylock
> for mmap_sem, ie give up on a process and continue scanning for others).
> This is partly why after thinking about it, I did not attempt to do
> this myself.

(i also tried down_trylock, but discarded it.)

well, except that kswapd itself doesn't free any memory. it simply copies
data from memory to disk. shrink_mmap() actually does the freeing, and
can do this with minimal locking, and from within regular application
processes. when a process calls shrink_mmap(), it will cause some pages
to be made available to GFP.

if you need evidence that shrink_mmap() will keep a system running without
swapping, just run 2.3.8 :) :)

come to think of it, i don't think there is a safety guarantee in this
mechanism to prevent a lock-up. i'll have to think more about it.

        - Chuck Lever
--
corporate:      <chuckl@netscape.com>
personal:       <chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
        http://www.citi.umich.edu/projects/linux-scalability/
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-28 21:32 ` Chuck Lever @ 1999-06-28 21:38 ` Kanoj Sarcar 1999-06-28 21:50 ` Chuck Lever 1999-06-28 22:22 ` Stephen C. Tweedie 1999-06-28 22:21 ` Stephen C. Tweedie 1 sibling, 2 replies; 60+ messages in thread From: Kanoj Sarcar @ 1999-06-28 21:38 UTC (permalink / raw) To: Chuck Lever; +Cc: andrea, torvalds, sct, linux-mm > > (i also tried down_trylock, but discarded it.) > > well, except that kswapd itself doesn't free any memory. it simply copies > data from memory to disk. shrink_mmap() actually does the freeing, and > can do this with minimal locking, and from within regular application > processes. when a process calls shrink_mmap(), it will cause some pages > to be made available to GFP. > The page is not really free for reallocation, unless kswapd can push out the contents to disk, right? Which means, kswapd should have as minimal sleep/memallocation points as possible ... Kanoj kanoj@engr.sgi.com > if you need evidence that shrink_mmap() will keep a system running without > swapping, just run 2.3.8 :) :) > > come to think of it, i don't think there is a safety guarantee in this > mechanism to prevent a lock-up. i'll have to think more about it. > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-28 21:38 ` Kanoj Sarcar @ 1999-06-28 21:50 ` Chuck Lever 1999-06-28 22:15 ` Kanoj Sarcar 1999-06-28 22:22 ` Stephen C. Tweedie 1 sibling, 1 reply; 60+ messages in thread From: Chuck Lever @ 1999-06-28 21:50 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: linux-mm On Mon, 28 Jun 1999, Kanoj Sarcar wrote: > > well, except that kswapd itself doesn't free any memory. it simply copies > > data from memory to disk. shrink_mmap() actually does the freeing, and > > can do this with minimal locking, and from within regular application > > processes. when a process calls shrink_mmap(), it will cause some pages > > to be made available to GFP. > > The page is not really free for reallocation, unless kswapd can > push out the contents to disk, right? Which means, kswapd should > have as minimal sleep/memallocation points as possible ... kswapd itself always uses a gfp_mask that includes GFP_IO, so nothing it calls will ever wait. the I/O it schedules is asynchronous, and when complete, the buffer exit code in end_buffer_io_async will set the page flags appropriately for shrink_mmap() to come by and steal it. also, the buffer code will use pre-allocated buffers if gfp fails. - Chuck Lever -- corporate: <chuckl@netscape.com> personal: <chucklever@netscape.net> or <cel@monkey.org> The Linux Scalability project: http://www.citi.umich.edu/projects/linux-scalability/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-28 21:50 ` Chuck Lever @ 1999-06-28 22:15 ` Kanoj Sarcar 1999-06-29 11:23 ` Stephen C. Tweedie 0 siblings, 1 reply; 60+ messages in thread From: Kanoj Sarcar @ 1999-06-28 22:15 UTC (permalink / raw) To: Chuck Lever; +Cc: linux-mm > > On Mon, 28 Jun 1999, Kanoj Sarcar wrote: > > > well, except that kswapd itself doesn't free any memory. it simply copies > > > data from memory to disk. shrink_mmap() actually does the freeing, and > > > can do this with minimal locking, and from within regular application > > > processes. when a process calls shrink_mmap(), it will cause some pages > > > to be made available to GFP. > > > > The page is not really free for reallocation, unless kswapd can > > push out the contents to disk, right? Which means, kswapd should > > have as minimal sleep/memallocation points as possible ... > > kswapd itself always uses a gfp_mask that includes GFP_IO, so nothing it > calls will ever wait. the I/O it schedules is asynchronous, and when > complete, the buffer exit code in end_buffer_io_async will set the page > flags appropriately for shrink_mmap() to come by and steal it. also, the > buffer code will use pre-allocated buffers if gfp fails. > Which is why you must guarantee that kswapd can always run, and keep as few blocking points as possible ... Kanoj -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-28 22:15 ` Kanoj Sarcar @ 1999-06-29 11:23 ` Stephen C. Tweedie 1999-06-29 17:36 ` Kanoj Sarcar 0 siblings, 1 reply; 60+ messages in thread From: Stephen C. Tweedie @ 1999-06-29 11:23 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Chuck Lever, linux-mm Hi, On Mon, 28 Jun 1999 15:15:29 -0700 (PDT), kanoj@google.engr.sgi.com (Kanoj Sarcar) said: >> kswapd itself always uses a gfp_mask that includes GFP_IO, so nothing it >> calls will ever wait. the I/O it schedules is asynchronous, and when >> complete, the buffer exit code in end_buffer_io_async will set the page >> flags appropriately for shrink_mmap() to come by and steal it. also, the >> buffer code will use pre-allocated buffers if gfp fails. >> > Which is why you must gurantee that kswapd can always run, and keep > as few blocking points as possible ... Look, we're just going round in circles here. kswapd *can* always run. kswapd never ever waits in its memory allocation calls. In get_free_pages(), we special case PF_MEMALLOC processes (such as kswapd) and completely avoid trying to free pages in that case: rather, we rely on the free page thresholds preserving a last-chance set of free pages which are _only_ usable by such processes. kswapd can wait for IO, but the block device layers go to great lengths to ensure that this can always proceed safely. If the device layers need an extra memory allocation to succeed, that again is protected by PF_MEMALLOC. kswapd never waits for long-term-held filesystem locks: that is what kpiod is for. This architecture is very robust. Add an extra mmap semaphore lock to the swapout path and you destroy it. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
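A minimal user-space sketch of the reserve mechanism Stephen describes above: ordinary allocators must stay above a free-page threshold, while a process flagged as a reclaimer (the PF_MEMALLOC analogue) may use the last-chance pages and is never asked to reclaim recursively. This is a toy model, not kernel code; `toy_zone`, `toy_alloc_page` and the field names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a free-page pool; not a kernel structure. */
struct toy_zone {
    int nr_free;      /* pages currently free */
    int reserve_min;  /* last-chance pages kept back for reclaimers */
};

/*
 * Try to take one page.  Ordinary callers must leave reserve_min pages
 * untouched.  A caller flagged as a reclaimer (standing in for a
 * PF_MEMALLOC process such as kswapd) may dip into the reserve, so it
 * never has to wait for someone else to free memory first.
 */
static bool toy_alloc_page(struct toy_zone *z, bool is_reclaimer)
{
    int floor = is_reclaimer ? 0 : z->reserve_min;

    assert(z->nr_free >= 0 && z->reserve_min >= 0);
    if (z->nr_free <= floor)
        return false;   /* an ordinary caller would now reclaim or wait */
    z->nr_free--;
    return true;
}
```

With 3 free pages and a reserve of 2, an ordinary caller gets exactly one page before being refused, while a reclaimer can still take the remaining two. That is the property the thread relies on: the freeing path keeps working even when everyone else is out of memory.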
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-29 11:23 ` Stephen C. Tweedie @ 1999-06-29 17:36 ` Kanoj Sarcar 0 siblings, 0 replies; 60+ messages in thread From: Kanoj Sarcar @ 1999-06-29 17:36 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: cel, linux-mm > > Hi, > > On Mon, 28 Jun 1999 15:15:29 -0700 (PDT), kanoj@google.engr.sgi.com > (Kanoj Sarcar) said: > > >> kswapd itself always uses a gfp_mask that includes GFP_IO, so nothing it > >> calls will ever wait. the I/O it schedules is asynchronous, and when > >> complete, the buffer exit code in end_buffer_io_async will set the page > >> flags appropriately for shrink_mmap() to come by and steal it. also, the > >> buffer code will use pre-allocated buffers if gfp fails. > >> > > > Which is why you must gurantee that kswapd can always run, and keep > > as few blocking points as possible ... > > Look, we're just going round in circles here. > > kswapd *can* always run. > Not if you are going to try grabbing mmap_sem in that path ... Anyway, I guess we have established that is a bad idea ... Kanoj -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-28 21:38 ` Kanoj Sarcar 1999-06-28 21:50 ` Chuck Lever @ 1999-06-28 22:22 ` Stephen C. Tweedie 1 sibling, 0 replies; 60+ messages in thread From: Stephen C. Tweedie @ 1999-06-28 22:22 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Chuck Lever, andrea, torvalds, sct, linux-mm Hi, On Mon, 28 Jun 1999 14:38:43 -0700 (PDT), kanoj@google.engr.sgi.com (Kanoj Sarcar) said: > The page is not really free for reallocation, unless kswapd can > push out the contents to disk, right? Which means, kswapd should > have as minimal sleep/memallocation points as possible ... The kswapd process is marked with the PF_MEMALLOC process flag, so any recursive memory allocations it attempts get satisfied without IO being invoked. kswapd does not sleep during memory allocation. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-28 21:32 ` Chuck Lever 1999-06-28 21:38 ` Kanoj Sarcar @ 1999-06-28 22:21 ` Stephen C. Tweedie 1999-06-28 22:57 ` Andrea Arcangeli 1999-06-29 1:00 ` Chuck Lever 1 sibling, 2 replies; 60+ messages in thread From: Stephen C. Tweedie @ 1999-06-28 22:21 UTC (permalink / raw) To: Chuck Lever; +Cc: Kanoj Sarcar, andrea, torvalds, sct, linux-mm Hi, On Mon, 28 Jun 1999 17:32:05 -0400 (EDT), Chuck Lever <cel@monkey.org> said: > well, except that kswapd itself doesn't free any memory. It has to. That was why kswapd was initially written, to ensure that interrupt memory requests (eg. busy router boxes) don't starve of memory. All of the benefits of kswapd came later. In normal kernels the try_to_swap_out doesn't free memory, true enough, but kswapd calls shrink_mmap() too to make sure it does make real progress in freeing memory. > if you need evidence that shrink_mmap() will keep a system running without > swapping, just run 2.3.8 :) :) 2.3.8 shows up slower on several benchmarks because of its reluctance to swap. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-28 22:21 ` Stephen C. Tweedie @ 1999-06-28 22:57 ` Andrea Arcangeli 1999-06-29 2:13 ` Chuck Lever 1999-06-29 1:00 ` Chuck Lever 1 sibling, 1 reply; 60+ messages in thread From: Andrea Arcangeli @ 1999-06-28 22:57 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Chuck Lever, Kanoj Sarcar, torvalds, linux-mm On Mon, 28 Jun 1999, Stephen C. Tweedie wrote: >> if you need evidence that shrink_mmap() will keep a system running without >> swapping, just run 2.3.8 :) :) > >2.3.8 shows up slower on several benchmarks because of its reluctance to >swap. The point here is whether you are swapping to your ramdisk or to my HD :). On my HD (system and swap all on the same IDE disk) you must _avoid_ swapping at all costs if you care about performance. And btw, with the clock algorithm nobody can ever be sure of getting a good swap/cache balance. With the page-LRU code I have almost ready for 2.3.x (definitely stable for 2.2.x), we'll instead be sure to swap out only when there isn't plenty of recyclable cache. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-28 22:57 ` Andrea Arcangeli @ 1999-06-29 2:13 ` Chuck Lever 1999-06-29 12:01 ` Stephen C. Tweedie 0 siblings, 1 reply; 60+ messages in thread From: Chuck Lever @ 1999-06-29 2:13 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Stephen C. Tweedie, Kanoj Sarcar, torvalds, linux-mm On Tue, 29 Jun 1999, Andrea Arcangeli wrote: > On Mon, 28 Jun 1999, Stephen C. Tweedie wrote: > > >> if you need evidence that shrink_mmap() will keep a system running without > >> swapping, just run 2.3.8 :) :) > > > >2.3.8 shows up slower on several benchmarks because of its reluctance to > >swap. > > Here the point is if you are swapping over your ramdisk or over my HD :). > Over my HD (system+swap all in the same IDE disk) you must _avoid_ to swap > at all costs if you care about performances. i'm not so sure about that. swapping out, if efficiently done, is a series of asynchronous sequential writes. the only performance that will interfere with is heavily I/O-bound applications. even so, if it gets more pages out of an application's way, then shrink_mmap will be less destructive to your working set, which is a *good* thing, and your caches will perform better. at least, that's the way i've seen it with the workloads i've been playing with. so, i believe that swapping (paging) is my friend, up to a point. - Chuck Lever -- corporate: <chuckl@netscape.com> personal: <chucklever@netscape.net> or <cel@monkey.org> The Linux Scalability project: http://www.citi.umich.edu/projects/linux-scalability/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-29 2:13 ` Chuck Lever @ 1999-06-29 12:01 ` Stephen C. Tweedie 1999-06-29 12:32 ` Andrea Arcangeli 0 siblings, 1 reply; 60+ messages in thread From: Stephen C. Tweedie @ 1999-06-29 12:01 UTC (permalink / raw) To: Chuck Lever; +Cc: Andrea Arcangeli, Stephen C. Tweedie, Kanoj Sarcar, linux-mm Hi, On Mon, 28 Jun 1999 22:13:15 -0400 (EDT), Chuck Lever <cel@monkey.org> said: > On Tue, 29 Jun 1999, Andrea Arcangeli wrote: >> >> Here the point is if you are swapping over your ramdisk or over my HD :). >> Over my HD (system+swap all in the same IDE disk) you must _avoid_ to swap >> at all costs if you care about performances. > i'm not so sure about that. swapping out, if efficiently done, is a > series of asynchronous sequential writes. the only performance that will > interfere with is heavily I/O-bound applications. even so, if it gets > more pages out of an application's way, then shrink_mmap will be less > destructive to your working set, which is a *good* thing, and your caches > will perform better. Absolutely. The important thing is to do enough swapping to make sure that unused data is not kicking around in memory. Maybe you don't want the swapper to be active during your kernel compile, but if you have less than a GB of physical memory then you probably want it to at least think about swapping unused stuff out as the compilation starts. If you defer swapping too much, you just end up doing more paging IO since you can fit less of your working set into cache. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-29 12:01 ` Stephen C. Tweedie @ 1999-06-29 12:32 ` Andrea Arcangeli 1999-06-30 15:59 ` Stephen C. Tweedie 0 siblings, 1 reply; 60+ messages in thread From: Andrea Arcangeli @ 1999-06-29 12:32 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Chuck Lever, Kanoj Sarcar, linux-mm On Tue, 29 Jun 1999, Stephen C. Tweedie wrote: >Absolutely. The important thing is to do enough swapping to make sure >that unused data is not kicking around in memory. Maybe you don't want I know that sometimes it is the right thing to do. But consider a different scenario. You have a machine that does nothing but read 10 GB of data from disk in a loop. The data set is so big, and you reference it in such a round-robin fashion, that you have no chance of finding any of it in the page cache (but don't tell me not to use an LRU algorithm :). So what do you gain? You find most of your tasks swapped out: when you click on netscape on the other desktop you find yourself stalled. Then you change desktop, the program continues to read from disk in the background, and you find yourself stalled again the next time. In this case you gain _nothing_ from swapping out netscape. So I think we should make the swapout level at least configurable. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
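The looping-read scenario above can be checked with a small simulation. The sketch below is a toy, assuming a strict LRU cache of 8 pages and a file that is read sequentially over and over; `toy_lru_hits` and `TOY_CACHE_PAGES` are hypothetical names, and nothing here models the real page cache or the clock algorithm.

```c
#include <assert.h>

#define TOY_CACHE_PAGES 8   /* hypothetical cache size, in pages */

/*
 * Count cache hits when a file of file_pages pages is read sequentially
 * `passes` times through a strict LRU cache of TOY_CACHE_PAGES pages.
 * cache[0] is the most recently used page, cache[used - 1] the oldest.
 */
static long toy_lru_hits(int file_pages, int passes)
{
    int cache[TOY_CACHE_PAGES];
    int used = 0;
    long hits = 0;

    assert(file_pages > 0 && passes > 0);
    for (int p = 0; p < passes; p++) {
        for (int page = 0; page < file_pages; page++) {
            int pos = -1;

            for (int i = 0; i < used; i++) {
                if (cache[i] == page) {
                    pos = i;
                    break;
                }
            }
            if (pos >= 0)
                hits++;                 /* found: will move to front */
            else if (used < TOY_CACHE_PAGES)
                pos = used++;           /* miss with room: take a new slot */
            else
                pos = used - 1;         /* miss, cache full: evict the oldest */

            /* Shift younger entries down and put this page at the front. */
            for (int i = pos; i > 0; i--)
                cache[i] = cache[i - 1];
            cache[0] = page;
        }
    }
    return hits;
}
```

A file that fits (8 pages) hits on every access after the first pass, while a file just one page larger than the cache never hits at all, because each page is evicted shortly before it is needed again. In that regime extra cache pages gained by swapping out idle tasks buy nothing, which is the point of the message above.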
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-29 12:32 ` Andrea Arcangeli @ 1999-06-30 15:59 ` Stephen C. Tweedie 0 siblings, 0 replies; 60+ messages in thread From: Stephen C. Tweedie @ 1999-06-30 15:59 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Stephen C. Tweedie, Chuck Lever, Kanoj Sarcar, linux-mm Hi, On Tue, 29 Jun 1999 14:32:41 +0200 (CEST), Andrea Arcangeli <andrea@suse.de> said: > On Tue, 29 Jun 1999, Stephen C. Tweedie wrote: >> Absolutely. The important thing is to do enough swapping to make sure >> that unused data is not kicking around in memory. Maybe you don't want > I know that sometime is the right thing do to. > But think also a difference scenario. You have a machine that only reads > all the time from a disk 10giga of data in loop. Absolutely. The find|grep workload, for example. The point is that this memory load is different from the load imposed by a kernel build. If you are using file IO more, you need to be turning the cache over more. The old buffer cache had this property, and it worked very well indeed. The buffer cache would try to recycle itself in preference to growing, so for file-intensive workloads we would naturally evict cached data in preference to swapping, but for memory-intensive compute workloads we would be more likely to swap unused VM pages out. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-28 22:21 ` Stephen C. Tweedie 1999-06-28 22:57 ` Andrea Arcangeli @ 1999-06-29 1:00 ` Chuck Lever 1 sibling, 0 replies; 60+ messages in thread From: Chuck Lever @ 1999-06-29 1:00 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Kanoj Sarcar, andrea, torvalds, linux-mm On Mon, 28 Jun 1999, Stephen C. Tweedie wrote: > On Mon, 28 Jun 1999 17:32:05 -0400 (EDT), Chuck Lever <cel@monkey.org> > said: > > > well, except that kswapd itself doesn't free any memory. > > It has to. That was why kswapd was initially written, to ensure that > interrupt memory requests (eg. busy router boxes) don't starve of > memory. All of the benefits of kswapd came later. In normal kernels > the try_to_swap_out doesn't free memory, true enough, but kswapd calls > shrink_mmap() too to make sure it does make real progress in freeing > memory. again, foot in mouth. i meant kswapd doesn't free any memory *by simply swapping*. that's what i get for typing when i'm hungry. > > if you need evidence that shrink_mmap() will keep a system running without > > swapping, just run 2.3.8 :) :) > > 2.3.8 shows up slower on several benchmarks because of its reluctance to > swap. right, agreed. but it doesn't stall, it just slows down. - Chuck Lever -- corporate: <chuckl@netscape.com> personal: <chucklever@netscape.net> or <cel@monkey.org> The Linux Scalability project: http://www.citi.umich.edu/projects/linux-scalability/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-28 20:51 ` Kanoj Sarcar 1999-06-28 21:32 ` Chuck Lever @ 1999-06-28 22:08 ` Stephen C. Tweedie 1999-06-28 22:59 ` Andrea Arcangeli 1999-06-29 0:53 ` Chuck Lever 1 sibling, 2 replies; 60+ messages in thread From: Stephen C. Tweedie @ 1999-06-28 22:08 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Chuck Lever, andrea, torvalds, sct, linux-mm Hi, On Mon, 28 Jun 1999 13:51:03 -0700 (PDT), kanoj@google.engr.sgi.com (Kanoj Sarcar) said: >> or perhaps the kernel could start more than one kswapd (one per swap >> partition?). with my patch, regular processes never wait for swap out >> I/O, only kswapd does. This is a mistake: such blocking is one of the prime ways in which we can limit the rate at which processes can consume memory. > Oh no, I was not talking about exotic stuff like RT ... I was > simply pointing out that to prevent deadlocks, and guarantee forward > progress, you have to show that despite what underlying fs/driver > code does, at least one memory freer is free to do its job. Yep, which is why we have a separate kpiod right now: it guarantees that potential recursive fs locking stalls get shifted from kswapd to a separate thread to make sure that kswapd can always make progress. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-28 22:08 ` Stephen C. Tweedie @ 1999-06-28 22:59 ` Andrea Arcangeli 1999-06-29 0:53 ` Chuck Lever 1 sibling, 0 replies; 60+ messages in thread From: Andrea Arcangeli @ 1999-06-28 22:59 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Kanoj Sarcar, Chuck Lever, torvalds, linux-mm On Mon, 28 Jun 1999, Stephen C. Tweedie wrote: >Hi, > >On Mon, 28 Jun 1999 13:51:03 -0700 (PDT), kanoj@google.engr.sgi.com (Kanoj Sarcar) said: > >>> or perhaps the kernel could start more than one kswapd (one per swap >>> partition?). with my patch, regular processes never wait for swap out >>> I/O, only kswapd does. > >This is a mistake: such blocking is one of the prime ways in which we >can limit the rate at which processes can consume memory. Agreed. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-28 22:08 ` Stephen C. Tweedie 1999-06-28 22:59 ` Andrea Arcangeli @ 1999-06-29 0:53 ` Chuck Lever 1999-06-29 11:14 ` Stephen C. Tweedie 1 sibling, 1 reply; 60+ messages in thread From: Chuck Lever @ 1999-06-29 0:53 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Kanoj Sarcar, andrea, torvalds, linux-mm On Mon, 28 Jun 1999, Stephen C. Tweedie wrote: > On Mon, 28 Jun 1999 13:51:03 -0700 (PDT), kanoj@google.engr.sgi.com (Kanoj Sarcar) said: > > >> or perhaps the kernel could start more than one kswapd (one per swap > >> partition?). with my patch, regular processes never wait for swap out > >> I/O, only kswapd does. > > This is a mistake: such blocking is one of the prime ways in which we > can limit the rate at which processes can consume memory. whoops. i'm sorry, i mis-typed. i meant that regular processes never *dispatch* I/O. neither kswapd nor regular processes will wait. - Chuck Lever -- corporate: <chuckl@netscape.com> personal: <chucklever@netscape.net> or <cel@monkey.org> The Linux Scalability project: http://www.citi.umich.edu/projects/linux-scalability/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-29 0:53 ` Chuck Lever @ 1999-06-29 11:14 ` Stephen C. Tweedie 0 siblings, 0 replies; 60+ messages in thread From: Stephen C. Tweedie @ 1999-06-29 11:14 UTC (permalink / raw) To: Chuck Lever; +Cc: Stephen C. Tweedie, Kanoj Sarcar, andrea, torvalds, linux-mm Hi, On Mon, 28 Jun 1999 20:53:23 -0400 (EDT), Chuck Lever <cel@monkey.org> said: > whoops. i'm sorry, i mis-typed. i meant that regular processes never > *dispatch* I/O. neither kswapd nor regular processes will wait. Sorry? That's just the same problem, restated. If a regular process will never wait on a memory allocation then you have no way of throttling the memory allocation rate to the rate at which you can swap stuff out. That will kill your machine stone dead very rapidly under heavy memory load. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
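A rough sketch of the throttling argument above, as a tick-based toy model. It assumes a fixed swapout rate and an allocator that asks for a fixed number of pages per tick; the names (`toy_run`, `toy_result`) and the notion of a "tick" are hypothetical, and real allocation failure behaviour is far more involved than a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome of a run of the hypothetical model below. */
struct toy_result {
    long granted;   /* allocations that got a page */
    long failed;    /* allocations that found no page and did not wait */
    long waited;    /* ticks in which the allocator went to sleep */
};

/*
 * Each tick, swapout frees swapout_per_tick pages, then the allocator
 * asks for demand_per_tick pages one at a time.  If allocator_waits is
 * set, the allocator sleeps until the next tick as soon as memory runs
 * out, so it issues no further requests in that tick.  Otherwise every
 * remaining request in the tick simply fails.
 */
static struct toy_result toy_run(int free_pages, int demand_per_tick,
                                 int swapout_per_tick, int ticks,
                                 bool allocator_waits)
{
    struct toy_result r = { 0, 0, 0 };

    assert(free_pages >= 0 && demand_per_tick >= 0 && swapout_per_tick >= 0);
    for (int t = 0; t < ticks; t++) {
        free_pages += swapout_per_tick;
        for (int i = 0; i < demand_per_tick; i++) {
            if (free_pages > 0) {
                free_pages--;
                r.granted++;
            } else if (allocator_waits) {
                r.waited++;
                break;          /* blocked: demand is throttled */
            } else {
                r.failed++;     /* nothing slows the allocator down */
            }
        }
    }
    return r;
}
```

Starting with 10 free pages, a demand of 5 pages per tick and a swapout rate of 2 pages per tick, both variants are granted the same 30 pages over 10 ticks. The waiting allocator simply settles at the swapout rate, whereas the non-waiting one piles up 20 failed requests, which is a crude stand-in for the overload Stephen warns about.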
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-28 19:55 ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar 1999-06-28 20:33 ` Chuck Lever @ 1999-06-28 22:09 ` Stephen C. Tweedie 1 sibling, 0 replies; 60+ messages in thread From: Stephen C. Tweedie @ 1999-06-28 22:09 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Chuck Lever, andrea, torvalds, sct, linux-mm Hi, On Mon, 28 Jun 1999 12:55:23 -0700 (PDT), kanoj@google.engr.sgi.com (Kanoj Sarcar) said: > Agreed this would be a nice thing to be able to do ... Other than the > deadlock problem, there's another issue involved, I think. Processes > can go to sleep (inside drivers/fs for example while > mmaping/munmaping/faulting) holding their mmap_sem, so any solution > should be able to guarantee that (at least one of) the memory free'ers > do not go to sleep indefinitely (or for some time that is upto > driver/fs code to determine). Which is why we don't take the mm semaphore in swapout. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races 1999-06-28 19:39 ` Chuck Lever 1999-06-28 19:55 ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar @ 1999-06-28 20:45 ` Stephen C. Tweedie 1999-06-28 21:14 ` Chuck Lever 1 sibling, 1 reply; 60+ messages in thread From: Stephen C. Tweedie @ 1999-06-28 20:45 UTC (permalink / raw) To: Chuck Lever; +Cc: Kanoj Sarcar, Andrea Arcangeli, torvalds, sct, linux-mm Hi, On Mon, 28 Jun 1999 15:39:43 -0400 (EDT), Chuck Lever <cel@monkey.org> said: > i'm already working on a patch that will allow kswapd to grab the > mmap_sem for the task that is about to be swapped. this takes a > slightly different approach, since i'm focusing on kswapd and not on > swapoff. Don't, it will create a whole pile of new deadlock conditions. Think carefully about what happens when you take a page fault, lock the mm, and then need to allocate a new page in memory to satisfy the fault. You end up recursively calling try_to_free_page, and if that needs to reacquire the mm semaphore then you are in major trouble. The same mechanism can also block kswapd from making progress. We've looked at this before: the reason swapout doesn't take the semaphore is because the deadlock cases are worse than just living with the current unlocked behaviour. There's also the fact that swapping can deal with multiple mms at the same time: if you fork, you can get two mms which share the same COW page in memory or on swap. As a result, mm locking doesn't actually buy you enough extra protection for data pages to be worth it. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
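The recursion described above (page fault locks the mm, allocates, and the allocation re-enters the reclaim path) can be sketched in user space. This is a toy under stated assumptions: `toy_sem` stands in for a non-recursive semaphore such as mmap_sem, and a failed try-lock marks the point where a blocking down() would sleep forever. All names are hypothetical, and the return convention (true on success) is the sketch's own, not that of the kernel's down_trylock().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical non-recursive lock standing in for a per-mm semaphore. */
struct toy_sem {
    bool held;
};

/* Returns true if the lock was taken, false if it was already held. */
static bool toy_try_down(struct toy_sem *s)
{
    if (s->held)
        return false;
    s->held = true;
    return true;
}

static void toy_up(struct toy_sem *s)
{
    assert(s->held);
    s->held = false;
}

/*
 * Reclaim path.  If reclaim_takes_sem is set, swapping out a page of the
 * victim mm requires that mm's semaphore.  Returns false at the point
 * where a blocking acquire would never return.
 */
static bool toy_reclaim(struct toy_sem *victim_sem, bool reclaim_takes_sem)
{
    if (!reclaim_takes_sem)
        return true;
    if (!toy_try_down(victim_sem))
        return false;           /* would sleep on a lock we already hold */
    toy_up(victim_sem);
    return true;
}

/*
 * Page fault: lock our own mm, then allocate a page.  Under memory
 * pressure the allocation recurses into reclaim, which may pick the
 * faulting mm itself as its victim.
 */
static bool toy_page_fault(struct toy_sem *mmap_sem, bool reclaim_takes_sem)
{
    bool ok;

    if (!toy_try_down(mmap_sem))
        return false;
    ok = toy_reclaim(mmap_sem, reclaim_takes_sem);
    toy_up(mmap_sem);
    return ok;
}
```

With the unlocked swapout path the fault completes. With a swapout path that takes the semaphore, the fault reaches a state where it waits on itself. The sketch does not cover the second problem raised in the message, namely that pages shared between several mms after fork are not protected by any single mm's lock.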
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races 1999-06-28 20:45 ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie @ 1999-06-28 21:14 ` Chuck Lever 1999-06-28 21:25 ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar ` (2 more replies) 0 siblings, 3 replies; 60+ messages in thread From: Chuck Lever @ 1999-06-28 21:14 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Kanoj Sarcar, Andrea Arcangeli, torvalds, linux-mm On Mon, 28 Jun 1999, Stephen C. Tweedie wrote: > On Mon, 28 Jun 1999 15:39:43 -0400 (EDT), Chuck Lever <cel@monkey.org> > said: > > i'm already working on a patch that will allow kswapd to grab the > > mmap_sem for the task that is about to be swapped. this takes a > > slightly different approach, since i'm focusing on kswapd and not on > > swapoff. > > Don't, it will create a whole pile of new deadlock conditions. Think > carefully about what happens when you take a page fault, lock the mm, > and then need to allocate a new page in memory to satisfy the fault. > You end up recursively calling try_to_free_page, and if that needs to > reacquire the mm semaphore then you are in major trouble. that doesn't hurt because try_to_free_page() doesn't acquire anything but the kernel lock in my patch. it looks something like:

int try_to_free_pages(unsigned int gfp_mask)
{
	int priority = 6;
	int count = pager_daemon.swap_cluster;

	wake_up_process(kswapd_process);

	lock_kernel();
	do {
		while (shrink_mmap(priority, gfp_mask)) {
			if (!--count)
				goto done;
		}
		shrink_dcache_memory(priority, gfp_mask);
	} while (--priority >= 0);
done:
	/* maybe slow this thread down while kswapd catches up */
	if (gfp_mask & __GFP_WAIT) {
		current->policy |= SCHED_YIELD;
		schedule();
	}
	unlock_kernel();
	return 1;
}

> The same mechanism can also block kswapd from making progress. i'm re-using the mmap_sem, not the mm_sem. only the mmap_sem for the about-to-be-swapped object is acquired by kswapd. is that unsafe? or just silly? > There's also the fact that > swapping can deal with multiple mms at the same time: if you fork, you > can get two mms which share the same COW page in memory or on swap. > As a result, mm locking doesn't actually buy you enough extra > protection for data pages to be worth it. the eventual goal of my adventure is to drop the kernel lock while doing the page COW in do_wp_page, since in 2.3.6+, the COW is again protected because of race conditions with kswapd. this "protection" serializes all page faults behind a very expensive memory copy. what other ways are there to protect the COW operation while allowing some parallelism? it seems like this is worth a little complexity, IMO. - Chuck Lever -- corporate: <chuckl@netscape.com> personal: <chucklever@netscape.net> or <cel@monkey.org> The Linux Scalability project: http://www.citi.umich.edu/projects/linux-scalability/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 1999-06-28 21:14 ` Chuck Lever @ 1999-06-28 21:25 ` Kanoj Sarcar 1999-06-28 22:15 ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie 1999-06-28 22:48 ` Andrea Arcangeli 2 siblings, 0 replies; 60+ messages in thread From: Kanoj Sarcar @ 1999-06-28 21:25 UTC (permalink / raw) To: Chuck Lever; +Cc: sct, andrea, torvalds, linux-mm > > the eventual goal of my adventure is to drop the kernel lock while doing > the page COW in do_wp_page, since in 2.3.6+, the COW is again protected > because of race conditions with kswapd. this "protection" serializes all > page faults behind a very expensive memory copy. what other ways are > there to protect the COW operation while allowing some parallelism? it > seems like this is worth a little complexity, IMO. > I have already commented on my reservations about holding mmap_sem in kswapd/try_to_free_pages. Just thought I would point out that I have been thinking on the lines of eliminating kernel_lock from the vm code (experimentally initially under a CONFIG option), but I am yet to come up with a complete design. My current ideas involve a per mm spinning pte lock, a sleeping vmalist mutex (which processes never go to sleep holding). I am still struggling to understand whether a per page lock is needed or swapcache lock will do. In any case, if someone on the list is working on something similar, maybe we can exchange notes offline. Of course, we can not perturb performance for kernels which does not have the CONFIG option set. And Linus has to agree its worthwhile doing this work .... Thanks. Kanoj kanoj@engr.sgi.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races 1999-06-28 21:14 ` Chuck Lever 1999-06-28 21:25 ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar @ 1999-06-28 22:15 ` Stephen C. Tweedie 1999-06-28 22:48 ` Andrea Arcangeli 2 siblings, 0 replies; 60+ messages in thread From: Stephen C. Tweedie @ 1999-06-28 22:15 UTC (permalink / raw) To: Chuck Lever Cc: Stephen C. Tweedie, Kanoj Sarcar, Andrea Arcangeli, torvalds, linux-mm Hi, On Mon, 28 Jun 1999 17:14:17 -0400 (EDT), Chuck Lever <cel@monkey.org> said: > that doesn't hurt because try_to_free_page() doesn't acquire anything but > the kernel lock in my patch. Removing swapout from try_to_free_page is fundamentally broken, since it removes a critical rate limiter from the vm allocator paths. Acquiring it in kswapd is still a deadlock situation. > the eventual goal of my adventure is to drop the kernel lock while doing > the page COW in do_wp_page, since in 2.3.6+, the COW is again protected > because of race conditions with kswapd. OK, but doing this by adding extra mm locks to the swapout path is itself fraught with deadlocks, and doesn't get around the fact that multiple different mm's can reference the same swap page so you don't actually eliminate all of the races anyway by adding that locking. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races 1999-06-28 21:14 ` Chuck Lever 1999-06-28 21:25 ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar 1999-06-28 22:15 ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie @ 1999-06-28 22:48 ` Andrea Arcangeli 1999-06-29 1:29 ` Chuck Lever ` (2 more replies) 2 siblings, 3 replies; 60+ messages in thread From: Andrea Arcangeli @ 1999-06-28 22:48 UTC (permalink / raw) To: Chuck Lever; +Cc: Stephen C. Tweedie, Kanoj Sarcar, torvalds, linux-mm

On Mon, 28 Jun 1999, Chuck Lever wrote:

>that doesn't hurt because try_to_free_page() doesn't acquire anything but
>the kernel lock in my patch. it looks something like:
>
>int try_to_free_pages(unsigned int gfp_mask)
>{
>	int priority = 6;
>	int count = pager_daemon.swap_cluster;
>
>	wake_up_process(kswapd_process);
>
>	lock_kernel();
>	do {
>		while (shrink_mmap(priority, gfp_mask)) {
>			if (!--count)
>				goto done;
>		}
>
>		shrink_dcache_memory(priority, gfp_mask);
>	} while (--priority >= 0);
>done:
>	/* maybe slow this thread down while kswapd catches up */
>	if (gfp_mask & __GFP_WAIT) {
>		current->policy |= SCHED_YIELD;
>		schedule();
>	}
>	unlock_kernel();
>	return 1;
>}

How do you get the information about "when" to start the swap activities?
Maybe you have a separate try_to_free_pages() that does the plain-current
try_to_free_pages() and you call it only from kswapd?

My guess is that you'll end with zero cache and you'll have to page-in
from disk like h*ell when you reach swap with a resulting really bad
iteractive behaviour.

I think that being able to swapout from the process context is a very nice
feature because it cause the trashing task to block. This may looks not
very important with the current low_on_memory bit, but here I have a
per-task `trashing_memory' bitflag :).

Anyway we may re-implement recursive semaphores to avoid deadlocking into
the page fault path...

>the eventual goal of my adventure is to drop the kernel lock while doing
>the page COW in do_wp_page, since in 2.3.6+, the COW is again protected
>because of race conditions with kswapd. this "protection" serializes all

I thought a bit about that as well. I also coded a maybe possible
solution. Look at this snapshot:

Index: linux/mm/memory.c
===================================================================
RCS file: /var/cvs/linux/mm/memory.c,v
retrieving revision 1.1.1.10
retrieving revision 1.1.2.39
diff -u -r1.1.1.10 -r1.1.2.39
--- linux/mm/memory.c 1999/06/28 15:10:09 1.1.1.10
+++ linux/mm/memory.c 1999/06/28 17:08:59 1.1.2.39
@@ -607,16 +618,23 @@
 	struct page * page;

 	new_page = __get_free_page(GFP_USER);
-	/* Did swap_out() unmap the protected page while we slept? */
-	if (pte_val(*page_table) != pte_val(pte))
-		goto end_wp_page;
 	old_page = pte_page(pte);
 	if (MAP_NR(old_page) >= max_mapnr)
 		goto bad_wp_page;
 	tsk->min_flt++;
 	page = mem_map + MAP_NR(old_page);
-
+
+	lock_page(page);
 	/*
+	 * We can release the big kernel lock here since
+	 * kswapd will see the page locked. -Andrea
+	 */
+	unlock_kernel();
+	/* Did swap_out() unmap the protected page while we slept? */
+	if (pte_val(*page_table) != pte_val(pte))
+		goto end_wp_page;
+
+	/*
 	 * We can avoid the copy if:
 	 * - we're the only user (count == 1)
 	 * - the only other user is the swap cache,
@@ -630,19 +648,15 @@
 			break;
 		if (swap_count(page->offset) != 1)
 			break;
+		lru_unmap_cache(page);
 		delete_from_swap_cache(page);
+		put_page_refcount(page);
 		/* FallThrough */
 	case 1:
 		flush_cache_page(vma, address);
 		set_pte(page_table, pte_mkdirty(pte_mkwrite(pte)));
 		flush_tlb_page(vma, address);
-end_wp_page:
-		/*
-		 * We can release the kernel lock now.. Now swap_out will see
-		 * a dirty page and so won't get confused and flush_tlb_page
-		 * won't SMP race. -Andrea
-		 */
-		unlock_kernel();
+		UnlockPage(page);

 		if (new_page)
 			free_page(new_page);
@@ -652,6 +666,7 @@
 	if (!new_page)
 		goto no_new_page;

+	lru_unmap_cache(page);
 	if (PageReserved(page))
 		++vma->vm_mm->rss;
 	copy_cow_page(old_page,new_page);
@@ -660,18 +675,26 @@
 	flush_cache_page(vma, address);
 	set_pte(page_table, pte_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot))));
 	flush_tlb_page(vma, address);
-	unlock_kernel();
+	UnlockPage(page);
 	__free_page(page);
 	return 1;

 bad_wp_page:
+	unlock_kernel();
 	printk("do_wp_page: bogus page at address %08lx (%08lx)\n",address,old_page);
 	send_sig(SIGKILL, tsk, 1);
-no_new_page:
-	unlock_kernel();
 	if (new_page)
 		free_page(new_page);
 	return 0;
+no_new_page:
+	UnlockPage(page);
+	oom(tsk);
+	return 0;
+end_wp_page:
+	UnlockPage(page);
+	if (new_page)
+		free_page(new_page);
+	return 1;
 }

 /*

It's only a partial snapshot, but it should show the picture. Basically I
am locking down the page with the lock held, then when I have the page
locked (I may sleep as well to lock it) I check if kswapd freed the
mapping or if I can go ahead without the big kernel lock. It basically
works but I had not the time to test it carefully yet.

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply [flat|nested] 60+ messages in thread
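Editorial sketch (not part of the original message): the patch above takes the page lock first, which may sleep, and only afterwards re-checks that the pte still holds the value sampled before sleeping. The user-space model below shows just that "lock, then revalidate" ordering. All names in it (`pte_slot`, `cow_page`, `cow_begin`, `COW_BAIL`) are hypothetical stand-ins, and the plain `locked` field replaces the kernel's real `lock_page()`; it is a minimal illustration under those assumptions, not kernel code.

```c
/* Hypothetical stand-in for a page table entry slot. */
struct pte_slot {
	unsigned long val;
};

/* Hypothetical stand-in for struct page; 'locked' models the page lock bit. */
struct cow_page {
	int locked;
};

enum cow_result {
	COW_BAIL,	/* mapping changed while we slept: unlock and give up */
	COW_PROCEED	/* pte unchanged: safe to continue with the page locked */
};

/*
 * Lock the page (in the kernel this step may sleep), then compare the
 * current pte against the snapshot taken before sleeping.  On mismatch
 * the page is unlocked again, mirroring the end_wp_page path in the patch.
 */
static enum cow_result cow_begin(struct cow_page *page,
				 const struct pte_slot *slot,
				 unsigned long snapshot)
{
	page->locked = 1;		/* stands in for lock_page(page) */
	if (slot->val != snapshot) {
		page->locked = 0;	/* stands in for UnlockPage(page) */
		return COW_BAIL;
	}
	return COW_PROCEED;
}
```

The point of the ordering is that the check is only meaningful once nothing else can change the mapping behind the caller's back; checking before a sleeping lock would leave a window.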
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races 1999-06-28 22:48 ` Andrea Arcangeli @ 1999-06-29 1:29 ` Chuck Lever 1999-06-29 11:58 ` Stephen C. Tweedie 1999-06-29 12:09 ` Andrea Arcangeli 1999-06-29 11:55 ` Stephen C. Tweedie 1999-06-29 20:08 ` Andrea Arcangeli 2 siblings, 2 replies; 60+ messages in thread From: Chuck Lever @ 1999-06-29 1:29 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Stephen C. Tweedie, Kanoj Sarcar, torvalds, linux-mm On Tue, 29 Jun 1999, Andrea Arcangeli wrote: > On Mon, 28 Jun 1999, Chuck Lever wrote: > >that doesn't hurt because try_to_free_page() doesn't acquire anything but > >the kernel lock in my patch. it looks something like: > > > >int try_to_free_pages(unsigned int gfp_mask) > >{ > > int priority = 6; > > int count = pager_daemon.swap_cluster; > > > > wake_up_process(kswapd_process); > > > > lock_kernel(); > > do { > > while (shrink_mmap(priority, gfp_mask)) { > > if (!--count) > > goto done; > > } > > > > shrink_dcache_memory(priority, gfp_mask); > > } while (--priority >= 0); > >done: > > /* maybe slow this thread down while kswapd catches up */ > > if (gfp_mask & __GFP_WAIT) { > > current->policy |= SCHED_YIELD; > > schedule(); > > } > > unlock_kernel(); > > return 1; > >} > > How do you get the information about "when" to start the swap activities? try_to_free_pages() still wakes up kswapd whenever it is called. > Maybe you have a separate try_to_free_pages() that does the plain-current > try_to_free_pages() and you call it only from kswapd? yes, that's exactly what i did. what i can't figure out is why do the shrink_mmap in both places? seems like the shrink_mmap in kswapd is overkill if it has just been awoken by try_to_free_pages. > My guess is that you'll end with zero cache and you'll have to page-in > from disk like h*ell when you reach swap with a resulting really bad > iteractive behaviour. nope. it appears to work as well as the old way, maybe even a little faster. 
i still need to do more testing, though. > I think that being able to swapout from the process context is a very nice > feature because it cause the trashing task to block. This may looks not > very important with the current low_on_memory bit, but here I have a > per-task `trashing_memory' bitflag :). swapping out never blocks a thread, since the swap out I/O request is always asynchronous. line 162 of mm/vmscan.c :: /* OK, do a physical asynchronous write to swap. */ rw_swap_page(WRITE, entry, (char *) page, 0); stephen also mentioned "rate controlling" a trashing process, but since nothing in swap_out spins or sleeps, how could a process be slowed except by a little extra CPU time spent behind the global lock? that will slow everyone else down too, yes? seems like try_to_free_pages ought to make a clear effort to recognize a process that is growing quickly and slow it down by causing it to sleep. > >the eventual goal of my adventure is to drop the kernel lock while doing > >the page COW in do_wp_page, since in 2.3.6+, the COW is again protected > >because of race conditions with kswapd. this "protection" serializes all > > It's only a partial snapshot, but it should show the picture. Basically I > am locking down the page with the lock held, then when I have the page > locked (I may sleep as well to lock it) I check if kswapd freed the > mapping or if I can go ahead without the big kernel lock. It basically > works but I had not the time to test it carefully yet. locking pages is probably the right answer, IMHO. - Chuck Lever -- corporate: <chuckl@netscape.com> personal: <chucklever@netscape.net> or <cel@monkey.org> The Linux Scalability project: http://www.citi.umich.edu/projects/linux-scalability/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races 1999-06-29 1:29 ` Chuck Lever @ 1999-06-29 11:58 ` Stephen C. Tweedie 1999-06-29 12:09 ` Andrea Arcangeli 1 sibling, 0 replies; 60+ messages in thread From: Stephen C. Tweedie @ 1999-06-29 11:58 UTC (permalink / raw) To: Chuck Lever; +Cc: Andrea Arcangeli, Stephen C. Tweedie, Kanoj Sarcar, linux-mm

Hi,

On Mon, 28 Jun 1999 21:29:07 -0400 (EDT), Chuck Lever <cel@monkey.org>
said:

> yes, that's exactly what i did. what i can't figure out is why do the
> shrink_mmap in both places? seems like the shrink_mmap in kswapd is
> overkill if it has just been awoken by try_to_free_pages.

It hasn't necessarily. It may have been woken by networking activity.
If the memory requirements are being driven by interrupts, not
processes, then kswapd is the only chance for shrink_mmap to be called.

> stephen also mentioned "rate controlling" a trashing process, but since
> nothing in swap_out spins or sleeps, how could a process be slowed except
> by a little extra CPU time spent behind the global lock? that will slow
> everyone else down too, yes?

There are IO queue limits which will eventually stall the process.
ll_rw_block itself is one rate limiter. We also have a test in
rw_swap_page_base:

	/* Don't allow too many pending pages in flight.. */
	if (atomic_read(&nr_async_pages) > pager_daemon.swap_cluster)
		wait = 1;

which causes the swapout to become synchronous once we have filled the
swapper queues.

> seems like try_to_free_pages ought to make a clear effort to recognize a
> process that is growing quickly and slow it down by causing it to sleep.

It does.

--Stephen

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply [flat|nested] 60+ messages in thread
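Editorial sketch (not part of the original message): the rate limiter described above can be modelled in user space as a counter of in-flight asynchronous writes that forces the caller to wait once a threshold is exceeded. The field names `nr_async_pages` and `swap_cluster` follow the quoted kernel test; the struct and the two functions (`swap_queue`, `submit_swap_write`, `complete_swap_write`) are hypothetical, and plain ints are used instead of the kernel's atomic counter, so this is only a single-threaded illustration.

```c
#include <stdbool.h>

/* Hypothetical model of the swapper queue state. */
struct swap_queue {
	int nr_async_pages;	/* asynchronous writes currently in flight */
	int swap_cluster;	/* limit after which writes become synchronous */
};

/*
 * Submit one swap write.  Returns true when the caller has to wait for
 * the write to finish (synchronous), false when it was queued
 * asynchronously and the caller may carry on.
 */
static bool submit_swap_write(struct swap_queue *q)
{
	bool wait = false;

	/* Same comparison as the quoted test: too many pages in flight? */
	if (q->nr_async_pages > q->swap_cluster)
		wait = true;

	if (!wait)
		q->nr_async_pages++;	/* left in flight, completes later */
	return wait;
}

/* Called when one asynchronous write has completed. */
static void complete_swap_write(struct swap_queue *q)
{
	if (q->nr_async_pages > 0)
		q->nr_async_pages--;
}
```

With a limit of 2, the first three submissions are asynchronous and the fourth has to wait, which is how a process that dirties memory faster than the disk can drain it ends up stalled.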
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races 1999-06-29 1:29 ` Chuck Lever 1999-06-29 11:58 ` Stephen C. Tweedie @ 1999-06-29 12:09 ` Andrea Arcangeli 1999-06-29 15:27 ` Chuck Lever 1 sibling, 1 reply; 60+ messages in thread From: Andrea Arcangeli @ 1999-06-29 12:09 UTC (permalink / raw) To: Chuck Lever; +Cc: Stephen C. Tweedie, Kanoj Sarcar, torvalds, linux-mm On Mon, 28 Jun 1999, Chuck Lever wrote: >yes, that's exactly what i did. what i can't figure out is why do the >shrink_mmap in both places? seems like the shrink_mmap in kswapd is >overkill if it has just been awoken by try_to_free_pages. If you remove the shrink_mmap from kswapd then you'll start swapping all the time. shrink_mmap give us the information about the state of the VM. So if you run it then you know if you should start swapping or not. >faster. i still need to do more testing, though. I suggest you to run some memory hog that rotate 20/30mbyte of data in the swap to check iteractive performances. >swapping out never blocks a thread, since the swap out I/O request is >always asynchronous. line 162 of mm/vmscan.c :: > > /* OK, do a physical asynchronous write to swap. */ > rw_swap_page(WRITE, entry, (char *) page, 0); At some point you must stop. As worse when you go out of request. The rate at which you eat memory is far higher than the swapout speed. Since the out-of-request is a too large bank, we have a nr_async_pages limit after which we do sync I/O (set to SWAP_CLUSTER_MAX as default, 32 pages async than sync I/O). >stephen also mentioned "rate controlling" a trashing process, but since >nothing in swap_out spins or sleeps, how could a process be slowed except >by a little extra CPU time spent behind the global lock? that will slow >everyone else down too, yes? swapout stall. It has to stall since memory is faster than disk. >> It's only a partial snapshot, but it should show the picture. 
Basically I >> am locking down the page with the lock held, then when I have the page >> locked (I may sleep as well to lock it) I check if kswapd freed the >> mapping or if I can go ahead without the big kernel lock. It basically >> works but I had not the time to test it carefully yet. > >locking pages is probably the right answer, IMHO. Happy to hear that :). Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races 1999-06-29 12:09 ` Andrea Arcangeli @ 1999-06-29 15:27 ` Chuck Lever 0 siblings, 0 replies; 60+ messages in thread From: Chuck Lever @ 1999-06-29 15:27 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Stephen C. Tweedie, Kanoj Sarcar, torvalds, linux-mm On Tue, 29 Jun 1999, Andrea Arcangeli wrote: > On Mon, 28 Jun 1999, Chuck Lever wrote: > >yes, that's exactly what i did. what i can't figure out is why do the > >shrink_mmap in both places? seems like the shrink_mmap in kswapd is > >overkill if it has just been awoken by try_to_free_pages. > > If you remove the shrink_mmap from kswapd then you'll start swapping all > the time. yes, i discovered that rather quickly when i tried it. :) > shrink_mmap give us the information about the state of > the VM. So if you run it then you know if you should start swapping or > not. but it also "destroys" that state while it's running. it would be much nicer, i think, if there was a way to ascertain the state cheaply, then decide whether to shrink caches or swap, or both. i think a better decision could be made this way. what do you think about separating shrink_mmap's function into two separate pieces: maintain state information, and trim caches? i've been studying a hard knee that occurs just as the system exhausts memory and try_to_free_pages is invoked. performance drops rather dramatically. while i was playing around with kswapd, i noticed that when my system started to swap more during low-memory scenarios, it seemed to perform better; the knee is "softened". 
by switching back and forth between an "all swap all the time" model and an "all shrink_mmap all the time" model, it was clear to me, at least for my workload, that shrink_mmap is valuable up to a point, but swapping is quite effective at increasing available memory because its heuristic for choosing a memory-idle process is very good (based on watching subsequent swap-in numbers), and there is probably 10-12M of idle crap that can be flushed if the system gets loaded down, that currently is left in RAM. in my opinion, the kernel is using shrink_mmap too much and not swapping enough. but it isn't clear to me exactly how to rebalance the two, or how to gather more information in do_try_to_free_pages to make a better decision about how to get back some memory. > I suggest you to run some memory hog that rotate 20/30mbyte of data in the > swap to check iteractive performances. i have a test that does roughly this -- diff two kernel source trees. however, it's clear that breaking try_to_free_pages and kswapd into two separate paths won't provide the locking gain i was after. however, unrelated to the above discussion, do_try_to_free_pages may hold onto the kernel lock for a long time, so finding a safe place for shrink_mmap and/or swap_out to release it occasionally would help. - Chuck Lever -- corporate: <chuckl@netscape.com> personal: <chucklever@netscape.net> or <cel@monkey.org> The Linux Scalability project: http://www.citi.umich.edu/projects/linux-scalability/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races 1999-06-28 22:48 ` Andrea Arcangeli 1999-06-29 1:29 ` Chuck Lever @ 1999-06-29 11:55 ` Stephen C. Tweedie 1999-06-29 20:08 ` Andrea Arcangeli 2 siblings, 0 replies; 60+ messages in thread From: Stephen C. Tweedie @ 1999-06-29 11:55 UTC (permalink / raw) To: Andrea Arcangeli Cc: Chuck Lever, Stephen C. Tweedie, Kanoj Sarcar, torvalds, linux-mm Hi, On Tue, 29 Jun 1999 00:48:18 +0200 (CEST), Andrea Arcangeli <andrea@suse.de> said: > I thought a bit about that as well. I also coded a maybe possible > solution. Look at this snapshot: Much better: the synchronisation between the page fault and the swapper is per-page, not per-mm, this way. That way the swapper can afford just to skip the one locked page rather than block for an mm lock. My only reservation is that it's a bit ugly to overload the "locked" bit this way, but it's the only obvious test in try_to_swap_out that we can use. Adding a new PG_Locked_PTE flag for the page, to indicate that somebody is relying on this pte for COW operation and kswapd should skip it, would be an alternative: it makes the intent much more clear (and keeps PG_Locked purely for IO locking, which is really as it should be). --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 60+ messages in thread
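Editorial sketch (not part of the original message): the per-page approach praised above lets the swapper skip a single busy page instead of blocking on a whole mm. The model below shows that skip logic with a separate flag for "a fault path is relying on this pte", as suggested in the message. The bit values, the `page_sketch` struct and `swap_out_scan()` are assumptions made for illustration; the kernel's real flag handling is atomic and is not reproduced here.

```c
#include <stddef.h>

#define PG_LOCKED	(1u << 0)	/* page is locked for I/O */
#define PG_LOCKED_PTE	(1u << 1)	/* hypothetical: COW in progress on this pte */

/* Hypothetical stand-in for struct page. */
struct page_sketch {
	unsigned int flags;
	int swapped_out;
};

/*
 * Walk the candidate pages and swap out every page that is not busy.
 * Busy pages are skipped, never waited on, so the scan cannot block
 * behind a fault that holds a page.  Returns the number of pages freed.
 */
static int swap_out_scan(struct page_sketch *pages, size_t n)
{
	int freed = 0;

	for (size_t i = 0; i < n; i++) {
		if (pages[i].flags & (PG_LOCKED | PG_LOCKED_PTE))
			continue;	/* somebody else owns this page right now */
		pages[i].swapped_out = 1;
		freed++;
	}
	return freed;
}
```

Keeping the two bits separate means a reader of the scan can tell an I/O lock from a COW reservation, which is the clarity argument made in the message.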
* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races 1999-06-28 22:48 ` Andrea Arcangeli 1999-06-29 1:29 ` Chuck Lever 1999-06-29 11:55 ` Stephen C. Tweedie @ 1999-06-29 20:08 ` Andrea Arcangeli 2 siblings, 0 replies; 60+ messages in thread From: Andrea Arcangeli @ 1999-06-29 20:08 UTC (permalink / raw) To: Chuck Lever; +Cc: Stephen C. Tweedie, Kanoj Sarcar, torvalds, linux-mm

On Tue, 29 Jun 1999, Andrea Arcangeli wrote:

For the record: the snapshot wasn't SMP safe.

>	/*
>+	 * We can release the big kernel lock here since
>+	 * kswapd will see the page locked. -Andrea
>+	 */
>+	unlock_kernel();

This was a bit too early (perfectly ok for kswapd but not ok for the swap
cache SMP safety). We must first take over the swap cache and run
swap_count before being allowed to release the big kernel lock. So this
should be moved a bit lower...

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions 1999-06-21 18:46 ` Kanoj Sarcar 1999-06-21 23:44 ` Kanoj Sarcar @ 1999-06-28 22:36 ` Stephen C. Tweedie 1999-06-28 23:24 ` Kanoj Sarcar 1 sibling, 1 reply; 60+ messages in thread From: Stephen C. Tweedie @ 1999-06-28 22:36 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, linux-mm

Hi,

On Mon, 21 Jun 1999 11:46:27 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

>> Look no further than swap_in(), which knows that there is no pte (so
>> swapout concurrency is not a problem) and it holds the mmap lock (so
>> there are no concurrent swap_ins on the page). It reads in the page and
>> unconditionally sets up the pte to point to it, assuming that nobody
>> else can conceivably set the pte while we do the swap ourselves.

> Hmm, am I being fooled by the comment in swap_in?

> /*
> * The tests may look silly, but it essentially makes sure that
> * no other process did a swap-in on us just as we were waiting.
> *

afaik only swapoff can trigger that. Concurrent swap-in on the same
entry can occur into the page cache, but not into the page tables
because those are protected by the semaphore.

--Stephen

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: filecache/swapcache questions 1999-06-28 22:36 ` filecache/swapcache questions Stephen C. Tweedie @ 1999-06-28 23:24 ` Kanoj Sarcar 0 siblings, 0 replies; 60+ messages in thread From: Kanoj Sarcar @ 1999-06-28 23:24 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: linux-mm

>
> Hi,
>
> On Mon, 21 Jun 1999 11:46:27 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
>
> >> Look no further than swap_in(), which knows that there is no pte (so
> >> swapout concurrency is not a problem) and it holds the mmap lock (so
> >> there are no concurrent swap_ins on the page). It reads in the page and
> >> unconditionally sets up the pte to point to it, assuming that nobody
> >> else can conceivably set the pte while we do the swap ourselves.
>
> > Hmm, am I being fooled by the comment in swap_in?
>
> > /*
> > * The tests may look silly, but it essentially makes sure that
> > * no other process did a swap-in on us just as we were waiting.
> > *
>
> afaik only swapoff can trigger that. Concurrent swap-in on the same
> entry can occur into the page cache, but not into the page tables
> because those are protected by the semaphore.
>
> --Stephen
>

Right ... I was trying to counter your argument that swapoff needs
to hold the mmap_sem to protect ptes (except for the fork/exit/swapin
races) by pointing out that pte updates are already protected by
kernel_lock.

Kanoj

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply [flat|nested] 60+ messages in thread