* PG_swap_entry bug in recent kernels
From: Ben LaHaise @ 2000-04-03 22:22 UTC
To: torvalds; +Cc: linux-mm

The following one-liner fixes a painful bug present in recent kernels:
swap cache pages left in the LRU lists and subsequently reclaimed by
shrink_mmap were resulting in new pages having the PG_swap_entry bit
set. This leads to invalid swap entries being put into users' page
tables if the page is eventually swapped out. This was nasty to track
down.

		-ben

diff -ur 2.3.99-pre4-3/mm/swap_state.c test-pre4-3/mm/swap_state.c
--- 2.3.99-pre4-3/mm/swap_state.c	Mon Dec  6 13:19:45 1999
+++ test-pre4-3/mm/swap_state.c	Mon Apr  3 17:59:30 2000
@@ -80,6 +80,7 @@
 #endif
 	remove_from_swap_cache(page);
 	swap_free(entry);
+	clear_bit(PG_swap_entry, &page->flags);
 }
 
 /*

* Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-04 15:06 UTC
To: Ben LaHaise; +Cc: torvalds, linux-mm

On Mon, 3 Apr 2000, Ben LaHaise wrote:

>The following one-liner fixes a painful bug present in recent kernels: swap
>cache pages left in the LRU lists and subsequently reclaimed by
>shrink_mmap were resulting in new pages having the PG_swap_entry bit set.

The patch is obviously wrong and shouldn't be applied. You missed the
semantics of the PG_swap_entry bitflag entirely. That bit is meant
_only_ to allow swapping the same page out to the same swap place
across write-pagein faults (to avoid swap fragmentation upon
write page faults). If you clear the bit within remove_from_swap_cache
(which gets called upon a write swapin fault) you may as well drop the
bitflag completely. PG_swap_entry is meant to give pages persistence
in the swap space.

>This leads to invalid swap entries being put into users' page tables if the
>page is eventually swapped out. This was nasty to track down.

That's obviously not the case. If you have that problem you'd better
continue to debug it. PG_swap_entry can _only_ affect performance (not
stability). You can set or modify that bit however you want, at any
time, anywhere, and the system will remain rock solid; only performance
can change.

You can trivially verify that. The bitflag is read _only_ in
acquire_swap_entry, and that function will run successfully and safely
regardless of the PG_swap_entry bitflag's setting.

That said, I obviously agree free pages shouldn't have the bit set,
since they aren't mapped anymore and so it makes no sense to provide
persistence in the swap space to unallocated pages :). I've seen where
we have a problem in not clearing the bit, but the fix definitely isn't
to clear it in the swapin-modify path.

Andrea

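For reference, a condensed sketch of the reuse path Andrea describes,
based on the acquire_swap_entry() code quoted later in this thread;
locking, the SWP_WRITEOK check, and the exact placement of the
PG_swap_entry test are simplified here, so treat it as an illustration
rather than the real code:

	swp_entry_t acquire_swap_entry_sketch(struct page *page)
	{
		swp_entry_t entry;
		struct swap_info_struct *p;
		unsigned long offset, type;

		/* page->index caches the page's old swap location */
		if (!test_bit(PG_swap_entry, &page->flags) || !page->index)
			goto fresh;
		entry.val = page->index;
		type = SWP_TYPE(entry);
		if (type >= nr_swapfiles)
			goto fresh;		/* swap device is gone */
		p = type + swap_info;
		offset = SWP_OFFSET(entry);
		if (offset >= p->max || p->swap_map[offset])
			goto fresh;		/* stale or re-used slot */
		p->swap_map[offset] = 1;	/* grab the old slot again */
		nr_swap_pages--;
		return entry;
	fresh:
		return get_swap_page();		/* allocate a new slot */
	}
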
* Re: PG_swap_entry bug in recent kernels
From: Rik van Riel @ 2000-04-04 15:46 UTC
To: Andrea Arcangeli; +Cc: Ben LaHaise, torvalds, linux-mm

On Tue, 4 Apr 2000, Andrea Arcangeli wrote:
> On Mon, 3 Apr 2000, Ben LaHaise wrote:
>
> >The following one-liner fixes a painful bug present in recent kernels: swap
> >cache pages left in the LRU lists and subsequently reclaimed by
> >shrink_mmap were resulting in new pages having the PG_swap_entry bit set.
>
> The patch is obviously wrong and shouldn't be applied. You missed the
> semantics of the PG_swap_entry bitflag entirely.

[snip]

> That said, I obviously agree free pages shouldn't have the bit
> set, since they aren't mapped anymore and so it makes no sense
> to provide persistence in the swap space to unallocated pages
> :). I've seen where we have a problem in not clearing the bit,
> but the fix definitely isn't to clear it in the swapin-modify
> path.

You might want to have a look at _where_ Ben's patch applies.

void __delete_from_swap_cache(struct page *page)
{
	swp_entry_t entry;

	entry.val = page->index;

#ifdef SWAP_CACHE_INFO
	swap_cache_del_total++;
#endif
	remove_from_swap_cache(page);
	swap_free(entry);
+	clear_bit(PG_swap_entry, &page->flags);
}

When we remove a page from the swap cache, it seems fair to me
that we _really_ remove it from the swap cache.

If __delete_from_swap_cache() is called from a wrong code path,
that's something that should be fixed, of course (but that's
orthogonal to this).

To quote from memory.c::do_swap_page():

	if (write_access && !is_page_shared(page)) {
		delete_from_swap_cache_nolock(page);
		UnlockPage(page);

If you think this is a bug, please fix it here...

cheers,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

* Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-04 16:50 UTC
To: riel; +Cc: Ben LaHaise, Linus Torvalds, linux-mm

On Tue, 4 Apr 2000, Rik van Riel wrote:

>You might want to have a look at _where_ Ben's patch applies.
>
>void __delete_from_swap_cache(struct page *page)
>{
>	swp_entry_t entry;
>
>	entry.val = page->index;
>
>#ifdef SWAP_CACHE_INFO
>	swap_cache_del_total++;
>#endif
>	remove_from_swap_cache(page);
>	swap_free(entry);
>+	clear_bit(PG_swap_entry, &page->flags);
>}
>
>When we remove a page from the swap cache, it seems fair to me
>that we _really_ remove it from the swap cache.

Are you sure you didn't mistake PG_swap_entry for PG_swap_cache? We're
talking about PG_swap_entry here. The whole point of that bit is that
it remains set on anonymous pages that aren't in the swap cache, so
that the next time we re-add them to the swap cache we'll try to swap
them out to the same swap entry the page occupied before.

>If __delete_from_swap_cache() is called from a wrong code path,
>that's something that should be fixed, of course (but that's
>orthogonal to this).

__delete_from_swap_cache is called by delete_from_swap_cache_nolock,
which is called by do_swap_page, which does the swapin.

>To quote from memory.c::do_swap_page():
>
>	if (write_access && !is_page_shared(page)) {
>		delete_from_swap_cache_nolock(page);
>		UnlockPage(page);
>
>If you think this is a bug, please fix it here...

The above quoted code is correct.

Andrea

* Re: PG_swap_entry bug in recent kernels
From: Ben LaHaise @ 2000-04-04 17:06 UTC
To: Andrea Arcangeli; +Cc: riel, Linus Torvalds, linux-mm

On Tue, 4 Apr 2000, Andrea Arcangeli wrote:
> On Tue, 4 Apr 2000, Rik van Riel wrote:
> >When we remove a page from the swap cache, it seems fair to me
> >that we _really_ remove it from the swap cache.
>
> Are you sure you didn't mistake PG_swap_entry for PG_swap_cache?

Yes, I'm sure. What's happening is that the presence of PG_swap_entry
on the page means that page->index is being returned as the swap entry.
This is completely bogus after the page has been freed and reallocated.
Just try running a swap test under any recent devel kernel -- eventually
you will see an invalid swap entry show up in someone's pte, causing a
random SIGSEGV (it's as if the entry was marked PROT_NONE -- present,
but no user access).

> We're talking about PG_swap_entry here. The whole point of that bit is
> that it remains set on anonymous pages that aren't in the swap cache,
> so that the next time we re-add them to the swap cache we'll try to
> swap them out to the same swap entry the page occupied before.

Which is bogus. If it's an anonymous page that we want to swap out to
the same swap entry, leave it in the swap cache.

> >If __delete_from_swap_cache() is called from a wrong code path,
> >that's something that should be fixed, of course (but that's
> >orthogonal to this).
>
> __delete_from_swap_cache is called by delete_from_swap_cache_nolock,
> which is called by do_swap_page, which does the swapin.

As well as from shrink_mmap.

> >To quote from memory.c::do_swap_page():
> >
> >	if (write_access && !is_page_shared(page)) {
> >		delete_from_swap_cache_nolock(page);
> >		UnlockPage(page);
> >
> >If you think this is a bug, please fix it here...
>
> The above quoted code is correct.

The code path that this patch really affects is shrink_mmap ->
__delete_from_swap_cache. Clearing the bit from shrink_mmap is an
option, but it doesn't make that much sense to me; if we're removing a
page from the swap cache, why aren't we clearing the PG_swap_entry bit?
I'd rather leave the page in the swap cache and set PG_dirty on it than
have such obscure semantics.

		-ben

* Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-04 18:03 UTC
To: Ben LaHaise; +Cc: riel, Linus Torvalds, linux-mm

On Tue, 4 Apr 2000, Ben LaHaise wrote:

>try running a swap test under any recent devel kernel -- eventually you
>will see an invalid swap entry show up in someone's pte causing a random

acquire_swap_entry() should be doing all the safety checking needed to
make sure page->index is caching a _valid_ swap entry. If it's garbage,
get_swap_page() will be called instead and a valid entry will be
allocated (and if get_swap_page() returns an invalid entry, that's
definitely not related to the PG_swap_entry logic anymore).

Could you explain to me how acquire_swap_entry() can return an invalid
swap entry (starting from a random page->index, of course)? I can't
exclude that there's a bug, but acquire_swap_entry was meant to return
only valid entries despite any possible garbage in page->index, and it
seems to be doing that.

>SIGSEGV (it's as if the entry was marked PROT_NONE -- present, but no
>user access).

I'm not saying you are not seeing that; I'm only trying to understand
how it can be related to the PG_swap_entry logic.

>> We're talking about PG_swap_entry here. The whole point of that bit is
>> that it remains set on anonymous pages that aren't in the swap cache,
>> so that the next time we re-add them to the swap cache we'll try to
>> swap them out to the same swap entry the page occupied before.
>
>Which is bogus. If it's an anonymous page that we want to swap out to the
>same swap entry, leave it in the swap cache.

There's a very basic reason why we can't leave it in the swap cache:
the swap cache is a read-only entity, and if you do a write access you
can't leave the page in the swap cache and change it without updating
its on-disk counterpart. So we always remove anonymous pages from the
swap cache upon swapin-_write_-faults. That's also how 2.2.x works.

Then I noticed it was possible to give persistence on the swap also to
the dirtied pages, without making the swap cache dirty, by adding the
PG_swap_entry bit that tells us whether page->index is currently
caching the last swap entry where the page was allocated on the swap.
That does all the job, and we don't need a dirty swap cache anymore.

>> >If __delete_from_swap_cache() is called from a wrong code path,
>> >that's something that should be fixed, of course (but that's
>> >orthogonal to this).
>>
>> __delete_from_swap_cache is called by delete_from_swap_cache_nolock,
>> which is called by do_swap_page, which does the swapin.
>
>As well as from shrink_mmap.

I would not be complaining about your patch if you put the clear_bit
within shrink_mmap :). BTW, the SHM unmap points also have to be
checked to make sure PG_swap_entry gets cleared. SHM uses the swap
cache too, and shares the do_swap_page code.

>> >To quote from memory.c::do_swap_page():
>> >
>> >	if (write_access && !is_page_shared(page)) {
>> >		delete_from_swap_cache_nolock(page);
>> >		UnlockPage(page);
>> >
>> >If you think this is a bug, please fix it here...
>>
>> The above quoted code is correct.
>
>The code path that this patch really affects is shrink_mmap ->
>__delete_from_swap_cache.  Clearing the bit from shrink_mmap is an option,
>but it doesn't make that much sense to me; if we're removing a page from
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>the swap cache, why aren't we clearing the PG_swap_entry bit?  I'd rather
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As I just said, the whole point of PG_swap_entry is to be set on
regular anonymous swappable pages, _not_ on swap cache pages at all. So
it really makes no sense to clear the PG_swap_entry bit while removing
a page from the swap cache, and I don't see your argument. We have
instead to cover the points where a swappable page stops being
swappable. There's also the shrink_mmap case to cover, where we free
the page.

I'm thinking about whether we can avoid having to cover the shrink_mmap
case by changing the point where we set the PG_swap_entry bit. I think
it would be possible if we avoided the double page fault in the
write-swapin-access + page-is-shared case. But right now I think it's
cleaner to keep the double page fault and to clear PG_swap_entry within
shrink_mmap while dropping the page from the cache.

>leave the page in the swap cache and set PG_dirty on it than have such
>obscure semantics.

If the page is shared you can't simply set the PG_dirty bit and change
the contents of the page, or you'll screw up all the other in-core and
on-disk users of the page. You have at least to do a COW and re-add the
COWed page to a new swap cache entry, thus by definition failing to
give persistence to the page in the swap space.

In general (also for non-shared pages) setting PG_dirty is inferior
because it would waste swap space by keeping busy, on the swap side, a
swap entry that is not uptodate and that's really not a cache (there's
no way somebody else can fault in the same swap cache page, since the
user who did the write access is now the only user of the page and has
it mapped in its pte, so it can't fault on it). The reason we keep the
page in the swap cache across a read-only access is to skip a write to
disk if we run low on memory, but we can't skip the write to disk in
the write-access case...

In the case of multiple users on the swap side (for example when a
process forks with some pages in swap) PG_swap_entry can allow the two
tasks to share the same swap entry if they get swapped out and swapped
in at different times. A PG_dirty swap cache wouldn't allow that,
because of the is-page-shared issue mentioned two paragraphs above.

Andrea

* [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Ben LaHaise @ 2000-04-06 22:11 UTC
To: Andrea Arcangeli; +Cc: riel, Linus Torvalds, linux-mm

On Tue, 4 Apr 2000, Andrea Arcangeli wrote:

> Could you explain to me how acquire_swap_entry() can return an invalid
> swap entry (starting from a random page->index, of course)? I can't
> exclude that there's a bug, but acquire_swap_entry was meant to return
> only valid entries despite any possible garbage in page->index, and it
> seems to be doing that.

First off, it doesn't verify that all of the bits which should be clear
in a swap entry actually are -- e.g. bit 0 (present) can be and is
getting set. It would have to rebuild and compare the swap entry
produced by SWP_ENTRY(type, offset) to be 100% sure that it is a valid
swap entry.

And yes, I see what you're doing with the PG_swap_entry code now,
although I'm thinking it might be better done by taking a look at what
swap entries are present in the page tables near the page (since
otherwise a pair of fork()ed processes could end up splitting
contiguous swap entries if the swapout is timed just right). But that's
for later...

> >As well as from shrink_mmap.
>
> I would not be complaining about your patch if you put the clear_bit
> within shrink_mmap :).

Heheh, okay, I've moved it there. Just in case, I've also added a BUG()
check in free_pages_ok to make sure there aren't any other places that
have been missed.

		-ben

diff -ur 2.3.99-pre4-4/mm/filemap.c linux-test/mm/filemap.c
--- 2.3.99-pre4-4/mm/filemap.c	Thu Apr  6 15:03:05 2000
+++ linux-test/mm/filemap.c	Thu Apr  6 17:50:39 2000
@@ -300,6 +300,7 @@
 		if (PageSwapCache(page)) {
 			spin_unlock(&pagecache_lock);
 			__delete_from_swap_cache(page);
+			ClearPageSwapEntry(page);
 			goto made_inode_progress;
 		}
diff -ur 2.3.99-pre4-4/mm/page_alloc.c linux-test/mm/page_alloc.c
--- 2.3.99-pre4-4/mm/page_alloc.c	Thu Apr  6 15:03:05 2000
+++ linux-test/mm/page_alloc.c	Thu Apr  6 17:38:10 2000
@@ -110,6 +110,8 @@
 		BUG();
 	if (PageDecrAfter(page))
 		BUG();
+	if (PageSwapEntry(page))
+		BUG();
 
 	zone = page->zone;

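A minimal sketch of the round-trip validation Ben is suggesting above
(the helper name is hypothetical; SWP_TYPE, SWP_OFFSET and SWP_ENTRY
are the existing macros):

	/*
	 * Rebuild the entry from its decoded type/offset; any stray
	 * bits outside those fields -- e.g. bit 0, the pte present
	 * bit -- make the round trip fail.
	 */
	static inline int swap_entry_wellformed(swp_entry_t entry)
	{
		swp_entry_t rebuilt =
			SWP_ENTRY(SWP_TYPE(entry), SWP_OFFSET(entry));
		return entry.val == rebuilt.val;
	}

acquire_swap_entry() could apply such a check to the cached page->index
before trusting it, in addition to its existing range checks on the
type and offset.
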
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-07 10:45 UTC
To: Ben LaHaise; +Cc: riel, Linus Torvalds, linux-mm

On Thu, 6 Apr 2000, Ben LaHaise wrote:

>First off, it doesn't verify that all of the bits which should be clear
>in a swap entry actually are -- e.g. bit 0 (present) can be and is
>getting set. It would have to rebuild and compare the swap entry
>produced by SWP_ENTRY(type, offset) to be 100% sure that it is a valid
>swap entry.

That shouldn't be a problem. What we care about is returning a _valid_
swap entry from acquire_swap_entry; if that isn't a cached swap entry,
that's a minor problem (not a stability issue, at least). However, I
agree the current design is not very clean, not very strict, and too
silent. So now I make sure the swap-entry bitflag is set only on a good
page->index.

>Heheh, okay, I've moved it there. Just in case, I've also added a BUG()
>check in free_pages_ok to make sure there aren't any other places that
>have been missed.

That's certainly a good idea; I did that too, indeed :). I'm working on
that, and right now I am stuck fixing up SMP/non-SMP races and bugs in
the VM that I found while checking the swap-entry stuff. However, the
swap-entry stuff in my current tree should be fixed now.

This is a snapshot of my tree, so that we can stay in sync. I think the
stuff below is ready for inclusion, but if you don't agree with
something let me know, of course; thanks. It seems to run stably here.

I also included some early fixes (the non-intrusive ones): an SMP race
fix for swapoff/get_swap_page, an SMP race fix for the vmlist access
lock during swapoff, a fix so that do_wp_page knows a swap cache page
can have reference count 3 without being shared, and a change so that
the highmem page replacement also caches the swap-entry bitflag (I just
copy all the flags, which is faster). It doesn't need to copy the
mapping pointer, since it's always NULL and the new_page has it NULL
too.

It's against 2.3.99-pre4-4.

diff -urN ref/mm/filemap.c swap-entry-1/mm/filemap.c
--- ref/mm/filemap.c	Thu Apr  6 01:00:51 2000
+++ swap-entry-1/mm/filemap.c	Fri Apr  7 04:44:27 2000
@@ -300,6 +300,8 @@
 		if (PageSwapCache(page)) {
 			spin_unlock(&pagecache_lock);
 			__delete_from_swap_cache(page);
+			/* the page is local to us now */
+			page->flags &= ~(1UL << PG_swap_entry);
 			goto made_inode_progress;
 		}
diff -urN ref/mm/highmem.c swap-entry-1/mm/highmem.c
--- ref/mm/highmem.c	Mon Apr  3 03:21:59 2000
+++ swap-entry-1/mm/highmem.c	Fri Apr  7 03:48:46 2000
@@ -75,7 +75,7 @@
 
 	/* Preserve the caching of the swap_entry. */
 	highpage->index = page->index;
-	highpage->mapping = page->mapping;
+	highpage->flags = page->flags;
 
 	/*
 	 * We can just forget the old page since
diff -urN ref/mm/memory.c swap-entry-1/mm/memory.c
--- ref/mm/memory.c	Fri Apr  7 12:17:24 2000
+++ swap-entry-1/mm/memory.c	Fri Apr  7 12:26:53 2000
@@ -838,6 +838,7 @@
 	 */
 	switch (page_count(old_page)) {
 	case 2:
+	case 3:
 		/*
 		 * Lock the page so that no one can look it up from
 		 * the swap cache, grab a reference and start using it.
@@ -880,7 +881,19 @@
 		new_page = old_page;
 	}
 	spin_unlock(&tsk->mm->page_table_lock);
-	__free_page(new_page);
+	/*
+	 * We're releasing a page, it can be an anonymous
+	 * page as well. Since we don't hold any lock (except
+	 * the mmap_sem semaphore) the other user of the anonymous
+	 * page may have released it from under us and now we
+	 * could be the only owner of the page, thus put_page_testzero() can
+	 * return 1, and we have to clear the swap-entry
+	 * bitflag in such case.
+	 */
+	if (put_page_testzero(new_page)) {
+		new_page->flags &= ~(1UL << PG_swap_entry);
+		__free_pages_ok(new_page, 0);
+	}
 	return 1;
 
 bad_wp_page:
diff -urN ref/mm/page_alloc.c swap-entry-1/mm/page_alloc.c
--- ref/mm/page_alloc.c	Thu Apr  6 01:00:52 2000
+++ swap-entry-1/mm/page_alloc.c	Thu Apr  6 05:05:59 2000
@@ -110,6 +110,8 @@
 		BUG();
 	if (PageDecrAfter(page))
 		BUG();
+	if (PageSwapEntry(page))
+		BUG();
 
 	zone = page->zone;
diff -urN ref/mm/swap_state.c swap-entry-1/mm/swap_state.c
--- ref/mm/swap_state.c	Thu Apr  6 01:00:52 2000
+++ swap-entry-1/mm/swap_state.c	Fri Apr  7 12:29:00 2000
@@ -126,9 +126,14 @@
 		UnlockPage(page);
 	}
 
-	ClearPageSwapEntry(page);
-
-	__free_page(page);
+	/*
+	 * Only the last unmap have to lose the swap entry
+	 * information that we have cached into page->index.
+	 */
+	if (put_page_testzero(page)) {
+		page->flags &= ~(1UL << PG_swap_entry);
+		__free_pages_ok(page, 0);
+	}
 }
 
 
@@ -151,7 +156,7 @@
  * Right now the pagecache is 32-bit only. But it's a 32 bit index. =)
 */
 repeat:
-	found = find_lock_page(&swapper_space, entry.val);
+	found = find_get_page(&swapper_space, entry.val);
 	if (!found)
 		return 0;
 	/*
@@ -163,7 +168,6 @@
 	 * is enough to check whether the page is still in the scache.
 	 */
 	if (!PageSwapCache(found)) {
-		UnlockPage(found);
 		__free_page(found);
 		goto repeat;
 	}
@@ -172,13 +176,11 @@
 #ifdef SWAP_CACHE_INFO
 	swap_cache_find_success++;
 #endif
-	UnlockPage(found);
 	return found;
 }
 
 out_bad:
 	printk (KERN_ERR "VM: Found a non-swapper swap page!\n");
-	UnlockPage(found);
 	__free_page(found);
 	return 0;
 }
diff -urN ref/mm/swapfile.c swap-entry-1/mm/swapfile.c
--- ref/mm/swapfile.c	Thu Apr  6 01:00:52 2000
+++ swap-entry-1/mm/swapfile.c	Fri Apr  7 12:35:59 2000
@@ -212,22 +212,22 @@
 
 	/* We have the old entry in the page offset still */
 	if (!page->index)
-		goto new_swap_entry;
+		goto null_swap_entry;
 	entry.val = page->index;
 	type = SWP_TYPE(entry);
 	if (type >= nr_swapfiles)
-		goto new_swap_entry;
+		goto bad_nofile;
+	swap_list_lock();
 	p = type + swap_info;
 	if ((p->flags & SWP_WRITEOK) != SWP_WRITEOK)
-		goto new_swap_entry;
+		goto unlock_list;
 	offset = SWP_OFFSET(entry);
 	if (offset >= p->max)
-		goto new_swap_entry;
+		goto bad_offset;
 	/* Has it been re-used for something else? */
-	swap_list_lock();
 	swap_device_lock(p);
 	if (p->swap_map[offset])
-		goto unlock_new_swap_entry;
+		goto unlock;
 
 	/* We're cool, we can just use the old one */
 	p->swap_map[offset] = 1;
@@ -236,11 +236,24 @@
 	swap_list_unlock();
 	return entry;
 
-unlock_new_swap_entry:
+unlock:
 	swap_device_unlock(p);
+unlock_list:
 	swap_list_unlock();
+clear_swap_entry:
+	ClearPageSwapEntry(page);
 new_swap_entry:
 	return get_swap_page();
+
+null_swap_entry:
+	printk(KERN_WARNING __FUNCTION__ " null swap entry\n");
+	goto clear_swap_entry;
+bad_nofile:
+	printk(KERN_WARNING __FUNCTION__ " nonexistent swap file\n");
+	goto clear_swap_entry;
+bad_offset:
+	printk(KERN_WARNING __FUNCTION__ " bad offset\n");
+	goto unlock_list;
 }
 
 /*
@@ -263,8 +276,11 @@
 	/* If this entry is swap-cached, then page must already
            hold the right address for any copies in physical
            memory */
-	if (pte_page(pte) != page)
+	if (pte_page(pte) != page) {
+		if (page->index == entry.val)
+			ClearPageSwapEntry(page);
 		return;
+	}
 	/* We will be removing the swap cache in a moment, so... */
 	set_pte(dir, pte_mkdirty(pte));
 	return;
@@ -358,10 +374,20 @@
 	 */
 	if (!mm)
 		return;
+	/*
+	 * Avoid the vmas to go away from under us
+	 * and also avoids the task to play with
+	 * pagetables while we're running. If the
+	 * vmlist_modify_lock wouldn't acquire the
+	 * mm->page_table_lock spinlock we should
+	 * acquire it by hand.
+	 */
+	vmlist_access_lock(mm);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		pgd_t * pgd = pgd_offset(mm, vma->vm_start);
 		unuse_vma(vma, pgd, entry, page);
 	}
+	vmlist_access_unlock(mm);
 	return;
 }
 
@@ -418,8 +444,10 @@
 		shm_unuse(entry, page);
 	/* Now get rid of the extra reference to the temporary
           page we've been using. */
-	if (PageSwapCache(page))
+	if (PageSwapCache(page)) {
 		delete_from_swap_cache(page);
+		ClearPageSwapEntry(page);
+	}
 	__free_page(page);
 	/*
 	 * Check for and clear any overflowed swap map counts.
@@ -488,8 +516,8 @@
 		swap_list.next = swap_list.head;
 	}
 	nr_swap_pages -= p->pages;
-	swap_list_unlock();
 	p->flags = SWP_USED;
+	swap_list_unlock();
 	err = try_to_unuse(type);
 	if (err) {
 		/* re-insert swap space back into swap_list */

Andrea

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Rik van Riel @ 2000-04-07 11:29 UTC
To: Andrea Arcangeli; +Cc: Ben LaHaise, Linus Torvalds, linux-mm

On Fri, 7 Apr 2000, Andrea Arcangeli wrote:

> 			spin_unlock(&pagecache_lock);
> 			__delete_from_swap_cache(page);
> +			/* the page is local to us now */
> +			page->flags &= ~(1UL << PG_swap_entry);
> 			goto made_inode_progress;
> 		}

Please use the clear_bit() macro for this, the code is
unreadable enough in its current state...

cheers,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-07 12:00 UTC
To: riel; +Cc: Ben LaHaise, Linus Torvalds, linux-mm

On Fri, 7 Apr 2000, Rik van Riel wrote:

>Please use the clear_bit() macro for this, the code is
>unreadable enough in its current state...

I didn't use the ClearPageSwapEntry macro to avoid executing locked asm
instructions where they're not necessary.

Andrea

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Rik van Riel @ 2000-04-07 12:54 UTC
To: Andrea Arcangeli; +Cc: Ben LaHaise, Linus Torvalds, linux-mm

On Fri, 7 Apr 2000, Andrea Arcangeli wrote:
> On Fri, 7 Apr 2000, Rik van Riel wrote:
>
> >Please use the clear_bit() macro for this, the code is
> >unreadable enough in its current state...
>
> I didn't use the ClearPageSwapEntry macro to avoid executing
> locked asm instructions where they're not necessary.

Hmmmmm... Won't this screw up when another processor is atomically
setting the bit just after we removed it and we still have it in the
store queue?

from include/asm-i386/spinlock.h

/*
 * Sadly, some early PPro chips require the locked access,
 * otherwise we could just always simply do
 *
 *	#define spin_unlock_string \
 *		"movb $0,%0"
 *
 * Which is noticeably faster.
 */

I don't know if it is relevant here, but would like to be sure ...

cheers,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-07 13:14 UTC
To: riel; +Cc: Ben LaHaise, Linus Torvalds, linux-mm

On Fri, 7 Apr 2000, Rik van Riel wrote:

>Won't this screw up when another processor is atomically
>setting the bit just after we removed it and we still have
>it in the store queue?
>
>from include/asm-i386/spinlock.h
>/*
> * Sadly, some early PPro chips require the locked access,
> * otherwise we could just always simply do
> *
> *	#define spin_unlock_string \
> *		"movb $0,%0"
> *
> * Which is noticeably faster.
> */
>
>I don't know if it is relevant here, but would like to
>be sure ...

The spin_unlock case is actually not relevant; I wasn't relying on it
in the first place, since I was using C (which can implement the
read/modify/write in multiple instructions, playing with registers).

The reason we can use plain C before putting the page onto the freelist
is that we know we don't risk racing with other processors. We are
putting the page onto the freelist, and if another processor were still
playing with the page in some way, we couldn't put it on the freelist
in the first place. If some other processor/task were referencing the
page while we call free_pages_ok, that would be a major MM bug.

Andrea

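Both forms appear in the patches above; as a sketch, the distinction
Andrea is relying on is (i386-centric, the generated asm is
illustrative):

	/* Atomic: on SMP x86 this becomes a lock-prefixed btr, safe
	 * against concurrent flag updates from other CPUs, at the
	 * cost of a locked bus cycle. */
	clear_bit(PG_swap_entry, &page->flags);

	/* Plain C read-modify-write, which the compiler may split
	 * into several instructions through a register. Legitimate
	 * only when no other CPU can reach the page -- e.g.
	 * immediately before handing it back to the page allocator. */
	page->flags &= ~(1UL << PG_swap_entry);
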
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Kanoj Sarcar @ 2000-04-07 20:12 UTC
To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

Andrea,

The swaplist locking changes in swapfile.c in your patch are okay, but
they are unneeded when you consider that the kernel_lock is already
held in most of those paths. (Complete removal of the kernel_lock in
those paths is a little harder, at least when I last tried it.)

A bigger problem might be that you are violating lock ordering when you
grab the vmlist_lock from inside code that already has the
tasklist_lock in read mode (your change in unuse_process()). I may be
wrong, so you should try stress testing with swap-device removal under
a large number of runnable processes.

Also, did you have a good reason to want to make lookup_swap_cache()
invoke find_get_page(), and not find_lock_page()? I coded some of the
MP race fixes with the swap cache; some of the logic is in
Documentation/vm/locking. I remember some intense reasoning and
thinking about obscure conditions, so I am just cautious about any
locking changes.

Kanoj

[snip: the swap_state.c and swapfile.c parts of Andrea's patch, quoted
in full]

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-07 23:26 UTC
To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

On Fri, 7 Apr 2000, Kanoj Sarcar wrote:

>[..] you should try stress
>testing with swap-device removal under a large number of runnable
>processes.[..]

Swap-device removal during swapin activity is broken right now, as far
as I can see. I'm trying to fix that stuff right now.

>Also, did you have a good reason to want to make lookup_swap_cache()
>invoke find_get_page(), and not find_lock_page()? I coded some of the

Using find_lock_page and then unlocking the page is meaningless. If you
are going to unconditionally unlock the page, then you shouldn't lock
it in the first place.

Andrea

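A side-by-side sketch of the two lookup styles being argued about (both
functions exist in this kernel; error paths elided):

	/* Before the patch: take the page lock, only to drop it again
	 * on every exit path of lookup_swap_cache(). */
	found = find_lock_page(&swapper_space, entry.val);
	if (found)
		UnlockPage(found);

	/* After the patch: just take a reference. Whether the window
	 * in which the lock was held also guarded against a racing
	 * swap-cache deletion is exactly what gets questioned below. */
	found = find_get_page(&swapper_space, entry.val);
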
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Kanoj Sarcar @ 2000-04-08 0:11 UTC
To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

> On Fri, 7 Apr 2000, Kanoj Sarcar wrote:
>
> >[..] you should try stress
> >testing with swap-device removal under a large number of runnable
> >processes.[..]
>
> Swap-device removal during swapin activity is broken right now, as far
> as I can see. I'm trying to fix that stuff right now.

Be aware that I already have a patch for this. I have been meaning to
clean it up against the latest 2.3 and submit it to Linus ... FWIW, it
has been broken since 2.2.

> >Also, did you have a good reason to want to make lookup_swap_cache()
> >invoke find_get_page(), and not find_lock_page()? I coded some of the
>
> Using find_lock_page and then unlocking the page is meaningless. If you
> are going to unconditionally unlock the page, then you shouldn't lock
> it in the first place.

I will have to think a little bit about why the code does what it does
currently. I will let you know ...

Kanoj

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Kanoj Sarcar @ 2000-04-08 0:37 UTC
To: Kanoj Sarcar; +Cc: Andrea Arcangeli, Ben LaHaise, riel, Linus Torvalds, linux-mm

> > >Also, did you have a good reason to want to make lookup_swap_cache()
> > >invoke find_get_page(), and not find_lock_page()? I coded some of the
> >
> > Using find_lock_page and then unlocking the page is meaningless. If
> > you are going to unconditionally unlock the page, then you shouldn't
> > lock it in the first place.
>
> I will have to think a little bit about why the code does what it does
> currently. I will let you know ...

Okay, I think I found at least one reason why the page lock was being
taken in lookup_swap_cache(). It was effectively to check the
PageSwapCache bit, since shrink_mmap:__delete_from_swap_cache could
race with a lookup_swap_cache.

Yes, I did notice the recent shrink_mmap SMP race fixes that you
posted; now it _*might*_ be unnecessary to do a find_lock_page() in
lookup_swap_cache() (just for this race). I will have to look at the
latest prepatch to confirm that.

In any case, there are still races between swap-space deletion and
swap-cache lookup, so unless you have a good reason to replace the
find_lock_page in lookup_swap_cache() with find_get_page, I would much
rather see it stay the way it is currently ...

Kanoj

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-08 13:20 UTC
To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

On Fri, 7 Apr 2000, Kanoj Sarcar wrote:

>Okay, I think I found at least one reason why the page lock was being
>taken in lookup_swap_cache(). It was effectively to check the
>PageSwapCache bit, since shrink_mmap:__delete_from_swap_cache could
>race with a lookup_swap_cache.

shrink_mmap can't race with a find_get_page. find_get_page increments
the reference count within the critical section, and shrink_mmap checks
the page count and drops the page in one whole transaction within a
mutually exclusive critical section.

>Yes, I did notice the recent shrink_mmap SMP race fixes that you posted,

They weren't related to the cache, but only to the LRU list
insertion/deletion. There weren't races between shrink_mmap and
find_get_page and friends.

>now it _*might*_ be unnecessary to do a find_lock_page() in
>lookup_swap_cache() (just for this race). I will have to look at the

It isn't. Checking PageSwapCache while the page is locked is not
necessary. The only thing which can drop the page from the swap cache
is swapoff, which will do that as soon as you unlock before returning
from lookup_swap_cache anyway.

Andrea

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Kanoj Sarcar @ 2000-04-08 21:39 UTC
To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

As I said before, unless you have a _good_ reason, I don't see any
point in changing the code just because it appears cleaner. As you
note, there are ample races with the swap-deletion code removing the
page from the swapcache while a lookup is in progress; the PageLock
_might_ be used to fix those. In any case, after we agree that the
races have been fixed (and that's theoretically; there will probably be
races observed under real stress), it would be okay to then go and
change the find_lock_page to find_get_page in lookup_swap_cache().
IMO, of course ...

> On Fri, 7 Apr 2000, Kanoj Sarcar wrote:
>
> >Okay, I think I found at least one reason why the page lock was being
> >taken in lookup_swap_cache(). It was effectively to check the
> >PageSwapCache bit, since shrink_mmap:__delete_from_swap_cache could
> >race with a lookup_swap_cache.
>
> shrink_mmap can't race with a find_get_page. find_get_page increments
> the reference count within the critical section, and shrink_mmap
> checks the page count and drops the page in one whole transaction
> within a mutually exclusive critical section.

Okay, how's this (from pre3):

shrink_mmap
--------------                          __find_get_page
get pagemap_lru_lock                    ----------------
LockPage
drop pagemap_lru_lock
Fail if page_count(page) > 1
get pagecache_lock
get_page
Fail if page_count(page) != 2
if PageSwapCache, drop pagecache_lock   get pagecache_lock
                                        Finds page in swapcache,
                                        does get_page
                                        drop pagecache_lock
and __delete_from_swap_cache,
which releases PageLock.
                                        LockPage succeeds,
                                        erroneously believes he
                                        has swapcache page.

Did I miss some interlocking step that would prevent this from
happening?

> >Yes, I did notice the recent shrink_mmap SMP race fixes that you
> >posted,
>
> They weren't related to the cache, but only to the LRU list
> insertion/deletion. There weren't races between shrink_mmap and
> find_get_page and friends.

Ok, so that's irrelevant ...

> >now it _*might*_ be unnecessary to do a find_lock_page() in
> >lookup_swap_cache() (just for this race). I will have to look at the
>
> It isn't. Checking PageSwapCache while the page is locked is not
> necessary. The only thing which can drop the page from the swap cache
> is swapoff, which will do that as soon as you unlock before returning
> from lookup_swap_cache anyway.

Yes, see above too. It's probably better to have overenthusiastic
locking than too little. As I mentioned before, the shrink_mmap race is
probably one of the reasons I did the PageLocking in
lookup_swap_cache(); I was probably thinking of swap deletion too ...

Kanoj

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-08 23:02 UTC
To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

On Sat, 8 Apr 2000, Kanoj Sarcar wrote:

>shrink_mmap
>--------------                          __find_get_page
>get pagemap_lru_lock                    ----------------
>LockPage
>drop pagemap_lru_lock
>Fail if page_count(page) > 1
>get pagecache_lock
>get_page
>Fail if page_count(page) != 2
>if PageSwapCache, drop pagecache_lock   get pagecache_lock
>                                        Finds page in swapcache,
>                                        does get_page
>                                        drop pagecache_lock
>and __delete_from_swap_cache,
>which releases PageLock.
>                                        LockPage succeeds,
>                                        erroneously believes he
>                                        has swapcache page.
>
>Did I miss some interlocking step that would prevent this from
>happening?

Oh, very good point indeed, I don't think you are missing anything.
Thanks for showing me that!

It seems to me the only reason we were dropping the lock earlier for
the swap cache was to be able to use remove_inode_page and so avoid
having to export a secondary remove_inode_page that doesn't grab the
page_cache_lock. It looks like the only reason was an implementation
issue.

So IMHO it would be nicer to change the locking in shrink_mmap()
instead of putting the page-cache check in the swap cache lookup fast
path. Swap cache and page cache share the same locking rules w.r.t. the
hashtable. That was the only exception as far as I can tell, and
removing it would give us a cleaner design IMHO.

What do you think about something like this?

diff -urN swap-entry-2/include/linux/mm.h swap-entry-3/include/linux/mm.h
--- swap-entry-2/include/linux/mm.h	Sat Apr  8 19:16:28 2000
+++ swap-entry-3/include/linux/mm.h	Sun Apr  9 00:18:43 2000
@@ -449,6 +449,7 @@
 struct zone_t;
 /* filemap.c */
 extern void remove_inode_page(struct page *);
+extern void __remove_inode_page(struct page *);
 extern unsigned long page_unuse(struct page *);
 extern int shrink_mmap(int, int, zone_t *);
 extern void truncate_inode_pages(struct address_space *, loff_t);
diff -urN swap-entry-2/include/linux/swap.h swap-entry-3/include/linux/swap.h
--- swap-entry-2/include/linux/swap.h	Sat Apr  8 18:08:37 2000
+++ swap-entry-3/include/linux/swap.h	Sun Apr  9 00:47:42 2000
@@ -105,6 +105,7 @@
 /*
  * Make these inline later once they are working properly.
  */
+extern void shrink_swap_cache(struct page *);
 extern void unlink_from_swap_cache(struct page *);
 extern void __delete_from_swap_cache(struct page *page);
 extern void delete_from_swap_cache(struct page *page);
diff -urN swap-entry-2/mm/filemap.c swap-entry-3/mm/filemap.c
--- swap-entry-2/mm/filemap.c	Sat Apr  8 04:46:04 2000
+++ swap-entry-3/mm/filemap.c	Sun Apr  9 00:39:23 2000
@@ -77,6 +77,13 @@
 	atomic_dec(&page_cache_size);
 }
 
+inline void __remove_inode_page(struct page *page)
+{
+	remove_page_from_inode_queue(page);
+	remove_page_from_hash_queue(page);
+	page->mapping = NULL;
+}
+
 /*
  * Remove a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
@@ -88,9 +95,7 @@
 		PAGE_BUG(page);
 
 	spin_lock(&pagecache_lock);
-	remove_page_from_inode_queue(page);
-	remove_page_from_hash_queue(page);
-	page->mapping = NULL;
+	__remove_inode_page(page);
 	spin_unlock(&pagecache_lock);
 }
 
@@ -298,8 +303,8 @@
 		 * were to be marked referenced..
 		 */
 		if (PageSwapCache(page)) {
+			shrink_swap_cache(page);
 			spin_unlock(&pagecache_lock);
-			__delete_from_swap_cache(page);
 			/* the page is local to us now */
 			page->flags &= ~(1UL << PG_swap_entry);
 			goto made_inode_progress;
diff -urN swap-entry-2/mm/swap_state.c swap-entry-3/mm/swap_state.c
--- swap-entry-2/mm/swap_state.c	Sat Apr  8 17:29:46 2000
+++ swap-entry-3/mm/swap_state.c	Sun Apr  9 00:39:17 2000
@@ -55,6 +55,19 @@
 	return ret;
 }
 
+static inline void __remove_from_swap_cache(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+
+	if (mapping != &swapper_space)
+		BUG();
+	if (!PageSwapCache(page) || !PageLocked(page))
+		PAGE_BUG(page);
+
+	PageClearSwapCache(page);
+	__remove_inode_page(page);
+}
+
 static inline void remove_from_swap_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
@@ -76,6 +89,20 @@
 	lru_cache_del(page);
 	remove_from_swap_cache(page);
 	__free_page(page);
+}
+
+/* called by shrink_mmap() with the page_cache_lock held */
+void shrink_swap_cache(struct page *page)
+{
+	swp_entry_t entry;
+
+	entry.val = page->index;
+
+#ifdef SWAP_CACHE_INFO
+	swap_cache_del_total++;
+#endif
+	__remove_from_swap_cache(page);
+	swap_free(entry);
 }
 
 /*

The other option is to keep the checks in the lookup swap cache fast
path.

Comments?

Andrea

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Kanoj Sarcar @ 2000-04-08 23:18 UTC
To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

> On Sat, 8 Apr 2000, Kanoj Sarcar wrote:
> [snip: the shrink_mmap / __find_get_page race example]
>
> Oh, very good point indeed, I don't think you are missing anything.
> Thanks for showing me that!
>
> It seems to me the only reason we were dropping the lock earlier for
> the swap cache was to be able to use remove_inode_page and so avoid
> having to export a secondary remove_inode_page that doesn't grab the
> page_cache_lock. It looks like the only reason was an implementation
> issue.

Agreed.

> So IMHO it would be nicer to change the locking in shrink_mmap()
> instead of putting the page-cache check in the swap cache lookup fast
> path. Swap cache and page cache share the same locking rules w.r.t.
> the hashtable. That was the only exception as far as I can tell, and
> removing it would give us a cleaner design IMHO.

Yes, it's easy enough to do; I am sure your attached patch does it
properly.

Notwithstanding, I will repeat this again. Just because you change how
shrink_mmap deletes swapcache pages does not make it a good idea to
change lookup_swap_cache(). Let's first fix the swap cache deletion
races; then you can change lookup_swap_cache() not to lock the page ...
which is trivial anyway, once we think we have all the holes nailed.

I am off to polishing my previous swapoff patch. Andrea, it might make
it easier to understand your previous patch if you were to send out a
list of the specific races it is fixing. (Something like my race
example above.)

Kanoj

> What do you think about something like this?

[snip: Andrea's shrink_swap_cache patch, quoted in full]

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 23:18 ` Kanoj Sarcar @ 2000-04-08 23:58 ` Andrea Arcangeli 0 siblings, 0 replies; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-08 23:58 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Sat, 8 Apr 2000, Kanoj Sarcar wrote: >[..] if you were to send out a list of >the specific races it is fixing. (Something like my above race example.) Ugh, writing the traces for all the possible races would take me a really long time. Writing traces is fine for showing _the_ buggy path, but it doesn't seem the right approach for explaining and understanding how some new code works. The comment should explain how the new locking works against swapoff. I'd very much prefer specific questions and comments. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 0:11 ` Kanoj Sarcar 2000-04-08 0:37 ` Kanoj Sarcar @ 2000-04-08 13:30 ` Andrea Arcangeli 2000-04-08 17:39 ` Andrea Arcangeli 1 sibling, 1 reply; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-08 13:30 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Fri, 7 Apr 2000, Kanoj Sarcar wrote: >Be aware that I already have a patch for this. I have been meaning to I've a patch for this too now. Are you using read_swap_cache from any swapin event? The problem is swapin can't use read_swap_cache, because with read_swap_cache we would never know if we're doing I/O on an inactive swap entry. Only swapoff can use read_swap_cache. My current tree does this: it uses the swap cache as the locking entity to serialize with unuse_process, plus checks on the pte with the page cache lock acquired, to know whether lookup_swap_cache (or the swap cache miss path) returned a swap cache page or not (if not, we have to give up without changing the pte, since swapoff just resolved the page fault from under us). Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
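The pte check Andrea describes boils down to a small pattern, which his patch in the follow-up spells as the pte_changed() macro. Roughly (a sketch using the names from that patch):

        spin_lock(&mm->page_table_lock);
        if (pte_val(*page_table) != pte_val(swp_entry_to_pte(entry))) {
                /* swapoff (or a parallel fault) already serviced this
                 * entry: back out without touching the pte, undoing
                 * our own swap cache insertion */
                spin_unlock(&mm->page_table_lock);
                unlink_from_swap_cache(page);
                UnlockPage(page);
                __free_page(page);
                return 1;
        }
        /* still ours: safe to swap_duplicate() and map the page */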
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 13:30 ` Andrea Arcangeli @ 2000-04-08 17:39 ` Andrea Arcangeli 0 siblings, 0 replies; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-08 17:39 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Sat, 8 Apr 2000, Andrea Arcangeli wrote: >I've a patch for this too now. Are you using read_swap_cache from any Here is my approach. I tried to explain the subtle thoughts in the comments. It's against 2.3.99-pre4-4 + the swap-entry fix previously posted on the list in the mail with Message-ID: <Pine.LNX.4.21.0004071205300.737-100000@alpha.random>. It seems stable here. The only minor weird thing I noticed so far after the swapoff+swapin stress testing is this: [..] Swap cache: add 107029, delete 105331, find 34046/62876 [..] 0 pages swap cached [..] but that's explained by the fact that the stat information isn't increased with atomic_inc() (if we really want exact stat info we should split each counter into per-CPU counters and sum all the per-CPU counters to get the info; see the sketch after this mail) diff -urN swap-entry-1/include/linux/pagemap.h swap-entry-2/include/linux/pagemap.h --- swap-entry-1/include/linux/pagemap.h Fri Apr 7 18:16:10 2000 +++ swap-entry-2/include/linux/pagemap.h Sat Apr 8 19:16:28 2000 @@ -80,6 +80,9 @@ extern void __add_page_to_hash_queue(struct page * page, struct page **p); extern void add_to_page_cache(struct page * page, struct address_space *mapping, unsigned long index); +extern int __add_to_page_cache_unique(struct page *, struct address_space *, unsigned long, struct page **); +#define add_to_page_cache_unique(page, mapping, index) \ + __add_to_page_cache_unique(page, mapping, index, page_hash(mapping, index)) extern inline void add_page_to_hash_queue(struct page * page, struct inode * inode, unsigned long index) { diff -urN swap-entry-1/include/linux/swap.h swap-entry-2/include/linux/swap.h --- swap-entry-1/include/linux/swap.h Fri Apr 7 02:00:28 2000 +++ swap-entry-2/include/linux/swap.h Sat Apr 8 18:08:37 2000 @@ -95,15 +95,17 @@ /* linux/mm/swap_state.c */ extern void show_swap_cache_info(void); -extern void add_to_swap_cache(struct page *, swp_entry_t); +extern int add_to_swap_cache_unique(struct page *, swp_entry_t); extern int swap_check_entry(unsigned long); -extern struct page * lookup_swap_cache(swp_entry_t); +extern struct page * find_get_swap_cache(swp_entry_t); +extern struct page * find_lock_swap_cache(swp_entry_t); extern struct page * read_swap_cache_async(swp_entry_t, int); #define read_swap_cache(entry) read_swap_cache_async(entry, 1); /* * Make these inline later once they are working properly. */ +extern void unlink_from_swap_cache(struct page *); extern void __delete_from_swap_cache(struct page *page); extern void delete_from_swap_cache(struct page *page); extern void delete_from_swap_cache_nolock(struct page *page); diff -urN swap-entry-1/ipc/shm.c swap-entry-2/ipc/shm.c --- swap-entry-1/ipc/shm.c Fri Apr 7 18:11:37 2000 +++ swap-entry-2/ipc/shm.c Sat Apr 8 04:15:20 2000 @@ -1334,7 +1334,7 @@ swp_entry_t entry = pte_to_swp_entry(pte); shm_unlock(shp->id); - page = lookup_swap_cache(entry); + page = find_get_swap_cache(entry); if (!page) { lock_kernel(); swapin_readahead(entry); @@ -1416,7 +1416,8 @@ reading a not yet uptodate block from disk. NOTE: we just accounted the swap space reference for this swap cache page at __get_swap_page() time.
*/ - add_to_swap_cache(*outpage = page_map, swap_entry); + if (add_to_swap_cache_unique(*outpage = page_map, swap_entry)) + BUG(); return OKAY; } diff -urN swap-entry-1/mm/filemap.c swap-entry-2/mm/filemap.c --- swap-entry-1/mm/filemap.c Fri Apr 7 18:27:22 2000 +++ swap-entry-2/mm/filemap.c Sat Apr 8 04:46:04 2000 @@ -488,7 +488,7 @@ spin_unlock(&pagecache_lock); } -static int add_to_page_cache_unique(struct page * page, +int __add_to_page_cache_unique(struct page * page, struct address_space *mapping, unsigned long offset, struct page **hash) { @@ -529,7 +529,7 @@ if (!page) return -ENOMEM; - if (!add_to_page_cache_unique(page, mapping, offset, hash)) { + if (!__add_to_page_cache_unique(page, mapping, offset, hash)) { int error = mapping->a_ops->readpage(file->f_dentry, page); page_cache_release(page); return error; @@ -2291,7 +2291,7 @@ return ERR_PTR(-ENOMEM); } page = cached_page; - if (add_to_page_cache_unique(page, mapping, index, hash)) + if (__add_to_page_cache_unique(page, mapping, index, hash)) goto repeat; cached_page = NULL; err = filler(data, page); @@ -2318,7 +2318,7 @@ return NULL; } page = *cached_page; - if (add_to_page_cache_unique(page, mapping, index, hash)) + if (__add_to_page_cache_unique(page, mapping, index, hash)) goto repeat; *cached_page = NULL; } diff -urN swap-entry-1/mm/memory.c swap-entry-2/mm/memory.c --- swap-entry-1/mm/memory.c Fri Apr 7 18:27:22 2000 +++ swap-entry-2/mm/memory.c Sat Apr 8 19:08:16 2000 @@ -217,7 +217,8 @@ if (pte_none(pte)) goto cont_copy_pte_range; if (!pte_present(pte)) { - swap_duplicate(pte_to_swp_entry(pte)); + if (!swap_duplicate(pte_to_swp_entry(pte))) + BUG(); set_pte(dst_pte, pte); goto cont_copy_pte_range; } @@ -1019,47 +1020,120 @@ void swapin_readahead(swp_entry_t entry) { int i, num; - struct page *new_page; + struct page * page = NULL; unsigned long offset; + swp_entry_t __entry; /* * Get the number of handles we should do readahead io to. Also, * grab temporary references on them, releasing them as io completes. + * + * At this point we only know the swap device can't go away from under + * us because of the caller locking. + * + * Ugly: we're serializing with swapoff using the big kernel lock. 
*/ num = valid_swaphandles(entry, &offset); for (i = 0; i < num; offset++, i++) { /* Don't block on I/O for read-ahead */ - if (atomic_read(&nr_async_pages) >= pager_daemon.swap_cluster) { - while (i++ < num) - swap_free(SWP_ENTRY(SWP_TYPE(entry), offset++)); + if (atomic_read(&nr_async_pages) >= pager_daemon.swap_cluster) break; - } /* Ok, do the async read-ahead now */ - new_page = read_swap_cache_async(SWP_ENTRY(SWP_TYPE(entry), offset), 0); - if (new_page != NULL) - __free_page(new_page); - swap_free(SWP_ENTRY(SWP_TYPE(entry), offset)); + if (!page) { + page = alloc_page(GFP_USER); + if (!page) + return; + } + __entry = SWP_ENTRY(SWP_TYPE(entry), offset); + if (add_to_swap_cache_unique(page, __entry)) + continue; + if (!swap_duplicate(__entry)) { + swap_free(__entry); + unlink_from_swap_cache(page); + UnlockPage(page); + continue; + } + rw_swap_page(READ, page, 0); + __free_page(page); + page = NULL; } + if (page) + __free_page(page); return; } +#define pte_changed(page_table, entry) \ + (pte_val(*page_table) != pte_val(swp_entry_to_pte(entry))) + +/* This is lock-land */ static int do_swap_page(struct task_struct * tsk, struct vm_area_struct * vma, unsigned long address, pte_t * page_table, swp_entry_t entry, int write_access) { - struct page *page = lookup_swap_cache(entry); + struct page *page; pte_t pte; + spinlock_t * page_table_lock = &vma->vm_mm->page_table_lock; + /* + * find_lock_swap_cache() can return a non swap cache page + * (because find_lock_page() acquires the lock after + * dropping the page_cache_lock). + * We handle the coherency with unmap_process later + * while checking if the pagetable is changed from + * under us. If the pagetable isn't changed from + * under us then `page' is a swap cache page. + */ + repeat: + page = find_lock_swap_cache(entry); if (!page) { + page = alloc_page(GFP_USER); + if (!page) + return -1; + + if (add_to_swap_cache_unique(page, entry)) { + __free_page(page); + goto repeat; + } + + spin_lock(page_table_lock); + /* + * If the pte is changed and we added the page to + * the swap cache successfully it means the entry + * is gone away and also the swap device is + * potentially gone away. + */ + if (pte_changed(page_table, entry)) + goto unlink; + spin_unlock(page_table_lock); + + /* + * Account the swap cache reference on the swap + * side. We have the swap entry locked via + * swap cache locking protocol described below. + * If the entry gone away it means something + * gone badly wrong... + */ + if (!swap_duplicate(entry)) + BUG(); + + /* + * At this point we know unuse_process() have + * not yet processed this pte and we also hold + * the lock on the page so unuse_process() will + * wait for us to finish the I/O. This way we are + * sure to do I/O from a still SWP_USED swap device + * and that the swap device won't go away while + * we're waiting I/O completation. + */ lock_kernel(); swapin_readahead(entry); - page = read_swap_cache(entry); + rw_swap_page(READ, page, 1); unlock_kernel(); - if (!page) - return -1; flush_page_to_ram(page); flush_icache_page(vma, page); + + lock_page(page); } vma->vm_mm->rss++; @@ -1067,6 +1141,10 @@ pte = mk_pte(page, vma->vm_page_prot); + spin_lock(page_table_lock); + if (pte_changed(page_table, entry)) + goto unlock; + SetPageSwapEntry(page); /* @@ -1074,7 +1152,6 @@ * Must lock page before transferring our swap count to already * obtained page count. 
*/ - lock_page(page); swap_free(entry); if (write_access && !is_page_shared(page)) { delete_from_swap_cache_nolock(page); @@ -1086,10 +1163,31 @@ UnlockPage(page); set_pte(page_table, pte); + spin_unlock(page_table_lock); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, address, pte); return 1; + + unlink: + spin_unlock(page_table_lock); + unlink_from_swap_cache(page); + UnlockPage(page); + __free_page(page); + return 1; + + unlock: + /* + * If the page is still in the swap cache it + * will be swapoff that will remove it from there + * later. + */ + spin_unlock(page_table_lock); + UnlockPage(page); + __free_page(page); + return 1; } + +#undef pte_changed /* * This only needs the MM semaphore diff -urN swap-entry-1/mm/page_io.c swap-entry-2/mm/page_io.c --- swap-entry-1/mm/page_io.c Wed Dec 8 00:05:28 1999 +++ swap-entry-2/mm/page_io.c Sat Apr 8 02:49:51 2000 @@ -128,8 +128,9 @@ { struct page *page = mem_map + MAP_NR(buf); - if (!PageLocked(page)) + if (PageLocked(page)) PAGE_BUG(page); + lock_page(page); if (PageSwapCache(page)) PAGE_BUG(page); if (!rw_swap_page_base(rw, entry, page, wait)) diff -urN swap-entry-1/mm/swap_state.c swap-entry-2/mm/swap_state.c --- swap-entry-1/mm/swap_state.c Fri Apr 7 18:27:22 2000 +++ swap-entry-2/mm/swap_state.c Sat Apr 8 17:29:46 2000 @@ -40,16 +40,19 @@ } #endif -void add_to_swap_cache(struct page *page, swp_entry_t entry) +int add_to_swap_cache_unique(struct page * page, swp_entry_t entry) { + int ret; + #ifdef SWAP_CACHE_INFO swap_cache_add_total++; #endif - if (PageTestandSetSwapCache(page)) - BUG(); if (page->mapping) BUG(); - add_to_page_cache(page, &swapper_space, entry.val); + ret = add_to_page_cache_unique(page, &swapper_space, entry.val); + if (!ret && PageTestandSetSwapCache(page)) + BUG(); + return ret; } static inline void remove_from_swap_cache(struct page *page) @@ -65,6 +68,16 @@ remove_inode_page(page); } +void unlink_from_swap_cache(struct page * page) +{ +#ifdef SWAP_CACHE_INFO + swap_cache_del_total++; +#endif + lru_cache_del(page); + remove_from_swap_cache(page); + __free_page(page); +} + /* * This must be called only on pages that have * been verified to be in the swap cache. @@ -95,7 +108,7 @@ lru_cache_del(page); __delete_from_swap_cache(page); - page_cache_release(page); + __free_page(page); } /* @@ -144,45 +157,53 @@ * lock before returning. */ -struct page * lookup_swap_cache(swp_entry_t entry) +struct page * find_get_swap_cache(swp_entry_t entry) { struct page *found; #ifdef SWAP_CACHE_INFO swap_cache_find_total++; #endif - while (1) { - /* - * Right now the pagecache is 32-bit only. But it's a 32 bit index. =) - */ -repeat: - found = find_get_page(&swapper_space, entry.val); - if (!found) - return 0; - /* - * Though the "found" page was in the swap cache an instant - * earlier, it might have been removed by shrink_mmap etc. - * Re search ... Since find_lock_page grabs a reference on - * the page, it can not be reused for anything else, namely - * it can not be associated with another swaphandle, so it - * is enough to check whether the page is still in the scache. 
- */ - if (!PageSwapCache(found)) { - __free_page(found); - goto repeat; - } - if (found->mapping != &swapper_space) - goto out_bad; + + found = find_get_page(&swapper_space, entry.val); + if (!found) + return NULL; + if (found->mapping != &swapper_space) + goto out_bad; #ifdef SWAP_CACHE_INFO - swap_cache_find_success++; + swap_cache_find_success++; #endif - return found; - } + return found; out_bad: printk (KERN_ERR "VM: Found a non-swapper swap page!\n"); __free_page(found); - return 0; + return NULL; +} + +struct page * find_lock_swap_cache(swp_entry_t entry) +{ + struct page *found; + +#ifdef SWAP_CACHE_INFO + swap_cache_find_total++; +#endif + + found = find_lock_page(&swapper_space, entry.val); + if (!found) + return NULL; + if (found->mapping != &swapper_space) + goto out_bad; +#ifdef SWAP_CACHE_INFO + swap_cache_find_success++; +#endif + return found; + +out_bad: + printk (KERN_ERR "VM: Found a non-swapper locked swap page!\n"); + UnlockPage(found); + __free_page(found); + return NULL; } /* @@ -192,47 +213,40 @@ * * A failure return means that either the page allocation failed or that * the swap entry is no longer in use. + * WARNING: only swapoff can use this function. */ struct page * read_swap_cache_async(swp_entry_t entry, int wait) { struct page *found_page = 0, *new_page; - unsigned long new_page_addr; - - /* - * Make sure the swap entry is still in use. - */ - if (!swap_duplicate(entry)) /* Account for the swap cache */ - goto out; + /* * Look for the page in the swap cache. */ - found_page = lookup_swap_cache(entry); + found_page = find_get_swap_cache(entry); if (found_page) - goto out_free_swap; + goto out; - new_page_addr = __get_free_page(GFP_USER); - if (!new_page_addr) - goto out_free_swap; /* Out of memory */ - new_page = mem_map + MAP_NR(new_page_addr); + new_page = alloc_page(GFP_USER); + if (!new_page) + goto out; /* Out of memory */ - /* - * Check the swap cache again, in case we stalled above. - */ - found_page = lookup_swap_cache(entry); - if (found_page) - goto out_free_page; /* * Add it to the swap cache and read its contents. */ - add_to_swap_cache(new_page, entry); + while (add_to_swap_cache_unique(new_page, entry)) { + found_page = find_get_swap_cache(entry); + if (found_page) + goto out_free_page; + } + + swap_duplicate(entry); + rw_swap_page(READ, new_page, wait); return new_page; out_free_page: __free_page(new_page); -out_free_swap: - swap_free(entry); out: return found_page; } diff -urN swap-entry-1/mm/swapfile.c swap-entry-2/mm/swapfile.c --- swap-entry-1/mm/swapfile.c Fri Apr 7 18:27:22 2000 +++ swap-entry-2/mm/swapfile.c Sat Apr 8 19:02:20 2000 @@ -437,17 +437,38 @@ swap_free(entry); return -ENOMEM; } + + lock_page(page); + /* + * Only swapout can drop referenced pages from + * the swap cache. + */ + if (!PageSwapCache(page)) + BUG(); + /* + * Do a fast check to see if it's an orphaned swap cache + * entry to learn if we really need to slowly browse the ptes. + */ + if (!is_page_shared(page)) + goto orphaned; + UnlockPage(page); + read_lock(&tasklist_lock); for_each_task(p) unuse_process(p->mm, entry, page); read_unlock(&tasklist_lock); shm_unuse(entry, page); + + lock_page(page); /* Now get rid of the extra reference to the temporary page we've been using. 
*/ - if (PageSwapCache(page)) { - delete_from_swap_cache(page); - ClearPageSwapEntry(page); - } + if (!PageSwapCache(page)) + BUG(); + orphaned: + delete_from_swap_cache_nolock(page); + ClearPageSwapEntry(page); + + UnlockPage(page); __free_page(page); /* * Check for and clear any overflowed swap map counts. @@ -710,7 +731,6 @@ goto bad_swap; } - lock_page(mem_map + MAP_NR(swap_header)); rw_swap_page_nolock(READ, SWP_ENTRY(type,0), (char *) swap_header, 1); if (!memcmp("SWAP-SPACE",swap_header->magic.magic,10)) @@ -902,22 +922,19 @@ offset = SWP_OFFSET(entry); if (offset >= p->max) goto bad_offset; - if (!p->swap_map[offset]) - goto bad_unused; /* * Entry is valid, so increment the map count. */ swap_device_lock(p); if (p->swap_map[offset] < SWAP_MAP_MAX) - p->swap_map[offset]++; + result = p->swap_map[offset]++; else { static int overflow = 0; if (overflow++ < 5) printk("VM: swap entry overflow\n"); - p->swap_map[offset] = SWAP_MAP_MAX; + result = p->swap_map[offset] = SWAP_MAP_MAX; } swap_device_unlock(p); - result = 1; out: return result; @@ -927,9 +944,6 @@ bad_offset: printk("Bad swap offset entry %08lx\n", entry.val); goto out; -bad_unused: - printk("Unused swap offset entry in swap_dup %08lx\n", entry.val); - goto out; } /* @@ -1027,20 +1041,22 @@ *offset = SWP_OFFSET(entry); toff = *offset = (*offset >> page_cluster) << page_cluster; + if ((swapdev->flags & SWP_WRITEOK) != SWP_WRITEOK) + goto out; swap_device_lock(swapdev); do { /* Don't read-ahead past the end of the swap area */ if (toff >= swapdev->max) break; - /* Don't read in bad or busy pages */ + /* Don't read in bad or empty pages */ if (!swapdev->swap_map[toff]) break; if (swapdev->swap_map[toff] == SWAP_MAP_BAD) break; - swapdev->swap_map[toff]++; toff++; ret++; } while (--i); swap_device_unlock(swapdev); + out: return ret; } diff -urN swap-entry-1/mm/vmscan.c swap-entry-2/mm/vmscan.c --- swap-entry-1/mm/vmscan.c Thu Apr 6 01:00:52 2000 +++ swap-entry-2/mm/vmscan.c Sat Apr 8 17:38:10 2000 @@ -72,7 +72,8 @@ */ if (PageSwapCache(page)) { entry.val = page->index; - swap_duplicate(entry); + if (!swap_duplicate(entry)) + BUG(); set_pte(page_table, swp_entry_to_pte(entry)); drop_pte: vma->vm_mm->rss--; @@ -157,10 +158,12 @@ if (!(page = prepare_highmem_swapout(page))) goto out_swap_free; - swap_duplicate(entry); /* One for the process, one for the swap cache */ - + if (!swap_duplicate(entry)) /* One for the process, one for the swap cache */ + BUG(); /* This will also lock the page */ - add_to_swap_cache(page, entry); + if (add_to_swap_cache_unique(page, entry)) + BUG(); + /* Put the swap entry into the pte after the page is in swapcache */ vma->vm_mm->rss--; set_pte(page_table, swp_entry_to_pte(entry)); I have not looked into the shm_unuse part yet but I guess it needs fixing too (however shm compiles and it should keep working as before at least). Comments are welcome. Thanks. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
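Andrea's aside above about the inexact statistics — that exact counts would need per-CPU counters summed at read time — would look roughly like this (a hypothetical sketch; the names are illustrative and not from the patch, and in practice the slots would want cacheline padding to avoid false sharing):

static unsigned long swap_cache_add_count[NR_CPUS];

static inline void swap_cache_add_inc(void)
{
        /* no lock and no locked instruction: the slot is CPU-private */
        swap_cache_add_count[smp_processor_id()]++;
}

static unsigned long swap_cache_add_total(void)
{
        unsigned long sum = 0;
        int i;

        for (i = 0; i < NR_CPUS; i++)
                sum += swap_cache_add_count[i]; /* may be slightly stale */
        return sum;
}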
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-07 20:12 ` Kanoj Sarcar 2000-04-07 23:26 ` Andrea Arcangeli @ 2000-04-07 23:54 ` Andrea Arcangeli 2000-04-08 0:15 ` Kanoj Sarcar 2000-04-08 0:04 ` Andrea Arcangeli 2 siblings, 1 reply; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-07 23:54 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Fri, 7 Apr 2000, Kanoj Sarcar wrote: >[..] A bigger problem might >be that you are violating lock orders when you grab the vmlist_lock >from inside code that already has tasklist_lock in readmode [..] Conceptually it's the obviously right locking order. The mm exists in the context of a task struct. So first grabbing the tasklist lock, finding the task_struct, and then locking its mm before playing with it looks like the natural ordering of things and how things should be done. BTW, swap_out() always used the same locking order that I added to swapoff, so if my patch is wrong, swap_out() has always been wrong as well ;). I had a quick look and it seems nobody is going to harm swap_out and swapoff, but if somebody is using the inverse lock order I'd much prefer to fix that path, because the locking design of swapoff and swap_out looks like the obviously right one to me. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
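Spelled out, the ordering Andrea argues for is the one swap_out() already follows (a sketch with 2.3.x names):

        read_lock(&tasklist_lock);          /* 1: pin the task list */
        for_each_task(p) {
                struct mm_struct *mm = p->mm;

                vmlist_access_lock(mm);     /* 2: then the mm's vma/pte lock */
                /* ... browse the vmas and page tables, fix ptes ... */
                vmlist_access_unlock(mm);
        }
        read_unlock(&tasklist_lock);

A deadlock would need some path taking the mm lock first and the tasklist_lock inside it; the question raised below is whether such a path exists.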
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-07 23:54 ` Andrea Arcangeli @ 2000-04-08 0:15 ` Kanoj Sarcar 2000-04-08 13:14 ` Andrea Arcangeli 0 siblings, 1 reply; 47+ messages in thread From: Kanoj Sarcar @ 2000-04-08 0:15 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm > > On Fri, 7 Apr 2000, Kanoj Sarcar wrote: > > >[..] A bigger problem might > >be that you are violating lock orders when you grab the vmlist_lock > >from inside code that already has tasklist_lock in readmode [..] > > Conceptually it's the obviously right locking order. The mm exists in > the context of a task struct. So first grabbing the tasklist lock, finding > the task_struct, and then locking its mm before playing with it looks like > the natural ordering of things and how things should be done. > > BTW, swap_out() always used the same locking order that I added to swapoff, > so if my patch is wrong, swap_out() has always been wrong as well ;). Not sure what you mean ... swap_out never grabbed the mmap_sem/page_table_lock before (in 2.2 too). > > I had a quick look and it seems nobody is going to harm swap_out and > swapoff, but if somebody is using the inverse lock order I'd much prefer to fix > that path, because the locking design of swapoff and swap_out looks like the > obviously right one to me. Okay, give it a shot, but I think changing the places which hold tasklist_lock might be a bigger effort. In any case, for vm activity like swap space deletion, holding the tasklist_lock is the worst possible alternative, and should be done only if other alternatives are too intrusive. I know, it has been like this for a while now ... Oh, btw, before we start discussing this, you should run that stress test to make sure whether a lock order violation actually exists or not ... Kanoj > > Andrea > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 0:15 ` Kanoj Sarcar @ 2000-04-08 13:14 ` Andrea Arcangeli 2000-04-08 21:47 ` Kanoj Sarcar 0 siblings, 1 reply; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-08 13:14 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Fri, 7 Apr 2000, Kanoj Sarcar wrote: >> BTW, swap_out() always used the same locking order that I added to swapoff, >> so if my patch is wrong, swap_out() has always been wrong as well ;). > >Not sure what you mean ... swap_out never grabbed the mmap_sem/page_table_lock >before (in 2.2 too). In 2.2.x page_table_lock wasn't necessary because we were holding the big kernel lock. In 2.3.x vmlist_*_lock is an alias for spin_lock(&mm->page_table_lock), and swap_out isn't even calling the spin_lock explicitly, but it's doing what the fixed swapoff does. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
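For reference, the aliasing in question looks approximately like this in the 2.3.99 headers (quoted from memory, so treat the exact form as an assumption):

#define vmlist_access_lock(mm)          spin_lock(&(mm)->page_table_lock)
#define vmlist_access_unlock(mm)        spin_unlock(&(mm)->page_table_lock)
#define vmlist_modify_lock(mm)          vmlist_access_lock(mm)
#define vmlist_modify_unlock(mm)        vmlist_access_unlock(mm)

So taking vmlist_access_lock() is taking the mm's page_table_lock, which is why swap_out() never names the spinlock explicitly.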
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 13:14 ` Andrea Arcangeli @ 2000-04-08 21:47 ` Kanoj Sarcar 2000-04-08 23:10 ` Andrea Arcangeli 2000-04-10 19:10 ` Stephen C. Tweedie 0 siblings, 2 replies; 47+ messages in thread From: Kanoj Sarcar @ 2000-04-08 21:47 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm > > On Fri, 7 Apr 2000, Kanoj Sarcar wrote: > > >> BTW, swap_out() always used the same locking order that I added to swapoff, > >> so if my patch is wrong, swap_out() has always been wrong as well ;). > > > >Not sure what you mean ... swap_out never grabbed the mmap_sem/page_table_lock > >before (in 2.2 too). > > In 2.2.x page_table_lock wasn't necessary because we were holding the big > kernel lock. You have answered your own question in a later email. I quote you: "Are you using read_swap_cache from any swapin event? The problem is swapin can't use read_swap_cache because with read_swap_cache we would never know if we're doing I/O on an inactive swap entry" Grabbing lock_kernel in swap_in is not enough, since it might get dropped if the swap_in code goes to sleep for some reason. Kanoj > > In 2.3.x vmlist_*_lock is an alias for spin_lock(&mm->page_table_lock), and > swap_out isn't even calling the spin_lock explicitly, but it's doing what > the fixed swapoff does. > > Andrea > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux.eu.org/Linux-MM/ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
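The reason lock_kernel() cannot protect anything across a sleep is the scheduler itself: the big lock is dropped when a task blocks and retaken when it runs again. Heavily simplified from 2.3.x kernel/sched.c (a sketch, details elided):

asmlinkage void schedule(void)
{
        /* ... */
        release_kernel_lock(prev, this_cpu);    /* BKL dropped here */
        /* ... pick the next task, context switch ... */
        reacquire_kernel_lock(current);         /* retaken on wakeup */
}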
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 21:47 ` Kanoj Sarcar @ 2000-04-08 23:10 ` Andrea Arcangeli 2000-04-08 23:21 ` Kanoj Sarcar 0 siblings, 1 reply; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-08 23:10 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Sat, 8 Apr 2000, Kanoj Sarcar wrote: >> >> On Fri, 7 Apr 2000, Kanoj Sarcar wrote: >> >> >> BTW, swap_out() always used the same locking order that I added to swapoff, >> >> so if my patch is wrong, swap_out() has always been wrong as well ;). >> > >> >Not sure what you mean ... swap_out never grabbed the mmap_sem/page_table_lock >> >before (in 2.2 too). >> >> In 2.2.x page_table_lock wasn't necessary because we were holding the big >> kernel lock. > >You have answered your own question in a later email. I quote you: >"Are you using read_swap_cache from any >swapin event? The problem is swapin can't use read_swap_cache because with >read_swap_cache we would never know if we're doing I/O on an inactive swap >entry" > >Grabbing lock_kernel in swap_in is not enough, since it might get dropped >if the swap_in code goes to sleep for some reason. I wasn't talking about races between swapoff/swapin. I was talking about the locking order issue you raised about the necessary vmlist_*_lock I added in swapoff. What I meant is that in 2.2.x there was no need for the vmlist_*_lock/page_cache_lock in swapoff, because we were relying on the big kernel lock while playing with pagetables and vmas (same in swap_out()). In 2.3.x both swap_out and swapoff need to grab the tasklist_lock first (as in 2.2.x) and then the vmlist_*_lock (otherwise the vma browsing may happen during a vma list modification). Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 23:10 ` Andrea Arcangeli @ 2000-04-08 23:21 ` Kanoj Sarcar 2000-04-08 23:39 ` Andrea Arcangeli 0 siblings, 1 reply; 47+ messages in thread From: Kanoj Sarcar @ 2000-04-08 23:21 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm > > I was talking about the locking order issue you raised about the necessary > vmlist_*_lock I added in swapoff. > > What I meant is that in 2.2.x there was no need for the > vmlist_*_lock/page_cache_lock in swapoff, because we were relying on the big > kernel lock while playing with pagetables and vmas (same in swap_out()). > > In 2.3.x both swap_out and swapoff need to grab the tasklist_lock first > (as in 2.2.x) and then the vmlist_*_lock (otherwise the vma > browsing may happen during a vma list modification). > > Andrea > As I mentioned before, have you stress tested this to make sure grabbing read_lock(tasklist_lock), then spin_lock(vmlist_lock), is not deadlock prone? I _think_ you will run into problems ... and we can then stop discussing this. Kanoj -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 23:21 ` Kanoj Sarcar @ 2000-04-08 23:39 ` Andrea Arcangeli 2000-04-09 0:40 ` Kanoj Sarcar 0 siblings, 1 reply; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-08 23:39 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Sat, 8 Apr 2000, Kanoj Sarcar wrote: >As I mentioned before, have you stress tested this to make sure grabbing I have stress tested the whole thing (also a few minutes ago to check the latest patch) but it never locked up so we have to think about it. Could you explain why you think it's the inverse lock ordering? Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 23:39 ` Andrea Arcangeli @ 2000-04-09 0:40 ` Kanoj Sarcar 2000-04-10 8:55 ` andrea 0 siblings, 1 reply; 47+ messages in thread From: Kanoj Sarcar @ 2000-04-09 0:40 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm > > On Sat, 8 Apr 2000, Kanoj Sarcar wrote: > > >As I mentioned before, have you stress tested this to make sure grabbing > > I have stress tested the whole thing (also a few minutes ago to check the > latest patch) but it never locked up so we have to think about it. Okay, good. > > Could you explain why you think it's the inverse lock ordering? Let me see; if I can come up with something, I will let you know. If it survives stress testing, it probably is not inverse locking. Btw, I am looking at your patch with message id <Pine.LNX.4.21.0004081924010.317-100000@alpha.random>, which does not seem to be holding the vmlist/pagetable lock in the swapdelete code (at least at first blush). That was partly why I wanted to know what fixes are in your patch ... Note: I prefer being able to hold mmap_sem in the swapdelete path, which will provide protection against fork/exit races too. I will try to port over my version of the patch, and list the problems it fixes ... Kanoj -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-09 0:40 ` Kanoj Sarcar @ 2000-04-10 8:55 ` andrea 2000-04-11 2:45 ` Kanoj Sarcar 0 siblings, 1 reply; 47+ messages in thread From: andrea @ 2000-04-10 8:55 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Sat, 8 Apr 2000, Kanoj Sarcar wrote: >Btw, I am looking at your patch with message id ><Pine.LNX.4.21.0004081924010.317-100000@alpha.random>, which does not >seem to be holding the vmlist/pagetable lock in the swapdelete code (at >least at first blush). That was partly why I wanted to know what fixes >are in your patch ... The patch was against the earlier swap-entry patch that was also fixing the vma/pte locking in swapoff. All the three patches I posted were incremental. >Note: I prefer being able to hold mmap_sem in the swapdelete path, which >will provide protection against fork/exit races too. I will try to port With my approach swapoff is serialized w.r.t. fork/exit the same way as swap_out(). However I see a potential future problem in exit_mmap() that makes the entries unreachable before swapoff starts and does the swap_free() after swapoff has completed and the swap device has gone away (== too late). That's not an issue right now though, since both swapoff and do_exit() hold the big kernel lock, but it will become an issue eventually. Probably exit_mmap() should unlink and unmap the vmas bit by bit, using locking to unlink and leaving them visible while they are not yet released. That should get rid of that future race. About grabbing the mmap semaphore in unuse_process: we don't need to do that because we aren't changing vmas from swapoff. Swapoff only browses and changes pagetables, so it only needs the vmalist-access read-spinlock that keeps the vmas from going away, and the pagetable exclusive spinlock because we'll change the pagetables (and the latter is implied in the vmlist_access_lock, as we know from the vmlist_access_lock implementation). swap_out() can't grab the mmap_sem for obvious reasons, so if you only grab the mmap_sem you'll have to rely only on the big kernel lock to keep swap_out() from racing with your swapoff, right? That doesn't look like the right long term solution. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
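The exit_mmap() change Andrea sketches — keep each vma visible until its ptes are really gone, then unlink it under the lock — could look like this (a hypothetical sketch, not posted code; file cleanup and flushing elided):

        while (mm->mmap) {
                struct vm_area_struct *vma = mm->mmap;

                /* may sleep: the ptes (and their swap counts) are freed
                 * here while the vma is still reachable by swapoff */
                zap_page_range(mm, vma->vm_start,
                               vma->vm_end - vma->vm_start);

                vmlist_modify_lock(mm);
                mm->mmap = vma->vm_next;        /* only now unlink it */
                vmlist_modify_unlock(mm);
                kmem_cache_free(vm_area_cachep, vma);
        }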
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-10 8:55 ` andrea @ 2000-04-11 2:45 ` Kanoj Sarcar 2000-04-11 16:22 ` Andrea Arcangeli 0 siblings, 1 reply; 47+ messages in thread From: Kanoj Sarcar @ 2000-04-11 2:45 UTC (permalink / raw) To: andrea; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm > > On Sat, 8 Apr 2000, Kanoj Sarcar wrote: > > >Btw, I am looking at your patch with message id > ><Pine.LNX.4.21.0004081924010.317-100000@alpha.random>, which does not > >seem to be holding the vmlist/pagetable lock in the swapdelete code (at > >least at first blush). That was partly why I wanted to know what fixes > >are in your patch ... > > The patch was against the earlier swap-entry patch that was also fixing the > vma/pte locking in swapoff. All the three patches I posted were > incremental. > > >Note: I prefer being able to hold mmap_sem in the swapdelete path, which > >will provide protection against fork/exit races too. I will try to port > > With my approach swapoff is serialized w.r.t. fork/exit the same way as > swap_out(). However I see a potential future problem in exit_mmap() that While forking, a parent might copy a swap handle into the child, but we might entirely miss scanning the child because it is not on the process list (kernel_lock is not enough, forking code might sleep). Eg: in the body of dup_mmap, we go to sleep due to memory shortage, which kicks page stealing (i.e. highly asynchronous swapio). The same problem exists in exit_mmap. In this case, one of the routines inside exit_mmap() can very plausibly go to sleep. Eg: file_unmap. > makes the entries unreachable before swapoff starts and does the > swap_free() after swapoff has completed and the swap device has gone away (== > too late). That's not an issue right now though, since both swapoff and > do_exit() hold the big kernel lock, but it will become an issue > eventually. Probably exit_mmap() should unlink and unmap the vmas bit by > bit, using locking to unlink and leaving them visible while they are not yet > released. That should get rid of that future race. > > About grabbing the mmap semaphore in unuse_process: we don't need to do > that because we aren't changing vmas from swapoff. Swapoff only browses > and changes pagetables, so it only needs the vmalist-access read-spinlock > that keeps the vmas from going away, and the pagetable exclusive spinlock > because we'll change the pagetables (and the latter is implied in the > vmlist_access_lock, as we know from the vmlist_access_lock implementation). See above. > > swap_out() can't grab the mmap_sem for obvious reasons, so if you only Why not? Of course, not with tasklist_lock held (Hehehe, I am not that stupid :-)). But other mechanisms are possible. > grab the mmap_sem you'll have to rely only on the big kernel lock to keep > swap_out() from racing with your swapoff, right? That doesn't look like the > right long term solution. Actually, let me put out the patch; for different reasons, IMO, it is the right long term solution ... Kanoj > > Andrea > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux.eu.org/Linux-MM/ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-11 2:45 ` Kanoj Sarcar @ 2000-04-11 16:22 ` Andrea Arcangeli 2000-04-11 17:40 ` Rik van Riel 2000-04-11 18:26 ` Kanoj Sarcar 1 sibling, 2 replies; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-11 16:22 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Mon, 10 Apr 2000, Kanoj Sarcar wrote: >While forking, a parent might copy a swap handle into the child, but we That's a bug in fork. Simply have fork check whether the swap handle is SWAPOK before increasing the swap count. If it's SWAPOK then swap_duplicate() succeeds; otherwise do the swapin using the swap-cache-based locking, as swapin now does in my current tree (check whether the pte has changed before going on to increase the swap count, undo the swapcache insertion in that case, and serialize with swapoff and swap_out with the page_cache_lock). >The same problem exists in exit_mmap. In this case, one of the routines inside >exit_mmap() can very plausibly go to sleep. Eg: file_unmap. exit_mmap can sleep there. But it must not hide the mm, as said in an earlier email. It has to zap_page_range and then unlink the vmas all bit by bit, serializing with the vmlist_modify_lock. >> swap_out() can't grab the mmap_sem for obvious reasons, so if you only > >Why not? Of course, not with tasklist_lock held (Hehehe, I am not that >stupid :-)). But other mechanisms are possible. Lock recursion -> deadlock. userspace runs page fault down(&current->mm->mmap_sem); try_to_free_pages(); swap_out(); down(&current->mm->mmap_sem); <- you deadlocked We are serializing swap_out/do_wp_page with the page_cache_lock (in swapout the page_cache_lock is implied by the vmlist_access_lock). In the same way I'm serializing swapoff with swapin, using swap-cache-based locking and pagetable checks with the page_cache_lock acquired, and protecting swapoff with the vmlist_access_lock() that implies the page_cache_lock. Using the page_cache_lock and rechecking the page table looks like the right way to go to me. It has no design problems that end in lock recursion. >Actually, let me put out the patch; for different reasons, IMO, it is the >right long term solution ... The patch is welcome indeed. However relying on the mmap_sem looks like the wrong way to me. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
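The fork-side check Andrea proposes would sit in copy_page_range(), where the current code BUG()s when swap_duplicate() fails (see the memory.c hunk in his patch earlier in the thread). A hypothetical sketch of the shape, not code from any posted patch:

        if (!pte_present(pte)) {
                swp_entry_t entry = pte_to_swp_entry(pte);

                if (swap_duplicate(entry)) {
                        /* common case: the handle is still SWAP_MAP-valid */
                        set_pte(dst_pte, pte);
                        goto cont_copy_pte_range;
                }
                /*
                 * The entry is being eaten by swapoff: fall back to a
                 * swap-cache-locked swapin (as in do_swap_page) and map
                 * the resulting page instead of copying a dead handle.
                 */
        }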
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-11 16:22 ` Andrea Arcangeli @ 2000-04-11 17:40 ` Rik van Riel 2000-04-11 18:20 ` Kanoj Sarcar 2000-04-21 18:23 ` Andrea Arcangeli 1 sibling, 2 replies; 47+ messages in thread From: Rik van Riel @ 2000-04-11 17:40 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Kanoj Sarcar, Ben LaHaise, Linus Torvalds, linux-mm On Tue, 11 Apr 2000, Andrea Arcangeli wrote: > On Mon, 10 Apr 2000, Kanoj Sarcar wrote: > > >While forking, a parent might copy a swap handle into the child, but we > > That's a bug in fork. Simply have fork check whether the swap handle > is SWAPOK before increasing the swap count. If it's > SWAPOK then swap_duplicate() succeeds; "it was hard to write, it should be hard to maintain" Relying on pieces of magic like this, spread out all over the kernel source, will make the code more fragile and hell to maintain. Unless somebody writes the documentation for all of this, of course... regards, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-11 17:40 ` Rik van Riel @ 2000-04-11 18:20 ` Kanoj Sarcar 0 siblings, 0 replies; 47+ messages in thread From: Kanoj Sarcar @ 2000-04-11 18:20 UTC (permalink / raw) To: riel; +Cc: Andrea Arcangeli, Ben LaHaise, Linus Torvalds, linux-mm > > On Tue, 11 Apr 2000, Andrea Arcangeli wrote: > > On Mon, 10 Apr 2000, Kanoj Sarcar wrote: > > > > >While forking, a parent might copy a swap handle into the child, but we > > > > That's a bug in fork. Simply have fork check whether the swap handle > > is SWAPOK before increasing the swap count. If it's > > SWAPOK then swap_duplicate() succeeds; > > "it was hard to write, it should be hard to maintain" > > Relying on pieces of magic like this, spread out all over > the kernel source, will make the code more fragile and hell > to maintain. No, it's not magic, since it still doesn't work. Watch out for mail from me as to why ... > > Unless somebody writes the documentation for all of this, > of course... > Precisely, that's why I started off Documentation/vm/locking. It would be _really_ nice if folks doing locking (or for that matter, any other delicate) work were to update these files ... Kanoj > regards, > > Rik > -- > The Internet is not a network of computers. It is a network > of people. That is its real strength. > > Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies > http://www.conectiva.com/ http://www.surriel.com/ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-11 17:40 ` Rik van Riel 2000-04-11 18:20 ` Kanoj Sarcar @ 2000-04-21 18:23 ` Andrea Arcangeli 2000-04-21 21:00 ` Rik van Riel 1 sibling, 1 reply; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-21 18:23 UTC (permalink / raw) To: riel; +Cc: Kanoj Sarcar, Ben LaHaise, Linus Torvalds, linux-mm The swap-entry fixes cleared by the swap locking changes are here: ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.3/2.3.99-pre6-pre3/swap-entry-3 Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-21 18:23 ` Andrea Arcangeli @ 2000-04-21 21:00 ` Rik van Riel 2000-04-22 1:12 ` Andrea Arcangeli 0 siblings, 1 reply; 47+ messages in thread From: Rik van Riel @ 2000-04-21 21:00 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Kanoj Sarcar, Ben LaHaise, Linus Torvalds, linux-mm On Fri, 21 Apr 2000, Andrea Arcangeli wrote: > The swap-entry fixes cleared by the swap locking changes are here: > > ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.3/2.3.99-pre6-pre3/swap-entry-3 The patch looks "obviously correct", but it would be nice if you could use the PageClearSwapCache and related macros for changing the bitflags. Things like new_page->flags &= ~(1UL << PG_swap_entry); just make the code less readable than it has to be. (and yes, loads of people already run away screaming when they look at the memory management code; I really think we should make maintainability a higher priority target for the code) regards, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-21 21:00 ` Rik van Riel @ 2000-04-22 1:12 ` Andrea Arcangeli 2000-04-22 1:51 ` Linus Torvalds 0 siblings, 1 reply; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-22 1:12 UTC (permalink / raw) To: riel; +Cc: Kanoj Sarcar, Ben LaHaise, Linus Torvalds, linux-mm On Fri, 21 Apr 2000, Rik van Riel wrote: >you could use the PageClearSwapCache and related macros for >changing the bitflags. BTW, thinking about it more, I think the clear_bit in shrink_mmap should really be atomic (lookup_swap_cache can run from under it and try to lock the page, playing with the page->flags while we're clearing the swap_entry bitflag). The other places don't need to be atomic as far as I can tell, so (as just said) I'd prefer not to add unscalable SMP locking. Suggest a name for a new macro that doesn't use asm and I can use it of course. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
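The distinction at stake, in two lines: clear_bit() is a locked read-modify-write, while the plain C expression shrink_mmap uses today is an unlocked one that can swallow a concurrent flag update on the same word:

        /* atomic: safe against a parallel test_and_set_bit(PG_locked, ...) */
        clear_bit(PG_swap_entry, &page->flags);

        /* non-atomic: the load/modify/store can lose a bit another CPU
         * sets between our load and our store */
        page->flags &= ~(1UL << PG_swap_entry);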
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-22 1:12 ` Andrea Arcangeli @ 2000-04-22 1:51 ` Linus Torvalds 2000-04-22 18:29 ` Rik van Riel 0 siblings, 1 reply; 47+ messages in thread From: Linus Torvalds @ 2000-04-22 1:51 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: riel, Kanoj Sarcar, Ben LaHaise, linux-mm On Sat, 22 Apr 2000, Andrea Arcangeli wrote: > > On Fri, 21 Apr 2000, Rik van Riel wrote: > > >you could use the PageClearSwapCache and related macros for > >changing the bitflags. > > BTW, thinking about it more, I think the clear_bit in shrink_mmap should really be > atomic (lookup_swap_cache can run from under it and try to lock the page, > playing with the page->flags while we're clearing the swap_entry bitflag). Actually, I was toying with the much simpler rule: - "PG_locked" is always atomic - all other flags can only be tested/changed if PG_locked holds This simple rule would allow for not using the (slow) atomic operations, because the other bits are always protected by the one-bit lock. PG_accessed is probably a special case, as it's just a hint bit - so it might be ok to say something like "you can change PG_accessed without holding the PG_locked bit, but then you have to use an atomic operation". That way changes to PG_accessed might be lost (by other non-atomic updaters that hold the PG_lock), but at least changing PG_accessed without holding PG_lock cannot cause other bits to become lost. Does the above sound sane? It might be reasonably easy to verify that the rule holds (ie make all the macros check that PG_locked is set, and hand-check the few places where we access page->flags directly). The rules sound safe to me, and mean that most of the updates could be non-atomic. Comments? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
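One way to make the rule mechanically checkable (a sketch, not from any posted patch): keep the lock bit on locked bitops, and have every other flag macro assert the page lock before doing a plain RMW.

/* PG_locked itself stays a locked bitop */
#define TryLockPage(page)       test_and_set_bit(PG_locked, &(page)->flags)

/* every other flag asserts the page lock and may then use a
 * plain, non-atomic read-modify-write */
#define SetPageDirty(page)                                      \
        do {                                                    \
                if (!PageLocked(page))                          \
                        BUG();                                  \
                (page)->flags |= (1UL << PG_dirty);             \
        } while (0)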
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-22 1:51 ` Linus Torvalds @ 2000-04-22 18:29 ` Rik van Riel 2000-04-22 19:58 ` Linus Torvalds 0 siblings, 1 reply; 47+ messages in thread From: Rik van Riel @ 2000-04-22 18:29 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrea Arcangeli, Kanoj Sarcar, Ben LaHaise, linux-mm On Fri, 21 Apr 2000, Linus Torvalds wrote: > On Sat, 22 Apr 2000, Andrea Arcangeli wrote: > > On Fri, 21 Apr 2000, Rik van Riel wrote: > > > > >you could use the PageClearSwapCache and related macros for > > >changing the bitflags. > > > > BTW, thinking about it more, I think the clear_bit in shrink_mmap should really be > > atomic (lookup_swap_cache can run from under it and try to lock the page, > > playing with the page->flags while we're clearing the swap_entry bitflag). > > Actually, I was toying with the much simpler rule: > - "PG_locked" is always atomic > - all other flags can only be tested/changed if PG_locked holds > > This simple rule would allow for not using the (slow) atomic operations, > because the other bits are always protected by the one-bit lock. It all depends on the source code. If we're holding the page lock anyway in places where we play with the other flags, that's probably the best strategy, but if we're updating the page flags in a lot of places without holding the page lock, then it's probably better to just do everything with the current atomic bitops. Btw, here's a result from 2.3.99-pre6-3 ... lines 3 and 4 are extremely suspect... [riel@duckman mm]$ grep 'page->flags' *.c filemap.c: if (test_and_clear_bit(PG_referenced, &page->flags)) filemap.c: set_bit(PG_referenced, &page->flags); filemap.c: flags = page->flags & ~((1 << PG_uptodate) | (1 << PG_error) | (1 << PG_dirty)); filemap.c: page->flags = flags | (1 << PG_locked) | (1 << PG_referenced); page_io.c: set_bit(PG_decr_after, &page->flags); vmscan.c: set_bit(PG_referenced, &page->flags); Here's the suspect piece of code (filemap.c::__add_to_page_cache()): flags = page->flags & ~((1 << PG_uptodate) | (1 << PG_error) | (1 << PG_dirty)); page->flags = flags | (1 << PG_locked) | (1 << PG_referenced); get_page(page); So here we play with a number of page flags _before_ taking the page or locking it. It's probably safe because of the circumstances under which we are called, but somehow it just doesn't look clean to me ;) regards, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-22 18:29 ` Rik van Riel @ 2000-04-22 19:58 ` Linus Torvalds 0 siblings, 0 replies; 47+ messages in thread From: Linus Torvalds @ 2000-04-22 19:58 UTC (permalink / raw) To: riel; +Cc: Andrea Arcangeli, Kanoj Sarcar, Ben LaHaise, linux-mm On Sat, 22 Apr 2000, Rik van Riel wrote: > > It all depends on the source code. If we're holding the page > lock anyway in places where we play with the other flags, that's > probably the best strategy, but if we're updating the page flags > in a lot of places without holding the page lock, then it's > probably better to just do everything with the current atomic > bitops. I suspect that we hold the page lock already. And it just seems wrong that we could ever alter some of the flags (like dirty or uptodate) without holding the page lock. So the logic is more than just "bitfield coherency": it's also a higher-level coherency guarantee. > Btw, here's a result from 2.3.99-pre6-3 ... lines 3 and > 4 are extremely suspect... > > [riel@duckman mm]$ grep 'page->flags' *.c > filemap.c: if (test_and_clear_bit(PG_referenced, &page->flags)) > filemap.c: set_bit(PG_referenced, &page->flags); > filemap.c: flags = page->flags & ~((1 << PG_uptodate) | (1 << PG_error) | (1 << PG_dirty)); > filemap.c: page->flags = flags | (1 << PG_locked) | (1 << PG_referenced); Both 3 and 4 are from the same sequence: it's the initialization code for the page. They are run before the page is added to the page cache, so atomicity is not even an issue, because nobody can get at the page before it is on any of the page lists. Think of it as an anonymous page that is just about to get truly instantiated into the page cache. It's also one of the few times where we truly touch many bits. In most other cases we touch just one or two at a time. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-11 16:22 ` Andrea Arcangeli 2000-04-11 17:40 ` Rik van Riel @ 2000-04-11 18:26 ` Kanoj Sarcar 1 sibling, 0 replies; 47+ messages in thread From: Kanoj Sarcar @ 2000-04-11 18:26 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm > > On Mon, 10 Apr 2000, Kanoj Sarcar wrote: > > >While forking, a parent might copy a swap handle into the child, but we > > That's a bug in fork. Simply have fork check whether the swap handle is SWAPOK > before increasing the swap count. If it's SWAPOK then > swap_duplicate() succeeds; otherwise do the swapin using the swap-cache-based > locking, as swapin now does in my current tree (check whether the pte has changed > before going on to increase the swap count, undo the swapcache insertion in > that case, and serialize with swapoff and swap_out with the page_cache_lock). I still can't see how this will help. You still might end up having an unreachable mm, since the task is not on the active list, and swapoff unconditionally decrements the swap count after printing a warning message. Later on, the process also tries decrementing the count, or accessing the swap handle ... > > >The same problem exists in exit_mmap. In this case, one of the routines inside > >exit_mmap() can very plausibly go to sleep. Eg: file_unmap. > > exit_mmap can sleep there. But it must not hide the mm, as said in > an earlier email. It has to zap_page_range and then unlink the vmas all bit > by bit, serializing with the vmlist_modify_lock. Sure, write the patch. It seems to me just getting the mmap_sem once should be enough, instead of multiple acquire/releases of the vmlist_modify_lock, if for no other reason than performance. But I will agree to any solution that fixes races. Correctness first, then performance ... > > >> swap_out() can't grab the mmap_sem for obvious reasons, so if you only > > > >Why not? Of course, not with tasklist_lock held (Hehehe, I am not that > >stupid :-)). But other mechanisms are possible. > > Lock recursion -> deadlock. > > userspace runs > page fault > down(&current->mm->mmap_sem); > try_to_free_pages(); > swap_out(); > down(&current->mm->mmap_sem); <- you deadlocked I am not sure what you mean. Maybe we should just stop talking about this till I get a chance to post the patch; then you can show me the holes in it. I am not sure if I have been able to communicate clearly what I want to do, and how ... Kanoj > > We are serializing swap_out/do_wp_page with the page_cache_lock (in > swapout the page_cache_lock is implied by the vmlist_access_lock). > > In the same way I'm serializing swapoff with swapin, using swap-cache-based > locking and pagetable checks with the page_cache_lock acquired, and > protecting swapoff with the vmlist_access_lock() that implies the > page_cache_lock. > > Using the page_cache_lock and rechecking the page table looks like the right way to > go to me. It has no design problems that end in lock recursion. > > >Actually, let me put out the patch; for different reasons, IMO, it is the > >right long term solution ... > > The patch is welcome indeed. However relying on the mmap_sem looks like the > wrong way to me. > > Andrea > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
  2000-04-08 21:47           ` Kanoj Sarcar
  2000-04-08 23:10             ` Andrea Arcangeli
@ 2000-04-10 19:10             ` Stephen C. Tweedie
  1 sibling, 0 replies; 47+ messages in thread
From: Stephen C. Tweedie @ 2000-04-10 19:10 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: Andrea Arcangeli, Ben LaHaise, riel, Linus Torvalds, linux-mm,
	Stephen Tweedie

Hi,

On Sat, Apr 08, 2000 at 02:47:29PM -0700, Kanoj Sarcar wrote:
>
> You have answered your own question in a later email. I quote you:
> "Are you using read_swap_cache from any
> swapin event? The problem is swapin can't use read_swap_cache because with
> read_swap_cache we would never know if we're doing I/O on an inactive swap
> entry"

Right. The way the swap synchronisation always worked was that we must
have a swap cache entry before _any_ IO, so the page lock bit on that
swap cache page could also serve as an IO lock on the swap entry.
(Actually, it didn't _always_ work that way, but that's the basic
mechanism with the current swap cache.)

That relied on the swapper being able to do an atomic operation to
search for a page cache page, and to create a new page and lock it if
it wasn't already there. That was made atomic only by use of the big
lock. If you don't have all page cache activity using that lock, then
yes, you'll need the page cache lock while you set all of this up.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see:
http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 47+ messages in thread
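The atomic "find it, or create it already locked" operation Stephen
describes can be sketched in userspace roughly as follows; the hash
table, struct, and helper names are all invented for illustration and
are not the 2.3 swap cache code. The two key points are that the
allocation (which may sleep) happens with the lock dropped and the
lookup redone afterwards, and that a new page is locked before it
becomes visible, so its lock can double as the IO lock on the swap
entry:

	#include <pthread.h>
	#include <stdlib.h>

	#define TABLE_SIZE 64

	struct spage {
		unsigned long entry;
		pthread_mutex_t lock;  /* stands in for PG_locked, the IO lock */
		struct spage *next;
		int refcount;
	};

	static pthread_mutex_t pagecache_lock = PTHREAD_MUTEX_INITIALIZER;
	static struct spage *swap_hash[TABLE_SIZE];

	static struct spage *lookup_nolock(unsigned long entry)
	{
		struct spage *p = swap_hash[entry % TABLE_SIZE];
		while (p && p->entry != entry)
			p = p->next;
		return p;
	}

	/* Atomically find an existing swap-cache page, or insert a new one
	 * that is already locked. */
	static struct spage *find_or_create_locked(unsigned long entry)
	{
		struct spage *p, *new = NULL;

	repeat:
		pthread_mutex_lock(&pagecache_lock);
		p = lookup_nolock(entry);
		if (p) {
			p->refcount++;
			pthread_mutex_unlock(&pagecache_lock);
			free(new); /* lost the race: someone inserted first */
			return p;  /* caller waits on p->lock if IO is in flight */
		}
		if (!new) {
			/* Allocation may sleep, so do it without the lock
			 * held, then redo the lookup from scratch. */
			pthread_mutex_unlock(&pagecache_lock);
			new = calloc(1, sizeof(*new));
			if (!new)
				return NULL;
			pthread_mutex_init(&new->lock, NULL);
			goto repeat;
		}
		new->entry = entry;
		new->refcount = 1;
		pthread_mutex_lock(&new->lock); /* locked before it is visible */
		new->next = swap_hash[entry % TABLE_SIZE];
		swap_hash[entry % TABLE_SIZE] = new;
		pthread_mutex_unlock(&pagecache_lock);
		return new; /* caller starts the IO and unlocks on completion */
	}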
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
  2000-04-07 20:12     ` Kanoj Sarcar
  2000-04-07 23:26       ` Andrea Arcangeli
  2000-04-07 23:54       ` Andrea Arcangeli
@ 2000-04-08  0:04       ` Andrea Arcangeli
  2 siblings, 0 replies; 47+ messages in thread
From: Andrea Arcangeli @ 2000-04-08 0:04 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

On Fri, 7 Apr 2000, Kanoj Sarcar wrote:

>are unneeded when you consider the kernel_lock is already held in most
>of those paths.[..]

Good point. However, I'm not thinking, and I'm not going to think, with
the big kernel lock in mind in the paths where we hold the big kernel
lock only incidentally, because somebody _else_ still needs it (as with
acquire_swap_page/get_swap_page/swap_free).

The setting of SWP_USED in swapoff has to be done inside the critical
section protected by the swaplist lock. That was at least a conceptual
bug, even if it couldn't trigger, since swap_out and swapoff both hold
the big kernel lock.

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see:
http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 47+ messages in thread
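The SWP_USED point is the usual rule that a flag tested under a lock
must also be written under it. A schematic sketch, with made-up
structure and function names, of what "set it inside the swaplist-lock
critical section" means:

	#include <pthread.h>

	#define SWP_USED 1

	struct swap_info_sketch {
		int flags;
		struct swap_info_sketch *next;
	};

	static pthread_mutex_t swaplist_lock = PTHREAD_MUTEX_INITIALIZER;
	static struct swap_info_sketch *swap_list_head;

	/* The failure path of swapoff re-activates the device. Doing the
	 * flag store inside the locked section means a reader that tests
	 * SWP_USED under the same lock can never see the flag without the
	 * relink, or the relink without the flag. */
	static void reactivate_swap_device(struct swap_info_sketch *p)
	{
		pthread_mutex_lock(&swaplist_lock);
		p->flags = SWP_USED;      /* flag and relink: one atomic step */
		p->next = swap_list_head; /* put the device back on the list */
		swap_list_head = p;
		pthread_mutex_unlock(&swaplist_lock);
	}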
[parent not found: <yttem7xstk2.fsf@vexeta.dc.fi.udc.es>]
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
       [not found] <yttem7xstk2.fsf@vexeta.dc.fi.udc.es>
@ 2000-04-23  0:52 ` Andrea Arcangeli
  0 siblings, 0 replies; 47+ messages in thread
From: Andrea Arcangeli @ 2000-04-23 0:52 UTC (permalink / raw)
  To: Juan J. Quintela
  Cc: Linus Torvalds, riel, Kanoj Sarcar, Ben LaHaise, linux-mm

On 22 Apr 2000, Juan J. Quintela wrote:

>swap. Without this line, my systems always BUG running this
>application.
>
>I tested also to put something like:
>
>	if (!PageLocked(page))
>		BUG();
>
>but with that the kernel doesn't boot. It BUGs.

That's normal: a newly allocated page is not locked (you should BUG if
the page was locked instead). Before hashing the page, the page is not
visible and we can do whatever we want with it. It's also not locked,
since it doesn't need to be locked in the first place. Locking a page
makes sense only as far as the page can be shared by more than one
user. The point of setting the locked bit there is to give visibility
to the page only _after_ we set the page lock on it, to make sure
nobody will play with it before we have finished. It doesn't matter how
we set the page locked, as long as we do it before hashing the page.

The thing I'm not 100% sure about from Linus's previous email (not
about add_to_page_cache, but about not needing the lock on the bus to
change the other bitflags while we're holding the page lock) is whether
we can be sure the other atomic trylock can't run from under us and
invalidate our nonatomic clearbit. If the other atomic trylock does
keep us away while it's running, I fully agree we don't need the lock
on the bus to change the other bitflags while we're holding the page
lock.

I want to try a little userspace simulation to make sure we're safe
that way. I want to make sure we don't also need the lock on the bus to
avoid messing with the other trylock that tries to run from under us.

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see:
http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 47+ messages in thread
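The userspace simulation Andrea has in mind might look something like
the sketch below (the bit numbers, names, and the use of GCC's __sync
builtins are this sketch's choices, not anything from the thread). One
thread performs locked set/clear operations on one bit while the main
loop does a plain, non-atomic read-modify-write of another bit in the
same word. On a multiprocessor the plain write-back can erase a
concurrent locked update; the sketch demonstrates that generic
lost-update hazard, while whether the page-lock convention actually
excludes the concurrent trylock is the narrower question the thread
leaves open:

	#include <pthread.h>
	#include <stdio.h>

	#define PG_uptodate 1	/* changed with a plain, non-atomic RMW  */
	#define PG_arch     7	/* changed only with locked (atomic) ops */

	static volatile unsigned long flags;
	static volatile int stop;
	static long lost;

	static void *atomic_bit_thread(void *arg)
	{
		(void)arg;
		while (!stop) {
			__sync_fetch_and_or(&flags, 1UL << PG_arch);
			if (!(flags & (1UL << PG_arch)))
				lost++; /* a plain RMW wrote back a stale word */
			__sync_fetch_and_and(&flags, ~(1UL << PG_arch));
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t;
		long i;

		pthread_create(&t, NULL, atomic_bit_thread, NULL);
		for (i = 0; i < 50 * 1000 * 1000; i++) {
			unsigned long v = flags;          /* plain load ...  */
			flags = v ^ (1UL << PG_uptodate); /* ... plain store */
		}
		stop = 1;
		pthread_join(t, NULL);
		printf("locked updates lost to the plain RMW: %ld\n", lost);
		return 0;
	}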
[parent not found: <yttk8ho26s8.fsf@vexeta.dc.fi.udc.es>]
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
       [not found] <yttk8ho26s8.fsf@vexeta.dc.fi.udc.es>
@ 2000-04-23 16:07 ` Andrea Arcangeli
  0 siblings, 0 replies; 47+ messages in thread
From: Andrea Arcangeli @ 2000-04-23 16:07 UTC (permalink / raw)
  To: Juan J. Quintela
  Cc: Linus Torvalds, riel, Kanoj Sarcar, Ben LaHaise, linux-mm

On 23 Apr 2000, Juan J. Quintela wrote:

>[..] The page is
>never locked when we enter there. [..]

That's how it's designed to work. Please check my last email to
linux-mm for an explanation of why it's the correct behaviour.

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see:
http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 47+ messages in thread
end of thread
Thread overview: 47+ messages
2000-04-03 22:22 PG_swap_entry bug in recent kernels Ben LaHaise
2000-04-04 15:06 ` Andrea Arcangeli
2000-04-04 15:46 ` Rik van Riel
2000-04-04 16:50 ` Andrea Arcangeli
2000-04-04 17:06 ` Ben LaHaise
2000-04-04 18:03 ` Andrea Arcangeli
2000-04-06 22:11 ` [patch] take 2 " Ben LaHaise
2000-04-07 10:45 ` Andrea Arcangeli
2000-04-07 11:29 ` Rik van Riel
2000-04-07 12:00 ` Andrea Arcangeli
2000-04-07 12:54 ` Rik van Riel
2000-04-07 13:14 ` Andrea Arcangeli
2000-04-07 20:12 ` Kanoj Sarcar
2000-04-07 23:26 ` Andrea Arcangeli
2000-04-08 0:11 ` Kanoj Sarcar
2000-04-08 0:37 ` Kanoj Sarcar
2000-04-08 13:20 ` Andrea Arcangeli
2000-04-08 21:39 ` Kanoj Sarcar
2000-04-08 23:02 ` Andrea Arcangeli
2000-04-08 23:18 ` Kanoj Sarcar
2000-04-08 23:58 ` Andrea Arcangeli
2000-04-08 13:30 ` Andrea Arcangeli
2000-04-08 17:39 ` Andrea Arcangeli
2000-04-07 23:54 ` Andrea Arcangeli
2000-04-08 0:15 ` Kanoj Sarcar
2000-04-08 13:14 ` Andrea Arcangeli
2000-04-08 21:47 ` Kanoj Sarcar
2000-04-08 23:10 ` Andrea Arcangeli
2000-04-08 23:21 ` Kanoj Sarcar
2000-04-08 23:39 ` Andrea Arcangeli
2000-04-09 0:40 ` Kanoj Sarcar
2000-04-10 8:55 ` andrea
2000-04-11 2:45 ` Kanoj Sarcar
2000-04-11 16:22 ` Andrea Arcangeli
2000-04-11 17:40 ` Rik van Riel
2000-04-11 18:20 ` Kanoj Sarcar
2000-04-21 18:23 ` Andrea Arcangeli
2000-04-21 21:00 ` Rik van Riel
2000-04-22 1:12 ` Andrea Arcangeli
2000-04-22 1:51 ` Linus Torvalds
2000-04-22 18:29 ` Rik van Riel
2000-04-22 19:58 ` Linus Torvalds
2000-04-11 18:26 ` Kanoj Sarcar
2000-04-10 19:10 ` Stephen C. Tweedie
2000-04-08 0:04 ` Andrea Arcangeli
[not found] <yttem7xstk2.fsf@vexeta.dc.fi.udc.es>
2000-04-23 0:52 ` Andrea Arcangeli
[not found] <yttk8ho26s8.fsf@vexeta.dc.fi.udc.es>
2000-04-23 16:07 ` Andrea Arcangeli