* PG_swap_entry bug in recent kernels
From: Ben LaHaise @ 2000-04-03 22:22 UTC
To: torvalds; +Cc: linux-mm

The following one-liner fixes a painful bug present in recent kernels:
swap cache pages left in the LRU lists and subsequently reclaimed by
shrink_mmap were resulting in new pages having the PG_swap_entry bit
set. This leads to invalid swap entries being put into users' page
tables if the page is eventually swapped out. This was nasty to track
down.

		-ben

diff -ur 2.3.99-pre4-3/mm/swap_state.c test-pre4-3/mm/swap_state.c
--- 2.3.99-pre4-3/mm/swap_state.c	Mon Dec  6 13:19:45 1999
+++ test-pre4-3/mm/swap_state.c	Mon Apr  3 17:59:30 2000
@@ -80,6 +80,7 @@
 #endif
 	remove_from_swap_cache(page);
 	swap_free(entry);
+	clear_bit(PG_swap_entry, &page->flags);
 }
 
 /*

* Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-04 15:06 UTC
To: Ben LaHaise; +Cc: torvalds, linux-mm

On Mon, 3 Apr 2000, Ben LaHaise wrote:

>The following one-liner fixes a painful bug present in recent kernels: swap
>cache pages left in the LRU lists and subsequently reclaimed by
>shrink_mmap were resulting in new pages having the PG_swap_entry bit set.

The patch is obviously wrong and shouldn't be applied. You missed the
semantics of the PG_swap_entry bitflag entirely. That bit is meant
_only_ to allow swapping the same page out to the same swap place
across write-pagein faults (to avoid swap fragmentation upon
write page faults). If you clear the bit within remove_from_swap_cache
(which gets called upon a write swapin fault) you may as well drop the
bitflag completely. PG_swap_entry is meant to give pages persistence
in the swap space.

>This leads to invalid swap entries being put into users' page tables if the
>page is eventually swapped out. This was nasty to track down.

That's obviously not the case. If you have that problem you'd better
continue to debug it. PG_swap_entry can _only_ affect performance (not
stability). You can set or modify that bit however you want, at any
time, anywhere, and the system will remain rock solid; only performance
can change.

You can trivially verify that. The bitflag is read _only_ in
acquire_swap_entry, and that function will run successfully and safely
regardless of the PG_swap_entry bitflag's setting.

That said, I obviously agree free pages shouldn't have the bit set,
since they aren't mapped anymore and so it makes no sense to provide
persistence in the swap space to unallocated pages :). I've seen where
we have a problem in not clearing the bit, but the fix definitely isn't
to clear it in the swapin-modify path.

Andrea

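For reference, a condensed sketch of the reuse path Andrea describes,
based on the acquire_swap_entry() code quoted later in this thread;
locking, the SWP_WRITEOK check, and the exact placement of the
PG_swap_entry test are simplified here, so treat it as an illustration
rather than the real code:

	swp_entry_t acquire_swap_entry_sketch(struct page *page)
	{
		swp_entry_t entry;
		struct swap_info_struct *p;
		unsigned long offset, type;

		/* page->index caches the page's old swap location */
		if (!test_bit(PG_swap_entry, &page->flags) || !page->index)
			goto fresh;
		entry.val = page->index;
		type = SWP_TYPE(entry);
		if (type >= nr_swapfiles)
			goto fresh;		/* swap device is gone */
		p = type + swap_info;
		offset = SWP_OFFSET(entry);
		if (offset >= p->max || p->swap_map[offset])
			goto fresh;		/* stale or re-used slot */
		p->swap_map[offset] = 1;	/* grab the old slot again */
		nr_swap_pages--;
		return entry;
	fresh:
		return get_swap_page();		/* allocate a new slot */
	}
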
* Re: PG_swap_entry bug in recent kernels
From: Rik van Riel @ 2000-04-04 15:46 UTC
To: Andrea Arcangeli; +Cc: Ben LaHaise, torvalds, linux-mm

On Tue, 4 Apr 2000, Andrea Arcangeli wrote:
> On Mon, 3 Apr 2000, Ben LaHaise wrote:
>
> >The following one-liner fixes a painful bug present in recent kernels: swap
> >cache pages left in the LRU lists and subsequently reclaimed by
> >shrink_mmap were resulting in new pages having the PG_swap_entry bit set.
>
> The patch is obviously wrong and shouldn't be applied. You missed the
> semantics of the PG_swap_entry bitflag entirely.

[snip]

> That said, I obviously agree free pages shouldn't have the bit
> set, since they aren't mapped anymore and so it makes no sense
> to provide persistence in the swap space to unallocated pages
> :). I've seen where we have a problem in not clearing the bit,
> but the fix definitely isn't to clear it in the swapin-modify
> path.

You might want to have a look at _where_ Ben's patch applies.

void __delete_from_swap_cache(struct page *page)
{
	swp_entry_t entry;

	entry.val = page->index;

#ifdef SWAP_CACHE_INFO
	swap_cache_del_total++;
#endif
	remove_from_swap_cache(page);
	swap_free(entry);
+	clear_bit(PG_swap_entry, &page->flags);
}

When we remove a page from the swap cache, it seems fair to me
that we _really_ remove it from the swap cache.

If __delete_from_swap_cache() is called from a wrong code path,
that's something that should be fixed, of course (but that's
orthogonal to this).

To quote from memory.c::do_swap_page():

	if (write_access && !is_page_shared(page)) {
		delete_from_swap_cache_nolock(page);
		UnlockPage(page);

If you think this is a bug, please fix it here...

cheers,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

* Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-04 16:50 UTC
To: riel; +Cc: Ben LaHaise, Linus Torvalds, linux-mm

On Tue, 4 Apr 2000, Rik van Riel wrote:

>You might want to have a look at _where_ Ben's patch applies.
>
>void __delete_from_swap_cache(struct page *page)
>{
>	swp_entry_t entry;
>
>	entry.val = page->index;
>
>#ifdef SWAP_CACHE_INFO
>	swap_cache_del_total++;
>#endif
>	remove_from_swap_cache(page);
>	swap_free(entry);
>+	clear_bit(PG_swap_entry, &page->flags);
>}
>
>When we remove a page from the swap cache, it seems fair to me
>that we _really_ remove it from the swap cache.

Are you sure you didn't mistake PG_swap_entry for PG_swap_cache? We're
talking about PG_swap_entry here. The whole point of that bit is that
it remains set on anonymous pages that aren't in the swap cache, so
that the next time we re-add them to the swap cache we'll try to swap
them out to the same swap entry the page occupied before.

>If __delete_from_swap_cache() is called from a wrong code path,
>that's something that should be fixed, of course (but that's
>orthogonal to this).

__delete_from_swap_cache is called by delete_from_swap_cache_nolock,
which is called by do_swap_page, which does the swapin.

>To quote from memory.c::do_swap_page():
>
>	if (write_access && !is_page_shared(page)) {
>		delete_from_swap_cache_nolock(page);
>		UnlockPage(page);
>
>If you think this is a bug, please fix it here...

The above quoted code is correct.

Andrea

* Re: PG_swap_entry bug in recent kernels
From: Ben LaHaise @ 2000-04-04 17:06 UTC
To: Andrea Arcangeli; +Cc: riel, Linus Torvalds, linux-mm

On Tue, 4 Apr 2000, Andrea Arcangeli wrote:
> On Tue, 4 Apr 2000, Rik van Riel wrote:
> >When we remove a page from the swap cache, it seems fair to me
> >that we _really_ remove it from the swap cache.
>
> Are you sure you didn't mistake PG_swap_entry for PG_swap_cache?

Yes, I'm sure. What's happening is that the presence of PG_swap_entry
on the page means that page->index is being returned as the swap entry.
This is completely bogus after the page has been freed and reallocated.
Just try running a swap test under any recent devel kernel -- eventually
you will see an invalid swap entry show up in someone's pte, causing a
random SIGSEGV (it's as if the entry was marked PROT_NONE -- present,
but no user access).

> We're talking about PG_swap_entry here. The whole point of that bit is
> that it remains set on anonymous pages that aren't in the swap cache,
> so that the next time we re-add them to the swap cache we'll try to
> swap them out to the same swap entry the page occupied before.

Which is bogus. If it's an anonymous page that we want to swap out to
the same swap entry, leave it in the swap cache.

> >If __delete_from_swap_cache() is called from a wrong code path,
> >that's something that should be fixed, of course (but that's
> >orthogonal to this).
>
> __delete_from_swap_cache is called by delete_from_swap_cache_nolock,
> which is called by do_swap_page, which does the swapin.

As well as from shrink_mmap.

> >To quote from memory.c::do_swap_page():
> >
> >	if (write_access && !is_page_shared(page)) {
> >		delete_from_swap_cache_nolock(page);
> >		UnlockPage(page);
> >
> >If you think this is a bug, please fix it here...
>
> The above quoted code is correct.

The code path that this patch really affects is shrink_mmap ->
__delete_from_swap_cache. Clearing the bit from shrink_mmap is an
option, but it doesn't make that much sense to me; if we're removing a
page from the swap cache, why aren't we clearing the PG_swap_entry bit?
I'd rather leave the page in the swap cache and set PG_dirty on it than
have such obscure semantics.

		-ben

* Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-04 18:03 UTC
To: Ben LaHaise; +Cc: riel, Linus Torvalds, linux-mm

On Tue, 4 Apr 2000, Ben LaHaise wrote:

>try running a swap test under any recent devel kernel -- eventually you
>will see an invalid swap entry show up in someone's pte causing a random

acquire_swap_entry() should be doing all the safety checking needed to
make sure page->index is caching a _valid_ swap entry. If it's garbage,
get_swap_page() will be called instead and a valid entry will be
allocated (and if get_swap_page() returns an invalid entry, that's
definitely not related to the PG_swap_entry logic anymore).

Could you explain to me how acquire_swap_entry() can return an invalid
swap entry (starting from a random page->index, of course)? I can't
exclude that there's a bug, but acquire_swap_entry was meant to return
only valid entries despite any possible garbage in page->index, and it
seems to be doing that.

>SIGSEGV (it's as if the entry was marked PROT_NONE -- present, but no
>user access).

I'm not saying you are not seeing that; I'm only trying to understand
how it can be related to the PG_swap_entry logic.

>> We're talking about PG_swap_entry here. The whole point of that bit is
>> that it remains set on anonymous pages that aren't in the swap cache,
>> so that the next time we re-add them to the swap cache we'll try to
>> swap them out to the same swap entry the page occupied before.
>
>Which is bogus. If it's an anonymous page that we want to swap out to the
>same swap entry, leave it in the swap cache.

There's a very basic reason why we can't leave it in the swap cache:
the swap cache is a read-only entity, and if you do a write access you
can't leave the page in the swap cache and change it without updating
its on-disk counterpart. So we always remove anonymous pages from the
swap cache upon swapin-_write_-faults. That's also how 2.2.x works.

Then I noticed it was possible to give persistence on the swap also to
the dirtied pages, without making the swap cache dirty, by adding the
PG_swap_entry bit that tells us whether page->index is currently
caching the last swap entry where the page was allocated on the swap.
That does all the job, and we don't need a dirty swap cache anymore.

>> >If __delete_from_swap_cache() is called from a wrong code path,
>> >that's something that should be fixed, of course (but that's
>> >orthogonal to this).
>>
>> __delete_from_swap_cache is called by delete_from_swap_cache_nolock,
>> which is called by do_swap_page, which does the swapin.
>
>As well as from shrink_mmap.

I would not be complaining about your patch if you put the clear_bit
within shrink_mmap :). BTW, the SHM unmap points also have to be
checked to make sure PG_swap_entry gets cleared. SHM uses the swap
cache too, and shares the do_swap_page code.

>> >To quote from memory.c::do_swap_page():
>> >
>> >	if (write_access && !is_page_shared(page)) {
>> >		delete_from_swap_cache_nolock(page);
>> >		UnlockPage(page);
>> >
>> >If you think this is a bug, please fix it here...
>>
>> The above quoted code is correct.
>
>The code path that this patch really affects is shrink_mmap ->
>__delete_from_swap_cache.  Clearing the bit from shrink_mmap is an option,
>but it doesn't make that much sense to me; if we're removing a page from
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>the swap cache, why aren't we clearing the PG_swap_entry bit?  I'd rather
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As I just said, the whole point of PG_swap_entry is to be set on
regular anonymous swappable pages, _not_ on swap cache pages at all. So
it really makes no sense to clear the PG_swap_entry bit while removing
a page from the swap cache, and I don't see your argument. We have
instead to cover the points where a swappable page stops being
swappable. There's also the shrink_mmap case to cover, where we free
the page.

I'm thinking about whether we can avoid having to cover the shrink_mmap
case by changing the point where we set the PG_swap_entry bit. I think
it would be possible if we avoided the double page fault in the
write-swapin-access + page-is-shared case. But right now I think it's
cleaner to keep the double page fault and to clear PG_swap_entry within
shrink_mmap while dropping the page from the cache.

>leave the page in the swap cache and set PG_dirty on it than have such
>obscure semantics.

If the page is shared you can't simply set the PG_dirty bit and change
the contents of the page, or you'll screw up all the other in-core and
on-disk users of the page. You have at least to do a COW and re-add the
COWed page to a new swap cache entry, thus by definition failing to
give persistence to the page in the swap space.

In general (also for non-shared pages) setting PG_dirty is inferior
because it would waste swap space by keeping busy, on the swap side, a
swap entry that is not uptodate and that's really not a cache (there's
no way somebody else can fault in the same swap cache page, since the
user who did the write access is now the only user of the page and has
it mapped in its pte, so it can't fault on it). The reason we keep the
page in the swap cache across a read-only access is to skip a write to
disk if we run low on memory, but we can't skip the write to disk in
the write-access case...

In the case of multiple users on the swap side (for example when a
process forks with some pages in swap) PG_swap_entry can allow the two
tasks to share the same swap entry if they get swapped out and swapped
in at different times. A PG_dirty swap cache wouldn't allow that,
because of the is-page-shared issue mentioned two paragraphs above.

Andrea

* [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Ben LaHaise @ 2000-04-06 22:11 UTC
To: Andrea Arcangeli; +Cc: riel, Linus Torvalds, linux-mm

On Tue, 4 Apr 2000, Andrea Arcangeli wrote:

> Could you explain to me how acquire_swap_entry() can return an invalid
> swap entry (starting from a random page->index, of course)? I can't
> exclude that there's a bug, but acquire_swap_entry was meant to return
> only valid entries despite any possible garbage in page->index, and it
> seems to be doing that.

First off, it doesn't verify that all of the bits which should be clear
in a swap entry actually are -- e.g. bit 0 (present) can be and is
getting set. It would have to rebuild and compare the swap entry
produced by SWP_ENTRY(type, offset) to be 100% sure that it is a valid
swap entry.

And yes, I see what you're doing with the PG_swap_entry code now,
although I'm thinking it might be better done by taking a look at what
swap entries are present in the page tables near the page (since
otherwise a pair of fork()ed processes could end up splitting
contiguous swap entries if the swapout is timed just right). But that's
for later...

> >As well as from shrink_mmap.
>
> I would not be complaining about your patch if you put the clear_bit
> within shrink_mmap :).

Heheh, okay, I've moved it there. Just in case, I've also added a BUG()
check in free_pages_ok to make sure there aren't any other places that
have been missed.

		-ben

diff -ur 2.3.99-pre4-4/mm/filemap.c linux-test/mm/filemap.c
--- 2.3.99-pre4-4/mm/filemap.c	Thu Apr  6 15:03:05 2000
+++ linux-test/mm/filemap.c	Thu Apr  6 17:50:39 2000
@@ -300,6 +300,7 @@
 		if (PageSwapCache(page)) {
 			spin_unlock(&pagecache_lock);
 			__delete_from_swap_cache(page);
+			ClearPageSwapEntry(page);
 			goto made_inode_progress;
 		}
diff -ur 2.3.99-pre4-4/mm/page_alloc.c linux-test/mm/page_alloc.c
--- 2.3.99-pre4-4/mm/page_alloc.c	Thu Apr  6 15:03:05 2000
+++ linux-test/mm/page_alloc.c	Thu Apr  6 17:38:10 2000
@@ -110,6 +110,8 @@
 		BUG();
 	if (PageDecrAfter(page))
 		BUG();
+	if (PageSwapEntry(page))
+		BUG();
 
 	zone = page->zone;

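A minimal sketch of the round-trip validation Ben is suggesting above
(the helper name is hypothetical; SWP_TYPE, SWP_OFFSET and SWP_ENTRY
are the existing macros):

	/*
	 * Rebuild the entry from its decoded type/offset; any stray
	 * bits outside those fields -- e.g. bit 0, the pte present
	 * bit -- make the round trip fail.
	 */
	static inline int swap_entry_wellformed(swp_entry_t entry)
	{
		swp_entry_t rebuilt =
			SWP_ENTRY(SWP_TYPE(entry), SWP_OFFSET(entry));
		return entry.val == rebuilt.val;
	}

acquire_swap_entry() could apply such a check to the cached page->index
before trusting it, in addition to its existing range checks on the
type and offset.
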
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-07 10:45 UTC
To: Ben LaHaise; +Cc: riel, Linus Torvalds, linux-mm

On Thu, 6 Apr 2000, Ben LaHaise wrote:

>First off, it doesn't verify that all of the bits which should be clear
>in a swap entry actually are -- e.g. bit 0 (present) can be and is
>getting set. It would have to rebuild and compare the swap entry
>produced by SWP_ENTRY(type, offset) to be 100% sure that it is a valid
>swap entry.

That shouldn't be a problem. What we care about is returning a _valid_
swap entry from acquire_swap_entry; if that isn't a cached swap entry,
that's a minor problem (not a stability issue, at least). However, I
agree the current design is not very clean, not very strict, and too
silent. So now I make sure the swap-entry bitflag is set only on a good
page->index.

>Heheh, okay, I've moved it there. Just in case, I've also added a BUG()
>check in free_pages_ok to make sure there aren't any other places that
>have been missed.

That's certainly a good idea; I did that too, indeed :). I'm working on
that, and right now I am stuck fixing up SMP/non-SMP races and bugs in
the VM that I found while checking the swap-entry stuff. However, the
swap-entry stuff in my current tree should be fixed now.

This is a snapshot of my tree, so that we can stay in sync. I think the
stuff below is ready for inclusion, but if you don't agree with
something let me know, of course; thanks. It seems to run stably here.

I also included some early fixes (the non-intrusive ones): an SMP race
fix for swapoff/get_swap_page, an SMP race fix for the vmlist access
lock during swapoff, a fix so that do_wp_page knows a swap cache page
can have reference count 3 without being shared, and a change so that
the highmem page replacement also caches the swap-entry bitflag (I just
copy all the flags, which is faster). It doesn't need to copy the
mapping pointer, since it's always NULL and the new_page has it NULL
too.

It's against 2.3.99-pre4-4.

diff -urN ref/mm/filemap.c swap-entry-1/mm/filemap.c
--- ref/mm/filemap.c	Thu Apr  6 01:00:51 2000
+++ swap-entry-1/mm/filemap.c	Fri Apr  7 04:44:27 2000
@@ -300,6 +300,8 @@
 		if (PageSwapCache(page)) {
 			spin_unlock(&pagecache_lock);
 			__delete_from_swap_cache(page);
+			/* the page is local to us now */
+			page->flags &= ~(1UL << PG_swap_entry);
 			goto made_inode_progress;
 		}
diff -urN ref/mm/highmem.c swap-entry-1/mm/highmem.c
--- ref/mm/highmem.c	Mon Apr  3 03:21:59 2000
+++ swap-entry-1/mm/highmem.c	Fri Apr  7 03:48:46 2000
@@ -75,7 +75,7 @@
 
 	/* Preserve the caching of the swap_entry. */
 	highpage->index = page->index;
-	highpage->mapping = page->mapping;
+	highpage->flags = page->flags;
 
 	/*
 	 * We can just forget the old page since
diff -urN ref/mm/memory.c swap-entry-1/mm/memory.c
--- ref/mm/memory.c	Fri Apr  7 12:17:24 2000
+++ swap-entry-1/mm/memory.c	Fri Apr  7 12:26:53 2000
@@ -838,6 +838,7 @@
 	 */
 	switch (page_count(old_page)) {
 	case 2:
+	case 3:
 		/*
 		 * Lock the page so that no one can look it up from
 		 * the swap cache, grab a reference and start using it.
@@ -880,7 +881,19 @@
 		new_page = old_page;
 	}
 	spin_unlock(&tsk->mm->page_table_lock);
-	__free_page(new_page);
+	/*
+	 * We're releasing a page, it can be an anonymous
+	 * page as well. Since we don't hold any lock (except
+	 * the mmap_sem semaphore) the other user of the anonymous
+	 * page may have released it from under us and now we
+	 * could be the only owner of the page, thus put_page_testzero() can
+	 * return 1, and we have to clear the swap-entry
+	 * bitflag in such case.
+	 */
+	if (put_page_testzero(new_page)) {
+		new_page->flags &= ~(1UL << PG_swap_entry);
+		__free_pages_ok(new_page, 0);
+	}
 	return 1;
 
 bad_wp_page:
diff -urN ref/mm/page_alloc.c swap-entry-1/mm/page_alloc.c
--- ref/mm/page_alloc.c	Thu Apr  6 01:00:52 2000
+++ swap-entry-1/mm/page_alloc.c	Thu Apr  6 05:05:59 2000
@@ -110,6 +110,8 @@
 		BUG();
 	if (PageDecrAfter(page))
 		BUG();
+	if (PageSwapEntry(page))
+		BUG();
 
 	zone = page->zone;
diff -urN ref/mm/swap_state.c swap-entry-1/mm/swap_state.c
--- ref/mm/swap_state.c	Thu Apr  6 01:00:52 2000
+++ swap-entry-1/mm/swap_state.c	Fri Apr  7 12:29:00 2000
@@ -126,9 +126,14 @@
 		UnlockPage(page);
 	}
 
-	ClearPageSwapEntry(page);
-
-	__free_page(page);
+	/*
+	 * Only the last unmap have to lose the swap entry
+	 * information that we have cached into page->index.
+	 */
+	if (put_page_testzero(page)) {
+		page->flags &= ~(1UL << PG_swap_entry);
+		__free_pages_ok(page, 0);
+	}
 }
 
 
@@ -151,7 +156,7 @@
  * Right now the pagecache is 32-bit only. But it's a 32 bit index. =)
 */
 repeat:
-	found = find_lock_page(&swapper_space, entry.val);
+	found = find_get_page(&swapper_space, entry.val);
 	if (!found)
 		return 0;
 	/*
@@ -163,7 +168,6 @@
 	 * is enough to check whether the page is still in the scache.
 	 */
 	if (!PageSwapCache(found)) {
-		UnlockPage(found);
 		__free_page(found);
 		goto repeat;
 	}
@@ -172,13 +176,11 @@
 #ifdef SWAP_CACHE_INFO
 	swap_cache_find_success++;
 #endif
-	UnlockPage(found);
 	return found;
 }
 
 out_bad:
 	printk (KERN_ERR "VM: Found a non-swapper swap page!\n");
-	UnlockPage(found);
 	__free_page(found);
 	return 0;
 }
diff -urN ref/mm/swapfile.c swap-entry-1/mm/swapfile.c
--- ref/mm/swapfile.c	Thu Apr  6 01:00:52 2000
+++ swap-entry-1/mm/swapfile.c	Fri Apr  7 12:35:59 2000
@@ -212,22 +212,22 @@
 
 	/* We have the old entry in the page offset still */
 	if (!page->index)
-		goto new_swap_entry;
+		goto null_swap_entry;
 	entry.val = page->index;
 	type = SWP_TYPE(entry);
 	if (type >= nr_swapfiles)
-		goto new_swap_entry;
+		goto bad_nofile;
+	swap_list_lock();
 	p = type + swap_info;
 	if ((p->flags & SWP_WRITEOK) != SWP_WRITEOK)
-		goto new_swap_entry;
+		goto unlock_list;
 	offset = SWP_OFFSET(entry);
 	if (offset >= p->max)
-		goto new_swap_entry;
+		goto bad_offset;
 	/* Has it been re-used for something else? */
-	swap_list_lock();
 	swap_device_lock(p);
 	if (p->swap_map[offset])
-		goto unlock_new_swap_entry;
+		goto unlock;
 
 	/* We're cool, we can just use the old one */
 	p->swap_map[offset] = 1;
@@ -236,11 +236,24 @@
 	swap_list_unlock();
 	return entry;
 
-unlock_new_swap_entry:
+unlock:
 	swap_device_unlock(p);
+unlock_list:
 	swap_list_unlock();
+clear_swap_entry:
+	ClearPageSwapEntry(page);
 new_swap_entry:
 	return get_swap_page();
+
+null_swap_entry:
+	printk(KERN_WARNING __FUNCTION__ " null swap entry\n");
+	goto clear_swap_entry;
+bad_nofile:
+	printk(KERN_WARNING __FUNCTION__ " nonexistent swap file\n");
+	goto clear_swap_entry;
+bad_offset:
+	printk(KERN_WARNING __FUNCTION__ " bad offset\n");
+	goto unlock_list;
 }
 
 /*
@@ -263,8 +276,11 @@
 	/* If this entry is swap-cached, then page must already
            hold the right address for any copies in physical
            memory */
-	if (pte_page(pte) != page)
+	if (pte_page(pte) != page) {
+		if (page->index == entry.val)
+			ClearPageSwapEntry(page);
 		return;
+	}
 	/* We will be removing the swap cache in a moment, so... */
 	set_pte(dir, pte_mkdirty(pte));
 	return;
@@ -358,10 +374,20 @@
 	 */
 	if (!mm)
 		return;
+	/*
+	 * Avoid the vmas to go away from under us
+	 * and also avoids the task to play with
+	 * pagetables while we're running. If the
+	 * vmlist_modify_lock wouldn't acquire the
+	 * mm->page_table_lock spinlock we should
+	 * acquire it by hand.
+	 */
+	vmlist_access_lock(mm);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		pgd_t * pgd = pgd_offset(mm, vma->vm_start);
 		unuse_vma(vma, pgd, entry, page);
 	}
+	vmlist_access_unlock(mm);
 	return;
 }
 
@@ -418,8 +444,10 @@
 		shm_unuse(entry, page);
 	/* Now get rid of the extra reference to the temporary
           page we've been using. */
-	if (PageSwapCache(page))
+	if (PageSwapCache(page)) {
 		delete_from_swap_cache(page);
+		ClearPageSwapEntry(page);
+	}
 	__free_page(page);
 	/*
 	 * Check for and clear any overflowed swap map counts.
@@ -488,8 +516,8 @@
 		swap_list.next = swap_list.head;
 	}
 	nr_swap_pages -= p->pages;
-	swap_list_unlock();
 	p->flags = SWP_USED;
+	swap_list_unlock();
 	err = try_to_unuse(type);
 	if (err) {
 		/* re-insert swap space back into swap_list */

Andrea

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Rik van Riel @ 2000-04-07 11:29 UTC
To: Andrea Arcangeli; +Cc: Ben LaHaise, Linus Torvalds, linux-mm

On Fri, 7 Apr 2000, Andrea Arcangeli wrote:

> 			spin_unlock(&pagecache_lock);
> 			__delete_from_swap_cache(page);
> +			/* the page is local to us now */
> +			page->flags &= ~(1UL << PG_swap_entry);
> 			goto made_inode_progress;
> 		}

Please use the clear_bit() macro for this, the code is
unreadable enough in its current state...

cheers,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-07 12:00 UTC
To: riel; +Cc: Ben LaHaise, Linus Torvalds, linux-mm

On Fri, 7 Apr 2000, Rik van Riel wrote:

>Please use the clear_bit() macro for this, the code is
>unreadable enough in its current state...

I didn't use the ClearPageSwapEntry macro to avoid executing locked asm
instructions where they're not necessary.

Andrea

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Rik van Riel @ 2000-04-07 12:54 UTC
To: Andrea Arcangeli; +Cc: Ben LaHaise, Linus Torvalds, linux-mm

On Fri, 7 Apr 2000, Andrea Arcangeli wrote:
> On Fri, 7 Apr 2000, Rik van Riel wrote:
>
> >Please use the clear_bit() macro for this, the code is
> >unreadable enough in its current state...
>
> I didn't use the ClearPageSwapEntry macro to avoid executing
> locked asm instructions where they're not necessary.

Hmmmmm... Won't this screw up when another processor is atomically
setting the bit just after we removed it and we still have it in the
store queue?

from include/asm-i386/spinlock.h

/*
 * Sadly, some early PPro chips require the locked access,
 * otherwise we could just always simply do
 *
 *	#define spin_unlock_string \
 *		"movb $0,%0"
 *
 * Which is noticeably faster.
 */

I don't know if it is relevant here, but would like to be sure ...

cheers,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-07 13:14 UTC
To: riel; +Cc: Ben LaHaise, Linus Torvalds, linux-mm

On Fri, 7 Apr 2000, Rik van Riel wrote:

>Won't this screw up when another processor is atomically
>setting the bit just after we removed it and we still have
>it in the store queue?
>
>from include/asm-i386/spinlock.h
>/*
> * Sadly, some early PPro chips require the locked access,
> * otherwise we could just always simply do
> *
> *	#define spin_unlock_string \
> *		"movb $0,%0"
> *
> * Which is noticeably faster.
> */
>
>I don't know if it is relevant here, but would like to
>be sure ...

The spin_unlock case is actually not relevant; I wasn't relying on it
in the first place, since I was using C (which can implement the
read/modify/write in multiple instructions, playing with registers).

The reason we can use plain C before putting the page onto the freelist
is that we know we don't risk racing with other processors. We are
putting the page onto the freelist, and if another processor were still
playing with the page in some way, we couldn't put it on the freelist
in the first place. If some other processor/task were referencing the
page while we call free_pages_ok, that would be a major MM bug.

Andrea

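Both forms appear in the patches above; as a sketch, the distinction
Andrea is relying on is (i386-centric, the generated asm is
illustrative):

	/* Atomic: on SMP x86 this becomes a lock-prefixed btr, safe
	 * against concurrent flag updates from other CPUs, at the
	 * cost of a locked bus cycle. */
	clear_bit(PG_swap_entry, &page->flags);

	/* Plain C read-modify-write, which the compiler may split
	 * into several instructions through a register. Legitimate
	 * only when no other CPU can reach the page -- e.g.
	 * immediately before handing it back to the page allocator. */
	page->flags &= ~(1UL << PG_swap_entry);
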
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Kanoj Sarcar @ 2000-04-07 20:12 UTC
To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

Andrea,

The swaplist locking changes in swapfile.c in your patch are okay, but
they are unneeded when you consider that the kernel_lock is already
held in most of those paths. (Complete removal of the kernel_lock in
those paths is a little harder, at least when I last tried it.)

A bigger problem might be that you are violating lock ordering when you
grab the vmlist_lock from inside code that already has the
tasklist_lock in read mode (your change in unuse_process()). I may be
wrong, so you should try stress testing with swap-device removal under
a large number of runnable processes.

Also, did you have a good reason to want to make lookup_swap_cache()
invoke find_get_page(), and not find_lock_page()? I coded some of the
MP race fixes with the swap cache; some of the logic is in
Documentation/vm/locking. I remember some intense reasoning and
thinking about obscure conditions, so I am just cautious about any
locking changes.

Kanoj

[snip: the swap_state.c and swapfile.c parts of Andrea's patch, quoted
in full]

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-07 23:26 UTC
To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

On Fri, 7 Apr 2000, Kanoj Sarcar wrote:

>[..] you should try stress
>testing with swap-device removal under a large number of runnable
>processes.[..]

Swap-device removal during swapin activity is broken right now, as far
as I can see. I'm trying to fix that stuff right now.

>Also, did you have a good reason to want to make lookup_swap_cache()
>invoke find_get_page(), and not find_lock_page()? I coded some of the

Using find_lock_page and then unlocking the page is meaningless. If you
are going to unconditionally unlock the page, then you shouldn't lock
it in the first place.

Andrea

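A side-by-side sketch of the two lookup styles being argued about (both
functions exist in this kernel; error paths elided):

	/* Before the patch: take the page lock, only to drop it again
	 * on every exit path of lookup_swap_cache(). */
	found = find_lock_page(&swapper_space, entry.val);
	if (found)
		UnlockPage(found);

	/* After the patch: just take a reference. Whether the window
	 * in which the lock was held also guarded against a racing
	 * swap-cache deletion is exactly what gets questioned below. */
	found = find_get_page(&swapper_space, entry.val);
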
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Kanoj Sarcar @ 2000-04-08 0:11 UTC
To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

> On Fri, 7 Apr 2000, Kanoj Sarcar wrote:
>
> >[..] you should try stress
> >testing with swap-device removal under a large number of runnable
> >processes.[..]
>
> Swap-device removal during swapin activity is broken right now, as far
> as I can see. I'm trying to fix that stuff right now.

Be aware that I already have a patch for this. I have been meaning to
clean it up against the latest 2.3 and submit it to Linus ... FWIW, it
has been broken since 2.2.

> >Also, did you have a good reason to want to make lookup_swap_cache()
> >invoke find_get_page(), and not find_lock_page()? I coded some of the
>
> Using find_lock_page and then unlocking the page is meaningless. If you
> are going to unconditionally unlock the page, then you shouldn't lock
> it in the first place.

I will have to think a little bit about why the code does what it does
currently. I will let you know ...

Kanoj

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Kanoj Sarcar @ 2000-04-08 0:37 UTC
To: Kanoj Sarcar; +Cc: Andrea Arcangeli, Ben LaHaise, riel, Linus Torvalds, linux-mm

> > >Also, did you have a good reason to want to make lookup_swap_cache()
> > >invoke find_get_page(), and not find_lock_page()? I coded some of the
> >
> > Using find_lock_page and then unlocking the page is meaningless. If
> > you are going to unconditionally unlock the page, then you shouldn't
> > lock it in the first place.
>
> I will have to think a little bit about why the code does what it does
> currently. I will let you know ...

Okay, I think I found at least one reason why the page lock was being
taken in lookup_swap_cache(). It was effectively to check the
PageSwapCache bit, since shrink_mmap:__delete_from_swap_cache could
race with a lookup_swap_cache.

Yes, I did notice the recent shrink_mmap SMP race fixes that you
posted; now it _*might*_ be unnecessary to do a find_lock_page() in
lookup_swap_cache() (just for this race). I will have to look at the
latest prepatch to confirm that.

In any case, there are still races between swap-space deletion and
swap-cache lookup, so unless you have a good reason to replace the
find_lock_page in lookup_swap_cache() with find_get_page, I would much
rather see it stay the way it is currently ...

Kanoj

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-08 13:20 UTC
To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

On Fri, 7 Apr 2000, Kanoj Sarcar wrote:

>Okay, I think I found at least one reason why the page lock was being
>taken in lookup_swap_cache(). It was effectively to check the
>PageSwapCache bit, since shrink_mmap:__delete_from_swap_cache could
>race with a lookup_swap_cache.

shrink_mmap can't race with a find_get_page. find_get_page increments
the reference count within the critical section, and shrink_mmap checks
the page count and drops the page in one whole transaction within a
mutually exclusive critical section.

>Yes, I did notice the recent shrink_mmap SMP race fixes that you posted,

They weren't related to the cache, but only to the LRU list
insertion/deletion. There weren't races between shrink_mmap and
find_get_page and friends.

>now it _*might*_ be unnecessary to do a find_lock_page() in
>lookup_swap_cache() (just for this race). I will have to look at the

It isn't. Checking PageSwapCache while the page is locked is not
necessary. The only thing which can drop the page from the swap cache
is swapoff, which will do that as soon as you unlock before returning
from lookup_swap_cache anyway.

Andrea

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Kanoj Sarcar @ 2000-04-08 21:39 UTC
To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

As I said before, unless you have a _good_ reason, I don't see any
point in changing the code just because it appears cleaner. As you
note, there are ample races with the swap-deletion code removing the
page from the swapcache while a lookup is in progress; the PageLock
_might_ be used to fix those. In any case, after we agree that the
races have been fixed (and that's theoretically; there will probably be
races observed under real stress), it would be okay to then go and
change the find_lock_page to find_get_page in lookup_swap_cache().
IMO, of course ...

> On Fri, 7 Apr 2000, Kanoj Sarcar wrote:
>
> >Okay, I think I found at least one reason why the page lock was being
> >taken in lookup_swap_cache(). It was effectively to check the
> >PageSwapCache bit, since shrink_mmap:__delete_from_swap_cache could
> >race with a lookup_swap_cache.
>
> shrink_mmap can't race with a find_get_page. find_get_page increments
> the reference count within the critical section, and shrink_mmap
> checks the page count and drops the page in one whole transaction
> within a mutually exclusive critical section.

Okay, how's this (from pre3):

shrink_mmap
--------------                          __find_get_page
get pagemap_lru_lock                    ----------------
LockPage
drop pagemap_lru_lock
Fail if page_count(page) > 1
get pagecache_lock
get_page
Fail if page_count(page) != 2
if PageSwapCache, drop pagecache_lock   get pagecache_lock
                                        Finds page in swapcache,
                                        does get_page
                                        drop pagecache_lock
and __delete_from_swap_cache,
which releases PageLock.
                                        LockPage succeeds,
                                        erroneously believes he
                                        has swapcache page.

Did I miss some interlocking step that would prevent this from
happening?

> >Yes, I did notice the recent shrink_mmap SMP race fixes that you
> >posted,
>
> They weren't related to the cache, but only to the LRU list
> insertion/deletion. There weren't races between shrink_mmap and
> find_get_page and friends.

Ok, so that's irrelevant ...

> >now it _*might*_ be unnecessary to do a find_lock_page() in
> >lookup_swap_cache() (just for this race). I will have to look at the
>
> It isn't. Checking PageSwapCache while the page is locked is not
> necessary. The only thing which can drop the page from the swap cache
> is swapoff, which will do that as soon as you unlock before returning
> from lookup_swap_cache anyway.

Yes, see above too. It's probably better to have overenthusiastic
locking than too little. As I mentioned before, the shrink_mmap race is
probably one of the reasons I did the PageLocking in
lookup_swap_cache(); I was probably thinking of swap deletion too ...

Kanoj

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Andrea Arcangeli @ 2000-04-08 23:02 UTC
To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

On Sat, 8 Apr 2000, Kanoj Sarcar wrote:

>shrink_mmap
>--------------                          __find_get_page
>get pagemap_lru_lock                    ----------------
>LockPage
>drop pagemap_lru_lock
>Fail if page_count(page) > 1
>get pagecache_lock
>get_page
>Fail if page_count(page) != 2
>if PageSwapCache, drop pagecache_lock   get pagecache_lock
>                                        Finds page in swapcache,
>                                        does get_page
>                                        drop pagecache_lock
>and __delete_from_swap_cache,
>which releases PageLock.
>                                        LockPage succeeds,
>                                        erroneously believes he
>                                        has swapcache page.
>
>Did I miss some interlocking step that would prevent this from
>happening?

Oh, very good point indeed, I don't think you are missing anything.
Thanks for showing me that!

It seems to me the only reason we were dropping the lock earlier for
the swap cache was to be able to use remove_inode_page and so avoid
having to export a secondary remove_inode_page that doesn't grab the
page_cache_lock. It looks like the only reason was an implementation
issue.

So IMHO it would be nicer to change the locking in shrink_mmap()
instead of putting the page-cache check in the swap cache lookup fast
path. Swap cache and page cache share the same locking rules w.r.t. the
hashtable. That was the only exception as far as I can tell, and
removing it would give us a cleaner design IMHO.

What do you think about something like this?

diff -urN swap-entry-2/include/linux/mm.h swap-entry-3/include/linux/mm.h
--- swap-entry-2/include/linux/mm.h	Sat Apr  8 19:16:28 2000
+++ swap-entry-3/include/linux/mm.h	Sun Apr  9 00:18:43 2000
@@ -449,6 +449,7 @@
 struct zone_t;
 /* filemap.c */
 extern void remove_inode_page(struct page *);
+extern void __remove_inode_page(struct page *);
 extern unsigned long page_unuse(struct page *);
 extern int shrink_mmap(int, int, zone_t *);
 extern void truncate_inode_pages(struct address_space *, loff_t);
diff -urN swap-entry-2/include/linux/swap.h swap-entry-3/include/linux/swap.h
--- swap-entry-2/include/linux/swap.h	Sat Apr  8 18:08:37 2000
+++ swap-entry-3/include/linux/swap.h	Sun Apr  9 00:47:42 2000
@@ -105,6 +105,7 @@
 /*
  * Make these inline later once they are working properly.
  */
+extern void shrink_swap_cache(struct page *);
 extern void unlink_from_swap_cache(struct page *);
 extern void __delete_from_swap_cache(struct page *page);
 extern void delete_from_swap_cache(struct page *page);
diff -urN swap-entry-2/mm/filemap.c swap-entry-3/mm/filemap.c
--- swap-entry-2/mm/filemap.c	Sat Apr  8 04:46:04 2000
+++ swap-entry-3/mm/filemap.c	Sun Apr  9 00:39:23 2000
@@ -77,6 +77,13 @@
 	atomic_dec(&page_cache_size);
 }
 
+inline void __remove_inode_page(struct page *page)
+{
+	remove_page_from_inode_queue(page);
+	remove_page_from_hash_queue(page);
+	page->mapping = NULL;
+}
+
 /*
  * Remove a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
@@ -88,9 +95,7 @@
 		PAGE_BUG(page);
 
 	spin_lock(&pagecache_lock);
-	remove_page_from_inode_queue(page);
-	remove_page_from_hash_queue(page);
-	page->mapping = NULL;
+	__remove_inode_page(page);
 	spin_unlock(&pagecache_lock);
 }
 
@@ -298,8 +303,8 @@
 		 * were to be marked referenced..
 		 */
 		if (PageSwapCache(page)) {
+			shrink_swap_cache(page);
 			spin_unlock(&pagecache_lock);
-			__delete_from_swap_cache(page);
 			/* the page is local to us now */
 			page->flags &= ~(1UL << PG_swap_entry);
 			goto made_inode_progress;
diff -urN swap-entry-2/mm/swap_state.c swap-entry-3/mm/swap_state.c
--- swap-entry-2/mm/swap_state.c	Sat Apr  8 17:29:46 2000
+++ swap-entry-3/mm/swap_state.c	Sun Apr  9 00:39:17 2000
@@ -55,6 +55,19 @@
 	return ret;
 }
 
+static inline void __remove_from_swap_cache(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+
+	if (mapping != &swapper_space)
+		BUG();
+	if (!PageSwapCache(page) || !PageLocked(page))
+		PAGE_BUG(page);
+
+	PageClearSwapCache(page);
+	__remove_inode_page(page);
+}
+
 static inline void remove_from_swap_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
@@ -76,6 +89,20 @@
 	lru_cache_del(page);
 	remove_from_swap_cache(page);
 	__free_page(page);
+}
+
+/* called by shrink_mmap() with the page_cache_lock held */
+void shrink_swap_cache(struct page *page)
+{
+	swp_entry_t entry;
+
+	entry.val = page->index;
+
+#ifdef SWAP_CACHE_INFO
+	swap_cache_del_total++;
+#endif
+	__remove_from_swap_cache(page);
+	swap_free(entry);
 }
 
 /*

The other option is to keep the checks in the lookup swap cache fast
path.

Comments?

Andrea

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
From: Kanoj Sarcar @ 2000-04-08 23:18 UTC
To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

> On Sat, 8 Apr 2000, Kanoj Sarcar wrote:
> [snip: the shrink_mmap / __find_get_page race example]
>
> Oh, very good point indeed, I don't think you are missing anything.
> Thanks for showing me that!
>
> It seems to me the only reason we were dropping the lock earlier for
> the swap cache was to be able to use remove_inode_page and so avoid
> having to export a secondary remove_inode_page that doesn't grab the
> page_cache_lock. It looks like the only reason was an implementation
> issue.

Agreed.

> So IMHO it would be nicer to change the locking in shrink_mmap()
> instead of putting the page-cache check in the swap cache lookup fast
> path. Swap cache and page cache share the same locking rules w.r.t.
> the hashtable. That was the only exception as far as I can tell, and
> removing it would give us a cleaner design IMHO.

Yes, it's easy enough to do; I am sure your attached patch does it
properly.

Notwithstanding, I will repeat this again. Just because you change how
shrink_mmap deletes swapcache pages does not make it a good idea to
change lookup_swap_cache(). Let's first fix the swap cache deletion
races; then you can change lookup_swap_cache() not to lock the page ...
which is trivial anyway, once we think we have all the holes nailed.

I am off to polishing my previous swapoff patch. Andrea, it might make
it easier to understand your previous patch if you were to send out a
list of the specific races it is fixing. (Something like my race
example above.)

Kanoj

> What do you think about something like this?

[snip: Andrea's shrink_swap_cache patch, quoted in full]

* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 23:18 ` Kanoj Sarcar @ 2000-04-08 23:58 ` Andrea Arcangeli 0 siblings, 0 replies; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-08 23:58 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Sat, 8 Apr 2000, Kanoj Sarcar wrote: >[..] if you were to send out a list of >the specific races it is fixing. (Something like my above race example.) Ugh, writing the traces for all the possible races would take me a really long time. Writing traces is fine for showing _the_ buggy path, but it doesn't seem the right approach for explaining and understanding how some new code works. The comment should explain how the new locking works against swapoff. I'd very much prefer specific questions and comments. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 0:11 ` Kanoj Sarcar 2000-04-08 0:37 ` Kanoj Sarcar @ 2000-04-08 13:30 ` Andrea Arcangeli 2000-04-08 17:39 ` Andrea Arcangeli 1 sibling, 1 reply; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-08 13:30 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Fri, 7 Apr 2000, Kanoj Sarcar wrote: >Be aware that I already have a patch for this. I have been meaning to I've a patch for this too now. Are you using read_swap_cache from any swapin event? The problem is swapin can't use read_swap_cache, because with read_swap_cache we would never know if we're doing I/O on an inactive swap entry. Only swapoff can use read_swap_cache. My current tree does this: it uses the swap cache as the locking entity to serialize with unuse_process, plus checks on the pte with the page cache lock acquired, to know whether lookup_swap_cache (or the swap cache miss path) returned a swap cache page or not (if not, we have to give up without changing the pte, since swapoff just resolved the page fault from under us). Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
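The pte check Andrea describes boils down to a small pattern, which his patch in the follow-up spells as the pte_changed() macro. Roughly (a sketch using the names from that patch):

        spin_lock(&mm->page_table_lock);
        if (pte_val(*page_table) != pte_val(swp_entry_to_pte(entry))) {
                /* swapoff (or a parallel fault) already serviced this
                 * entry: back out without touching the pte, undoing
                 * our own swap cache insertion */
                spin_unlock(&mm->page_table_lock);
                unlink_from_swap_cache(page);
                UnlockPage(page);
                __free_page(page);
                return 1;
        }
        /* still ours: safe to swap_duplicate() and map the page */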
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 13:30 ` Andrea Arcangeli @ 2000-04-08 17:39 ` Andrea Arcangeli 0 siblings, 0 replies; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-08 17:39 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Sat, 8 Apr 2000, Andrea Arcangeli wrote: >I've a patch for this too now. Are you using read_swap_cache from any Here is my approach. I tried to explain the subtle thoughts in the comments. It's against 2.3.99-pre4-4 + the swap-entry fix previously posted on the list in the mail with Message-ID: <Pine.LNX.4.21.0004071205300.737-100000@alpha.random>. It seems stable here. The only minor weird thing I noticed so far after the swapoff+swapin stress testing is this: [..] Swap cache: add 107029, delete 105331, find 34046/62876 [..] 0 pages swap cached [..] but that's explained by the fact that the stat information isn't increased with atomic_inc() (if we really want exact stat info we should split each counter into per-CPU counters and sum all the per-CPU counters to get the info; see the sketch after this mail) diff -urN swap-entry-1/include/linux/pagemap.h swap-entry-2/include/linux/pagemap.h --- swap-entry-1/include/linux/pagemap.h Fri Apr 7 18:16:10 2000 +++ swap-entry-2/include/linux/pagemap.h Sat Apr 8 19:16:28 2000 @@ -80,6 +80,9 @@ extern void __add_page_to_hash_queue(struct page * page, struct page **p); extern void add_to_page_cache(struct page * page, struct address_space *mapping, unsigned long index); +extern int __add_to_page_cache_unique(struct page *, struct address_space *, unsigned long, struct page **); +#define add_to_page_cache_unique(page, mapping, index) \ + __add_to_page_cache_unique(page, mapping, index, page_hash(mapping, index)) extern inline void add_page_to_hash_queue(struct page * page, struct inode * inode, unsigned long index) { diff -urN swap-entry-1/include/linux/swap.h swap-entry-2/include/linux/swap.h --- swap-entry-1/include/linux/swap.h Fri Apr 7 02:00:28 2000 +++ swap-entry-2/include/linux/swap.h Sat Apr 8 18:08:37 2000 @@ -95,15 +95,17 @@ /* linux/mm/swap_state.c */ extern void show_swap_cache_info(void); -extern void add_to_swap_cache(struct page *, swp_entry_t); +extern int add_to_swap_cache_unique(struct page *, swp_entry_t); extern int swap_check_entry(unsigned long); -extern struct page * lookup_swap_cache(swp_entry_t); +extern struct page * find_get_swap_cache(swp_entry_t); +extern struct page * find_lock_swap_cache(swp_entry_t); extern struct page * read_swap_cache_async(swp_entry_t, int); #define read_swap_cache(entry) read_swap_cache_async(entry, 1); /* * Make these inline later once they are working properly. */ +extern void unlink_from_swap_cache(struct page *); extern void __delete_from_swap_cache(struct page *page); extern void delete_from_swap_cache(struct page *page); extern void delete_from_swap_cache_nolock(struct page *page); diff -urN swap-entry-1/ipc/shm.c swap-entry-2/ipc/shm.c --- swap-entry-1/ipc/shm.c Fri Apr 7 18:11:37 2000 +++ swap-entry-2/ipc/shm.c Sat Apr 8 04:15:20 2000 @@ -1334,7 +1334,7 @@ swp_entry_t entry = pte_to_swp_entry(pte); shm_unlock(shp->id); - page = lookup_swap_cache(entry); + page = find_get_swap_cache(entry); if (!page) { lock_kernel(); swapin_readahead(entry); @@ -1416,7 +1416,8 @@ reading a not yet uptodate block from disk. NOTE: we just accounted the swap space reference for this swap cache page at __get_swap_page() time.
*/ - add_to_swap_cache(*outpage = page_map, swap_entry); + if (add_to_swap_cache_unique(*outpage = page_map, swap_entry)) + BUG(); return OKAY; } diff -urN swap-entry-1/mm/filemap.c swap-entry-2/mm/filemap.c --- swap-entry-1/mm/filemap.c Fri Apr 7 18:27:22 2000 +++ swap-entry-2/mm/filemap.c Sat Apr 8 04:46:04 2000 @@ -488,7 +488,7 @@ spin_unlock(&pagecache_lock); } -static int add_to_page_cache_unique(struct page * page, +int __add_to_page_cache_unique(struct page * page, struct address_space *mapping, unsigned long offset, struct page **hash) { @@ -529,7 +529,7 @@ if (!page) return -ENOMEM; - if (!add_to_page_cache_unique(page, mapping, offset, hash)) { + if (!__add_to_page_cache_unique(page, mapping, offset, hash)) { int error = mapping->a_ops->readpage(file->f_dentry, page); page_cache_release(page); return error; @@ -2291,7 +2291,7 @@ return ERR_PTR(-ENOMEM); } page = cached_page; - if (add_to_page_cache_unique(page, mapping, index, hash)) + if (__add_to_page_cache_unique(page, mapping, index, hash)) goto repeat; cached_page = NULL; err = filler(data, page); @@ -2318,7 +2318,7 @@ return NULL; } page = *cached_page; - if (add_to_page_cache_unique(page, mapping, index, hash)) + if (__add_to_page_cache_unique(page, mapping, index, hash)) goto repeat; *cached_page = NULL; } diff -urN swap-entry-1/mm/memory.c swap-entry-2/mm/memory.c --- swap-entry-1/mm/memory.c Fri Apr 7 18:27:22 2000 +++ swap-entry-2/mm/memory.c Sat Apr 8 19:08:16 2000 @@ -217,7 +217,8 @@ if (pte_none(pte)) goto cont_copy_pte_range; if (!pte_present(pte)) { - swap_duplicate(pte_to_swp_entry(pte)); + if (!swap_duplicate(pte_to_swp_entry(pte))) + BUG(); set_pte(dst_pte, pte); goto cont_copy_pte_range; } @@ -1019,47 +1020,120 @@ void swapin_readahead(swp_entry_t entry) { int i, num; - struct page *new_page; + struct page * page = NULL; unsigned long offset; + swp_entry_t __entry; /* * Get the number of handles we should do readahead io to. Also, * grab temporary references on them, releasing them as io completes. + * + * At this point we only know the swap device can't go away from under + * us because of the caller locking. + * + * Ugly: we're serializing with swapoff using the big kernel lock. 
*/ num = valid_swaphandles(entry, &offset); for (i = 0; i < num; offset++, i++) { /* Don't block on I/O for read-ahead */ - if (atomic_read(&nr_async_pages) >= pager_daemon.swap_cluster) { - while (i++ < num) - swap_free(SWP_ENTRY(SWP_TYPE(entry), offset++)); + if (atomic_read(&nr_async_pages) >= pager_daemon.swap_cluster) break; - } /* Ok, do the async read-ahead now */ - new_page = read_swap_cache_async(SWP_ENTRY(SWP_TYPE(entry), offset), 0); - if (new_page != NULL) - __free_page(new_page); - swap_free(SWP_ENTRY(SWP_TYPE(entry), offset)); + if (!page) { + page = alloc_page(GFP_USER); + if (!page) + return; + } + __entry = SWP_ENTRY(SWP_TYPE(entry), offset); + if (add_to_swap_cache_unique(page, __entry)) + continue; + if (!swap_duplicate(__entry)) { + swap_free(__entry); + unlink_from_swap_cache(page); + UnlockPage(page); + continue; + } + rw_swap_page(READ, page, 0); + __free_page(page); + page = NULL; } + if (page) + __free_page(page); return; } +#define pte_changed(page_table, entry) \ + (pte_val(*page_table) != pte_val(swp_entry_to_pte(entry))) + +/* This is lock-land */ static int do_swap_page(struct task_struct * tsk, struct vm_area_struct * vma, unsigned long address, pte_t * page_table, swp_entry_t entry, int write_access) { - struct page *page = lookup_swap_cache(entry); + struct page *page; pte_t pte; + spinlock_t * page_table_lock = &vma->vm_mm->page_table_lock; + /* + * find_lock_swap_cache() can return a non swap cache page + * (because find_lock_page() acquires the lock after + * dropping the page_cache_lock). + * We handle the coherency with unmap_process later + * while checking if the pagetable is changed from + * under us. If the pagetable isn't changed from + * under us then `page' is a swap cache page. + */ + repeat: + page = find_lock_swap_cache(entry); if (!page) { + page = alloc_page(GFP_USER); + if (!page) + return -1; + + if (add_to_swap_cache_unique(page, entry)) { + __free_page(page); + goto repeat; + } + + spin_lock(page_table_lock); + /* + * If the pte is changed and we added the page to + * the swap cache successfully it means the entry + * is gone away and also the swap device is + * potentially gone away. + */ + if (pte_changed(page_table, entry)) + goto unlink; + spin_unlock(page_table_lock); + + /* + * Account the swap cache reference on the swap + * side. We have the swap entry locked via + * swap cache locking protocol described below. + * If the entry gone away it means something + * gone badly wrong... + */ + if (!swap_duplicate(entry)) + BUG(); + + /* + * At this point we know unuse_process() have + * not yet processed this pte and we also hold + * the lock on the page so unuse_process() will + * wait for us to finish the I/O. This way we are + * sure to do I/O from a still SWP_USED swap device + * and that the swap device won't go away while + * we're waiting I/O completation. + */ lock_kernel(); swapin_readahead(entry); - page = read_swap_cache(entry); + rw_swap_page(READ, page, 1); unlock_kernel(); - if (!page) - return -1; flush_page_to_ram(page); flush_icache_page(vma, page); + + lock_page(page); } vma->vm_mm->rss++; @@ -1067,6 +1141,10 @@ pte = mk_pte(page, vma->vm_page_prot); + spin_lock(page_table_lock); + if (pte_changed(page_table, entry)) + goto unlock; + SetPageSwapEntry(page); /* @@ -1074,7 +1152,6 @@ * Must lock page before transferring our swap count to already * obtained page count. 
*/ - lock_page(page); swap_free(entry); if (write_access && !is_page_shared(page)) { delete_from_swap_cache_nolock(page); @@ -1086,10 +1163,31 @@ UnlockPage(page); set_pte(page_table, pte); + spin_unlock(page_table_lock); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, address, pte); return 1; + + unlink: + spin_unlock(page_table_lock); + unlink_from_swap_cache(page); + UnlockPage(page); + __free_page(page); + return 1; + + unlock: + /* + * If the page is still in the swap cache it + * will be swapoff that will remove it from there + * later. + */ + spin_unlock(page_table_lock); + UnlockPage(page); + __free_page(page); + return 1; } + +#undef pte_changed /* * This only needs the MM semaphore diff -urN swap-entry-1/mm/page_io.c swap-entry-2/mm/page_io.c --- swap-entry-1/mm/page_io.c Wed Dec 8 00:05:28 1999 +++ swap-entry-2/mm/page_io.c Sat Apr 8 02:49:51 2000 @@ -128,8 +128,9 @@ { struct page *page = mem_map + MAP_NR(buf); - if (!PageLocked(page)) + if (PageLocked(page)) PAGE_BUG(page); + lock_page(page); if (PageSwapCache(page)) PAGE_BUG(page); if (!rw_swap_page_base(rw, entry, page, wait)) diff -urN swap-entry-1/mm/swap_state.c swap-entry-2/mm/swap_state.c --- swap-entry-1/mm/swap_state.c Fri Apr 7 18:27:22 2000 +++ swap-entry-2/mm/swap_state.c Sat Apr 8 17:29:46 2000 @@ -40,16 +40,19 @@ } #endif -void add_to_swap_cache(struct page *page, swp_entry_t entry) +int add_to_swap_cache_unique(struct page * page, swp_entry_t entry) { + int ret; + #ifdef SWAP_CACHE_INFO swap_cache_add_total++; #endif - if (PageTestandSetSwapCache(page)) - BUG(); if (page->mapping) BUG(); - add_to_page_cache(page, &swapper_space, entry.val); + ret = add_to_page_cache_unique(page, &swapper_space, entry.val); + if (!ret && PageTestandSetSwapCache(page)) + BUG(); + return ret; } static inline void remove_from_swap_cache(struct page *page) @@ -65,6 +68,16 @@ remove_inode_page(page); } +void unlink_from_swap_cache(struct page * page) +{ +#ifdef SWAP_CACHE_INFO + swap_cache_del_total++; +#endif + lru_cache_del(page); + remove_from_swap_cache(page); + __free_page(page); +} + /* * This must be called only on pages that have * been verified to be in the swap cache. @@ -95,7 +108,7 @@ lru_cache_del(page); __delete_from_swap_cache(page); - page_cache_release(page); + __free_page(page); } /* @@ -144,45 +157,53 @@ * lock before returning. */ -struct page * lookup_swap_cache(swp_entry_t entry) +struct page * find_get_swap_cache(swp_entry_t entry) { struct page *found; #ifdef SWAP_CACHE_INFO swap_cache_find_total++; #endif - while (1) { - /* - * Right now the pagecache is 32-bit only. But it's a 32 bit index. =) - */ -repeat: - found = find_get_page(&swapper_space, entry.val); - if (!found) - return 0; - /* - * Though the "found" page was in the swap cache an instant - * earlier, it might have been removed by shrink_mmap etc. - * Re search ... Since find_lock_page grabs a reference on - * the page, it can not be reused for anything else, namely - * it can not be associated with another swaphandle, so it - * is enough to check whether the page is still in the scache. 
- */ - if (!PageSwapCache(found)) { - __free_page(found); - goto repeat; - } - if (found->mapping != &swapper_space) - goto out_bad; + + found = find_get_page(&swapper_space, entry.val); + if (!found) + return NULL; + if (found->mapping != &swapper_space) + goto out_bad; #ifdef SWAP_CACHE_INFO - swap_cache_find_success++; + swap_cache_find_success++; #endif - return found; - } + return found; out_bad: printk (KERN_ERR "VM: Found a non-swapper swap page!\n"); __free_page(found); - return 0; + return NULL; +} + +struct page * find_lock_swap_cache(swp_entry_t entry) +{ + struct page *found; + +#ifdef SWAP_CACHE_INFO + swap_cache_find_total++; +#endif + + found = find_lock_page(&swapper_space, entry.val); + if (!found) + return NULL; + if (found->mapping != &swapper_space) + goto out_bad; +#ifdef SWAP_CACHE_INFO + swap_cache_find_success++; +#endif + return found; + +out_bad: + printk (KERN_ERR "VM: Found a non-swapper locked swap page!\n"); + UnlockPage(found); + __free_page(found); + return NULL; } /* @@ -192,47 +213,40 @@ * * A failure return means that either the page allocation failed or that * the swap entry is no longer in use. + * WARNING: only swapoff can use this function. */ struct page * read_swap_cache_async(swp_entry_t entry, int wait) { struct page *found_page = 0, *new_page; - unsigned long new_page_addr; - - /* - * Make sure the swap entry is still in use. - */ - if (!swap_duplicate(entry)) /* Account for the swap cache */ - goto out; + /* * Look for the page in the swap cache. */ - found_page = lookup_swap_cache(entry); + found_page = find_get_swap_cache(entry); if (found_page) - goto out_free_swap; + goto out; - new_page_addr = __get_free_page(GFP_USER); - if (!new_page_addr) - goto out_free_swap; /* Out of memory */ - new_page = mem_map + MAP_NR(new_page_addr); + new_page = alloc_page(GFP_USER); + if (!new_page) + goto out; /* Out of memory */ - /* - * Check the swap cache again, in case we stalled above. - */ - found_page = lookup_swap_cache(entry); - if (found_page) - goto out_free_page; /* * Add it to the swap cache and read its contents. */ - add_to_swap_cache(new_page, entry); + while (add_to_swap_cache_unique(new_page, entry)) { + found_page = find_get_swap_cache(entry); + if (found_page) + goto out_free_page; + } + + swap_duplicate(entry); + rw_swap_page(READ, new_page, wait); return new_page; out_free_page: __free_page(new_page); -out_free_swap: - swap_free(entry); out: return found_page; } diff -urN swap-entry-1/mm/swapfile.c swap-entry-2/mm/swapfile.c --- swap-entry-1/mm/swapfile.c Fri Apr 7 18:27:22 2000 +++ swap-entry-2/mm/swapfile.c Sat Apr 8 19:02:20 2000 @@ -437,17 +437,38 @@ swap_free(entry); return -ENOMEM; } + + lock_page(page); + /* + * Only swapout can drop referenced pages from + * the swap cache. + */ + if (!PageSwapCache(page)) + BUG(); + /* + * Do a fast check to see if it's an orphaned swap cache + * entry to learn if we really need to slowly browse the ptes. + */ + if (!is_page_shared(page)) + goto orphaned; + UnlockPage(page); + read_lock(&tasklist_lock); for_each_task(p) unuse_process(p->mm, entry, page); read_unlock(&tasklist_lock); shm_unuse(entry, page); + + lock_page(page); /* Now get rid of the extra reference to the temporary page we've been using. 
*/ - if (PageSwapCache(page)) { - delete_from_swap_cache(page); - ClearPageSwapEntry(page); - } + if (!PageSwapCache(page)) + BUG(); + orphaned: + delete_from_swap_cache_nolock(page); + ClearPageSwapEntry(page); + + UnlockPage(page); __free_page(page); /* * Check for and clear any overflowed swap map counts. @@ -710,7 +731,6 @@ goto bad_swap; } - lock_page(mem_map + MAP_NR(swap_header)); rw_swap_page_nolock(READ, SWP_ENTRY(type,0), (char *) swap_header, 1); if (!memcmp("SWAP-SPACE",swap_header->magic.magic,10)) @@ -902,22 +922,19 @@ offset = SWP_OFFSET(entry); if (offset >= p->max) goto bad_offset; - if (!p->swap_map[offset]) - goto bad_unused; /* * Entry is valid, so increment the map count. */ swap_device_lock(p); if (p->swap_map[offset] < SWAP_MAP_MAX) - p->swap_map[offset]++; + result = p->swap_map[offset]++; else { static int overflow = 0; if (overflow++ < 5) printk("VM: swap entry overflow\n"); - p->swap_map[offset] = SWAP_MAP_MAX; + result = p->swap_map[offset] = SWAP_MAP_MAX; } swap_device_unlock(p); - result = 1; out: return result; @@ -927,9 +944,6 @@ bad_offset: printk("Bad swap offset entry %08lx\n", entry.val); goto out; -bad_unused: - printk("Unused swap offset entry in swap_dup %08lx\n", entry.val); - goto out; } /* @@ -1027,20 +1041,22 @@ *offset = SWP_OFFSET(entry); toff = *offset = (*offset >> page_cluster) << page_cluster; + if ((swapdev->flags & SWP_WRITEOK) != SWP_WRITEOK) + goto out; swap_device_lock(swapdev); do { /* Don't read-ahead past the end of the swap area */ if (toff >= swapdev->max) break; - /* Don't read in bad or busy pages */ + /* Don't read in bad or empty pages */ if (!swapdev->swap_map[toff]) break; if (swapdev->swap_map[toff] == SWAP_MAP_BAD) break; - swapdev->swap_map[toff]++; toff++; ret++; } while (--i); swap_device_unlock(swapdev); + out: return ret; } diff -urN swap-entry-1/mm/vmscan.c swap-entry-2/mm/vmscan.c --- swap-entry-1/mm/vmscan.c Thu Apr 6 01:00:52 2000 +++ swap-entry-2/mm/vmscan.c Sat Apr 8 17:38:10 2000 @@ -72,7 +72,8 @@ */ if (PageSwapCache(page)) { entry.val = page->index; - swap_duplicate(entry); + if (!swap_duplicate(entry)) + BUG(); set_pte(page_table, swp_entry_to_pte(entry)); drop_pte: vma->vm_mm->rss--; @@ -157,10 +158,12 @@ if (!(page = prepare_highmem_swapout(page))) goto out_swap_free; - swap_duplicate(entry); /* One for the process, one for the swap cache */ - + if (!swap_duplicate(entry)) /* One for the process, one for the swap cache */ + BUG(); /* This will also lock the page */ - add_to_swap_cache(page, entry); + if (add_to_swap_cache_unique(page, entry)) + BUG(); + /* Put the swap entry into the pte after the page is in swapcache */ vma->vm_mm->rss--; set_pte(page_table, swp_entry_to_pte(entry)); I have not looked into the shm_unuse part yet but I guess it needs fixing too (however shm compiles and it should keep working as before at least). Comments are welcome. Thanks. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
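Andrea's aside above about the inexact statistics — that exact counts would need per-CPU counters summed at read time — would look roughly like this (a hypothetical sketch; the names are illustrative and not from the patch, and in practice the slots would want cacheline padding to avoid false sharing):

static unsigned long swap_cache_add_count[NR_CPUS];

static inline void swap_cache_add_inc(void)
{
        /* no lock and no locked instruction: the slot is CPU-private */
        swap_cache_add_count[smp_processor_id()]++;
}

static unsigned long swap_cache_add_total(void)
{
        unsigned long sum = 0;
        int i;

        for (i = 0; i < NR_CPUS; i++)
                sum += swap_cache_add_count[i]; /* may be slightly stale */
        return sum;
}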
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-07 20:12 ` Kanoj Sarcar 2000-04-07 23:26 ` Andrea Arcangeli @ 2000-04-07 23:54 ` Andrea Arcangeli 2000-04-08 0:15 ` Kanoj Sarcar 2000-04-08 0:04 ` Andrea Arcangeli 2 siblings, 1 reply; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-07 23:54 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Fri, 7 Apr 2000, Kanoj Sarcar wrote: >[..] A bigger problem might >be that you are violating lock orders when you grab the vmlist_lock >from inside code that already has tasklist_lock in readmode [..] Conceptually it's the obviously right locking order. The mm exists in the context of a task struct. So first grabbing the tasklist lock, finding the task_struct, and then locking its mm before playing with it looks like the natural ordering of things and how things should be done. BTW, swap_out() always used the same locking order that I added to swapoff, so if my patch is wrong, swap_out() has always been wrong as well ;). I had a quick look and it seems nobody is going to harm swap_out and swapoff, but if somebody is using the inverse lock order I'd much prefer to fix that path, because the locking design of swapoff and swap_out looks like the obviously right one to me. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
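Spelled out, the ordering Andrea argues for is the one swap_out() already follows (a sketch with 2.3.x names):

        read_lock(&tasklist_lock);          /* 1: pin the task list */
        for_each_task(p) {
                struct mm_struct *mm = p->mm;

                vmlist_access_lock(mm);     /* 2: then the mm's vma/pte lock */
                /* ... browse the vmas and page tables, fix ptes ... */
                vmlist_access_unlock(mm);
        }
        read_unlock(&tasklist_lock);

A deadlock would need some path taking the mm lock first and the tasklist_lock inside it; the question raised below is whether such a path exists.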
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-07 23:54 ` Andrea Arcangeli @ 2000-04-08 0:15 ` Kanoj Sarcar 2000-04-08 13:14 ` Andrea Arcangeli 0 siblings, 1 reply; 47+ messages in thread From: Kanoj Sarcar @ 2000-04-08 0:15 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm > > On Fri, 7 Apr 2000, Kanoj Sarcar wrote: > > >[..] A bigger problem might > >be that you are violating lock orders when you grab the vmlist_lock > >from inside code that already has tasklist_lock in readmode [..] > > Conceptually it's the obviously right locking order. The mm exists in > the context of a task struct. So first grabbing the tasklist lock, finding > the task_struct, and then locking its mm before playing with it looks like > the natural ordering of things and how things should be done. > > BTW, swap_out() always used the same locking order that I added to swapoff, > so if my patch is wrong, swap_out() has always been wrong as well ;). Not sure what you mean ... swap_out never grabbed the mmap_sem/page_table_lock before (in 2.2 too). > > I had a quick look and it seems nobody is going to harm swap_out and > swapoff, but if somebody is using the inverse lock order I'd much prefer to fix > that path, because the locking design of swapoff and swap_out looks like the > obviously right one to me. Okay, give it a shot, but I think changing the places which hold tasklist_lock might be a bigger effort. In any case, for vm activity like swap space deletion, holding the tasklist_lock is the worst possible alternative, and should be done only if other alternatives are too intrusive. I know, it has been like this for a while now ... Oh, btw, before we start discussing this, you should run that stress test to make sure whether a lock order violation actually exists or not ... Kanoj > > Andrea > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 0:15 ` Kanoj Sarcar @ 2000-04-08 13:14 ` Andrea Arcangeli 2000-04-08 21:47 ` Kanoj Sarcar 0 siblings, 1 reply; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-08 13:14 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Fri, 7 Apr 2000, Kanoj Sarcar wrote: >> BTW, swap_out() always used the same locking order that I added to swapoff, >> so if my patch is wrong, swap_out() has always been wrong as well ;). > >Not sure what you mean ... swap_out never grabbed the mmap_sem/page_table_lock >before (in 2.2 too). In 2.2.x page_table_lock wasn't necessary because we were holding the big kernel lock. In 2.3.x vmlist_*_lock is an alias for spin_lock(&mm->page_table_lock), and swap_out isn't even calling the spin_lock explicitly, but it's doing what the fixed swapoff does. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
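For reference, the aliasing in question looks approximately like this in the 2.3.99 headers (quoted from memory, so treat the exact form as an assumption):

#define vmlist_access_lock(mm)          spin_lock(&(mm)->page_table_lock)
#define vmlist_access_unlock(mm)        spin_unlock(&(mm)->page_table_lock)
#define vmlist_modify_lock(mm)          vmlist_access_lock(mm)
#define vmlist_modify_unlock(mm)        vmlist_access_unlock(mm)

So taking vmlist_access_lock() is taking the mm's page_table_lock, which is why swap_out() never names the spinlock explicitly.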
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 13:14 ` Andrea Arcangeli @ 2000-04-08 21:47 ` Kanoj Sarcar 2000-04-08 23:10 ` Andrea Arcangeli 2000-04-10 19:10 ` Stephen C. Tweedie 0 siblings, 2 replies; 47+ messages in thread From: Kanoj Sarcar @ 2000-04-08 21:47 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm > > On Fri, 7 Apr 2000, Kanoj Sarcar wrote: > > >> BTW, swap_out() always used the same locking order that I added to swapoff, > >> so if my patch is wrong, swap_out() has always been wrong as well ;). > > > >Not sure what you mean ... swap_out never grabbed the mmap_sem/page_table_lock > >before (in 2.2 too). > > In 2.2.x page_table_lock wasn't necessary because we were holding the big > kernel lock. You have answered your own question in a later email. I quote you: "Are you using read_swap_cache from any swapin event? The problem is swapin can't use read_swap_cache because with read_swap_cache we would never know if we're doing I/O on an inactive swap entry" Grabbing lock_kernel in swap_in is not enough, since it might get dropped if the swap_in code goes to sleep for some reason. Kanoj > > In 2.3.x vmlist_*_lock is an alias for spin_lock(&mm->page_table_lock), and > swap_out isn't even calling the spin_lock explicitly, but it's doing what > the fixed swapoff does. > > Andrea > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux.eu.org/Linux-MM/ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
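The reason lock_kernel() cannot protect anything across a sleep is the scheduler itself: the big lock is dropped when a task blocks and retaken when it runs again. Heavily simplified from 2.3.x kernel/sched.c (a sketch, details elided):

asmlinkage void schedule(void)
{
        /* ... */
        release_kernel_lock(prev, this_cpu);    /* BKL dropped here */
        /* ... pick the next task, context switch ... */
        reacquire_kernel_lock(current);         /* retaken on wakeup */
}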
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 21:47 ` Kanoj Sarcar @ 2000-04-08 23:10 ` Andrea Arcangeli 2000-04-08 23:21 ` Kanoj Sarcar 0 siblings, 1 reply; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-08 23:10 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Sat, 8 Apr 2000, Kanoj Sarcar wrote: >> >> On Fri, 7 Apr 2000, Kanoj Sarcar wrote: >> >> >> BTW, swap_out() always used the same locking order that I added to swapoff, >> >> so if my patch is wrong, swap_out() has always been wrong as well ;). >> > >> >Not sure what you mean ... swap_out never grabbed the mmap_sem/page_table_lock >> >before (in 2.2 too). >> >> In 2.2.x page_table_lock wasn't necessary because we were holding the big >> kernel lock. > >You have answered your own question in a later email. I quote you: >"Are you using read_swap_cache from any >swapin event? The problem is swapin can't use read_swap_cache because with >read_swap_cache we would never know if we're doing I/O on an inactive swap >entry" > >Grabbing lock_kernel in swap_in is not enough, since it might get dropped >if the swap_in code goes to sleep for some reason. I wasn't talking about races between swapoff/swapin. I was talking about the locking order issue you raised about the necessary vmlist_*_lock I added in swapoff. What I meant is that in 2.2.x there was no need for the vmlist_*_lock/page_cache_lock in swapoff, because we were relying on the big kernel lock while playing with pagetables and vmas (same in swap_out()). In 2.3.x both swap_out and swapoff need to grab the tasklist_lock first (as in 2.2.x) and then the vmlist_*_lock (otherwise the vma browsing may happen during a vma list modification). Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 23:10 ` Andrea Arcangeli @ 2000-04-08 23:21 ` Kanoj Sarcar 2000-04-08 23:39 ` Andrea Arcangeli 0 siblings, 1 reply; 47+ messages in thread From: Kanoj Sarcar @ 2000-04-08 23:21 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm > > I was talking about the locking order issue you raised about the necessary > vmlist_*_lock I added in swapoff. > > What I meant is that in 2.2.x there was no need for the > vmlist_*_lock/page_cache_lock in swapoff, because we were relying on the big > kernel lock while playing with pagetables and vmas (same in swap_out()). > > In 2.3.x both swap_out and swapoff need to grab the tasklist_lock first > (as in 2.2.x) and then the vmlist_*_lock (otherwise the vma > browsing may happen during a vma list modification). > > Andrea > As I mentioned before, have you stress tested this to make sure grabbing read_lock(tasklist_lock), then spin_lock(vmlist_lock), is not deadlock prone? I _think_ you will run into problems ... and we can then stop discussing this. Kanoj -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 23:21 ` Kanoj Sarcar @ 2000-04-08 23:39 ` Andrea Arcangeli 2000-04-09 0:40 ` Kanoj Sarcar 0 siblings, 1 reply; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-08 23:39 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Sat, 8 Apr 2000, Kanoj Sarcar wrote: >As I mentioned before, have you stress tested this to make sure grabbing I have stress tested the whole thing (also a few minutes ago to check the latest patch) but it never locked up so we have to think about it. Could you explain why you think it's the inverse lock ordering? Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-08 23:39 ` Andrea Arcangeli @ 2000-04-09 0:40 ` Kanoj Sarcar 2000-04-10 8:55 ` andrea 0 siblings, 1 reply; 47+ messages in thread From: Kanoj Sarcar @ 2000-04-09 0:40 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm > > On Sat, 8 Apr 2000, Kanoj Sarcar wrote: > > >As I mentioned before, have you stress tested this to make sure grabbing > > I have stress tested the whole thing (also a few minutes ago to check the > latest patch) but it never locked up so we have to think about it. Okay, good. > > Could you explain why you think it's the inverse lock ordering? Let me see; if I can come up with something, I will let you know. If it survives stress testing, it probably is not inverse locking. Btw, I am looking at your patch with message id <Pine.LNX.4.21.0004081924010.317-100000@alpha.random>, which does not seem to be holding the vmlist/pagetable lock in the swapdelete code (at least at first blush). That was partly why I wanted to know what fixes are in your patch ... Note: I prefer being able to hold mmap_sem in the swapdelete path, which will provide protection against fork/exit races too. I will try to port over my version of the patch, and list the problems it fixes ... Kanoj -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-09 0:40 ` Kanoj Sarcar @ 2000-04-10 8:55 ` andrea 2000-04-11 2:45 ` Kanoj Sarcar 0 siblings, 1 reply; 47+ messages in thread From: andrea @ 2000-04-10 8:55 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Sat, 8 Apr 2000, Kanoj Sarcar wrote: >Btw, I am looking at your patch with message id ><Pine.LNX.4.21.0004081924010.317-100000@alpha.random>, which does not >seem to be holding the vmlist/pagetable lock in the swapdelete code (at >least at first blush). That was partly why I wanted to know what fixes >are in your patch ... The patch was against the earlier swap-entry patch that was also fixing the vma/pte locking in swapoff. All the three patches I posted were incremental. >Note: I prefer being able to hold mmap_sem in the swapdelete path, which >will provide protection against fork/exit races too. I will try to port With my approach swapoff is serialized w.r.t. fork/exit the same way as swap_out(). However I see a potential future problem in exit_mmap() that makes the entries unreachable before swapoff starts and does the swap_free() after swapoff has completed and the swap device has gone away (== too late). That's not an issue right now though, since both swapoff and do_exit() hold the big kernel lock, but it will become an issue eventually. Probably exit_mmap() should unlink and unmap the vmas bit by bit, using locking to unlink and leaving them visible while they are not yet released. That should get rid of that future race. About grabbing the mmap semaphore in unuse_process: we don't need to do that because we aren't changing vmas from swapoff. Swapoff only browses and changes pagetables, so it only needs the vmalist-access read-spinlock that keeps the vmas from going away, and the pagetable exclusive spinlock because we'll change the pagetables (and the latter is implied in the vmlist_access_lock, as we know from the vmlist_access_lock implementation). swap_out() can't grab the mmap_sem for obvious reasons, so if you only grab the mmap_sem you'll have to rely only on the big kernel lock to keep swap_out() from racing with your swapoff, right? That doesn't look like the right long term solution. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
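The exit_mmap() change Andrea sketches — keep each vma visible until its ptes are really gone, then unlink it under the lock — could look like this (a hypothetical sketch, not posted code; file cleanup and flushing elided):

        while (mm->mmap) {
                struct vm_area_struct *vma = mm->mmap;

                /* may sleep: the ptes (and their swap counts) are freed
                 * here while the vma is still reachable by swapoff */
                zap_page_range(mm, vma->vm_start,
                               vma->vm_end - vma->vm_start);

                vmlist_modify_lock(mm);
                mm->mmap = vma->vm_next;        /* only now unlink it */
                vmlist_modify_unlock(mm);
                kmem_cache_free(vm_area_cachep, vma);
        }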
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-10 8:55 ` andrea @ 2000-04-11 2:45 ` Kanoj Sarcar 2000-04-11 16:22 ` Andrea Arcangeli 0 siblings, 1 reply; 47+ messages in thread From: Kanoj Sarcar @ 2000-04-11 2:45 UTC (permalink / raw) To: andrea; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm > > On Sat, 8 Apr 2000, Kanoj Sarcar wrote: > > >Btw, I am looking at your patch with message id > ><Pine.LNX.4.21.0004081924010.317-100000@alpha.random>, which does not > >seem to be holding the vmlist/pagetable lock in the swapdelete code (at > >least at first blush). That was partly why I wanted to know what fixes > >are in your patch ... > > The patch was against the earlier swap-entry patch that was also fixing the > vma/pte locking in swapoff. All the three patches I posted were > incremental. > > >Note: I prefer being able to hold mmap_sem in the swapdelete path, which > >will provide protection against fork/exit races too. I will try to port > > With my approach swapoff is serialized w.r.t. fork/exit the same way as > swap_out(). However I see a potential future problem in exit_mmap() that While forking, a parent might copy a swap handle into the child, but we might entirely miss scanning the child because it is not on the process list (kernel_lock is not enough, forking code might sleep). Eg: in the body of dup_mmap, we go to sleep due to memory shortage, which kicks page stealing (i.e. highly asynchronous swapio). The same problem exists in exit_mmap. In this case, one of the routines inside exit_mmap() can very plausibly go to sleep. Eg: file_unmap. > makes the entries unreachable before swapoff starts and does the > swap_free() after swapoff has completed and the swap device has gone away (== > too late). That's not an issue right now though, since both swapoff and > do_exit() hold the big kernel lock, but it will become an issue > eventually. Probably exit_mmap() should unlink and unmap the vmas bit by > bit, using locking to unlink and leaving them visible while they are not yet > released. That should get rid of that future race. > > About grabbing the mmap semaphore in unuse_process: we don't need to do > that because we aren't changing vmas from swapoff. Swapoff only browses > and changes pagetables, so it only needs the vmalist-access read-spinlock > that keeps the vmas from going away, and the pagetable exclusive spinlock > because we'll change the pagetables (and the latter is implied in the > vmlist_access_lock, as we know from the vmlist_access_lock implementation). See above. > > swap_out() can't grab the mmap_sem for obvious reasons, so if you only Why not? Of course, not with tasklist_lock held (Hehehe, I am not that stupid :-)). But other mechanisms are possible. > grab the mmap_sem you'll have to rely only on the big kernel lock to keep > swap_out() from racing with your swapoff, right? That doesn't look like the > right long term solution. Actually, let me put out the patch; for different reasons, IMO, it is the right long term solution ... Kanoj > > Andrea > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux.eu.org/Linux-MM/ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-11 2:45 ` Kanoj Sarcar @ 2000-04-11 16:22 ` Andrea Arcangeli 2000-04-11 17:40 ` Rik van Riel 2000-04-11 18:26 ` Kanoj Sarcar 1 sibling, 2 replies; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-11 16:22 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm On Mon, 10 Apr 2000, Kanoj Sarcar wrote: >While forking, a parent might copy a swap handle into the child, but we That's a bug in fork. Simply have fork check whether the swap handle is SWAPOK before increasing the swap count. If it's SWAPOK then swap_duplicate() succeeds; otherwise do the swapin using the swap-cache-based locking, as swapin now does in my current tree (check whether the pte has changed before going on to increase the swap count, undo the swapcache insertion in that case, and serialize with swapoff and swap_out with the page_cache_lock). >The same problem exists in exit_mmap. In this case, one of the routines inside >exit_mmap() can very plausibly go to sleep. Eg: file_unmap. exit_mmap can sleep there. But it must not hide the mm, as said in an earlier email. It has to zap_page_range and then unlink the vmas all bit by bit, serializing with the vmlist_modify_lock. >> swap_out() can't grab the mmap_sem for obvious reasons, so if you only > >Why not? Of course, not with tasklist_lock held (Hehehe, I am not that >stupid :-)). But other mechanisms are possible. Lock recursion -> deadlock. userspace runs page fault down(&current->mm->mmap_sem); try_to_free_pages(); swap_out(); down(&current->mm->mmap_sem); <- you deadlocked We are serializing swap_out/do_wp_page with the page_cache_lock (in swapout the page_cache_lock is implied by the vmlist_access_lock). In the same way I'm serializing swapoff with swapin, using swap-cache-based locking and pagetable checks with the page_cache_lock acquired, and protecting swapoff with the vmlist_access_lock() that implies the page_cache_lock. Using the page_cache_lock and rechecking the page table looks like the right way to go to me. It has no design problems that end in lock recursion. >Actually, let me put out the patch; for different reasons, IMO, it is the >right long term solution ... The patch is welcome indeed. However relying on the mmap_sem looks like the wrong way to me. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
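The fork-side check Andrea proposes would sit in copy_page_range(), where the current code BUG()s when swap_duplicate() fails (see the memory.c hunk in his patch earlier in the thread). A hypothetical sketch of the shape, not code from any posted patch:

        if (!pte_present(pte)) {
                swp_entry_t entry = pte_to_swp_entry(pte);

                if (swap_duplicate(entry)) {
                        /* common case: the handle is still SWAP_MAP-valid */
                        set_pte(dst_pte, pte);
                        goto cont_copy_pte_range;
                }
                /*
                 * The entry is being eaten by swapoff: fall back to a
                 * swap-cache-locked swapin (as in do_swap_page) and map
                 * the resulting page instead of copying a dead handle.
                 */
        }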
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-11 16:22 ` Andrea Arcangeli @ 2000-04-11 17:40 ` Rik van Riel 2000-04-11 18:20 ` Kanoj Sarcar 2000-04-21 18:23 ` Andrea Arcangeli 1 sibling, 2 replies; 47+ messages in thread From: Rik van Riel @ 2000-04-11 17:40 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Kanoj Sarcar, Ben LaHaise, Linus Torvalds, linux-mm On Tue, 11 Apr 2000, Andrea Arcangeli wrote: > On Mon, 10 Apr 2000, Kanoj Sarcar wrote: > > >While forking, a parent might copy a swap handle into the child, but we > > That's a bug in fork. Simply have fork check whether the swap handle > is SWAPOK before increasing the swap count. If it's > SWAPOK then swap_duplicate() succeeds; "it was hard to write, it should be hard to maintain" Relying on pieces of magic like this, spread out all over the kernel source, will make the code more fragile and hell to maintain. Unless somebody writes the documentation for all of this, of course... regards, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-11 17:40 ` Rik van Riel @ 2000-04-11 18:20 ` Kanoj Sarcar 0 siblings, 0 replies; 47+ messages in thread From: Kanoj Sarcar @ 2000-04-11 18:20 UTC (permalink / raw) To: riel; +Cc: Andrea Arcangeli, Ben LaHaise, Linus Torvalds, linux-mm > > On Tue, 11 Apr 2000, Andrea Arcangeli wrote: > > On Mon, 10 Apr 2000, Kanoj Sarcar wrote: > > > > >While forking, a parent might copy a swap handle into the child, but we > > > > That's a bug in fork. Simply have fork check whether the swap handle > > is SWAPOK before increasing the swap count. If it's > > SWAPOK then swap_duplicate() succeeds; > > "it was hard to write, it should be hard to maintain" > > Relying on pieces of magic like this, spread out all over > the kernel source, will make the code more fragile and hell > to maintain. No, it's not magic, since it still doesn't work. Watch out for mail from me as to why ... > > Unless somebody writes the documentation for all of this, > of course... > Precisely, that's why I started off Documentation/vm/locking. It would be _really_ nice if folks doing locking (or for that matter, any other delicate) work were to update these files ... Kanoj > regards, > > Rik > -- > The Internet is not a network of computers. It is a network > of people. That is its real strength. > > Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies > http://www.conectiva.com/ http://www.surriel.com/ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-11 17:40 ` Rik van Riel 2000-04-11 18:20 ` Kanoj Sarcar @ 2000-04-21 18:23 ` Andrea Arcangeli 2000-04-21 21:00 ` Rik van Riel 1 sibling, 1 reply; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-21 18:23 UTC (permalink / raw) To: riel; +Cc: Kanoj Sarcar, Ben LaHaise, Linus Torvalds, linux-mm The swap-entry fixes cleared by the swap locking changes are here: ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.3/2.3.99-pre6-pre3/swap-entry-3 Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-21 18:23 ` Andrea Arcangeli @ 2000-04-21 21:00 ` Rik van Riel 2000-04-22 1:12 ` Andrea Arcangeli 0 siblings, 1 reply; 47+ messages in thread From: Rik van Riel @ 2000-04-21 21:00 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Kanoj Sarcar, Ben LaHaise, Linus Torvalds, linux-mm On Fri, 21 Apr 2000, Andrea Arcangeli wrote: > The swap-entry fixes cleared by the swap locking changes are here: > > ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.3/2.3.99-pre6-pre3/swap-entry-3 The patch looks "obviously correct", but it would be nice if you could use the PageClearSwapCache and related macros for changing the bitflags. Things like new_page->flags &= ~(1UL << PG_swap_entry); just make the code less readable than it has to be. (and yes, loads of people already run away screaming when they look at the memory management code; I really think we should make maintainability a higher priority target for the code) regards, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-21 21:00 ` Rik van Riel @ 2000-04-22 1:12 ` Andrea Arcangeli 2000-04-22 1:51 ` Linus Torvalds 0 siblings, 1 reply; 47+ messages in thread From: Andrea Arcangeli @ 2000-04-22 1:12 UTC (permalink / raw) To: riel; +Cc: Kanoj Sarcar, Ben LaHaise, Linus Torvalds, linux-mm On Fri, 21 Apr 2000, Rik van Riel wrote: >you could use the PageClearSwapCache and related macros for >changing the bitflags. BTW, thinking about it more, I think the clear_bit in shrink_mmap should really be atomic (lookup_swap_cache can run from under it and try to lock the page, playing with the page->flags while we're clearing the swap_entry bitflag). The other places don't need to be atomic as far as I can tell, so (as just said) I'd prefer not to add unscalable SMP locking. Suggest a name for a new macro that doesn't use asm and I can use it of course. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
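The distinction at stake, in two lines: clear_bit() is a locked read-modify-write, while the plain C expression shrink_mmap uses today is an unlocked one that can swallow a concurrent flag update on the same word:

        /* atomic: safe against a parallel test_and_set_bit(PG_locked, ...) */
        clear_bit(PG_swap_entry, &page->flags);

        /* non-atomic: the load/modify/store can lose a bit another CPU
         * sets between our load and our store */
        page->flags &= ~(1UL << PG_swap_entry);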
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-22 1:12 ` Andrea Arcangeli @ 2000-04-22 1:51 ` Linus Torvalds 2000-04-22 18:29 ` Rik van Riel 0 siblings, 1 reply; 47+ messages in thread From: Linus Torvalds @ 2000-04-22 1:51 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: riel, Kanoj Sarcar, Ben LaHaise, linux-mm On Sat, 22 Apr 2000, Andrea Arcangeli wrote: > > On Fri, 21 Apr 2000, Rik van Riel wrote: > > >you could use the PageClearSwapCache and related macros for > >changing the bitflags. > > BTW, thinking about it more, I think the clear_bit in shrink_mmap should really be > atomic (lookup_swap_cache can run from under it and try to lock the page, > playing with the page->flags while we're clearing the swap_entry bitflag). Actually, I was toying with the much simpler rule: - "PG_locked" is always atomic - all other flags can only be tested/changed if PG_locked holds This simple rule would allow for not using the (slow) atomic operations, because the other bits are always protected by the one-bit lock. PG_accessed is probably a special case, as it's just a hint bit - so it might be ok to say something like "you can change PG_accessed without holding the PG_locked bit, but then you have to use an atomic operation". That way changes to PG_accessed might be lost (by other non-atomic updaters that hold the PG_lock), but at least changing PG_accessed without holding PG_lock cannot cause other bits to become lost. Does the above sound sane? It might be reasonably easy to verify that the rule holds (ie make all the macros check that PG_locked is set, and hand-check the few places where we access page->flags directly). The rules sound safe to me, and mean that most of the updates could be non-atomic. Comments? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
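One way to make the rule mechanically checkable (a sketch, not from any posted patch): keep the lock bit on locked bitops, and have every other flag macro assert the page lock before doing a plain RMW.

/* PG_locked itself stays a locked bitop */
#define TryLockPage(page)       test_and_set_bit(PG_locked, &(page)->flags)

/* every other flag asserts the page lock and may then use a
 * plain, non-atomic read-modify-write */
#define SetPageDirty(page)                                      \
        do {                                                    \
                if (!PageLocked(page))                          \
                        BUG();                                  \
                (page)->flags |= (1UL << PG_dirty);             \
        } while (0)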
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-22 1:51 ` Linus Torvalds @ 2000-04-22 18:29 ` Rik van Riel 2000-04-22 19:58 ` Linus Torvalds 0 siblings, 1 reply; 47+ messages in thread From: Rik van Riel @ 2000-04-22 18:29 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrea Arcangeli, Kanoj Sarcar, Ben LaHaise, linux-mm On Fri, 21 Apr 2000, Linus Torvalds wrote: > On Sat, 22 Apr 2000, Andrea Arcangeli wrote: > > On Fri, 21 Apr 2000, Rik van Riel wrote: > > > > >you could use the PageClearSwapCache and related macros for > > >changing the bitflags. > > > > BTW, thinking about it more, I think the clear_bit in shrink_mmap should really be > > atomic (lookup_swap_cache can run from under it and try to lock the page, > > playing with the page->flags while we're clearing the swap_entry bitflag). > > Actually, I was toying with the much simpler rule: > - "PG_locked" is always atomic > - all other flags can only be tested/changed if PG_locked holds > > This simple rule would allow for not using the (slow) atomic operations, > because the other bits are always protected by the one-bit lock. It all depends on the source code. If we're holding the page lock anyway in places where we play with the other flags, that's probably the best strategy, but if we're updating the page flags in a lot of places without holding the page lock, then it's probably better to just do everything with the current atomic bitops. Btw, here's a result from 2.3.99-pre6-3 ... lines 3 and 4 are extremely suspect... [riel@duckman mm]$ grep 'page->flags' *.c filemap.c: if (test_and_clear_bit(PG_referenced, &page->flags)) filemap.c: set_bit(PG_referenced, &page->flags); filemap.c: flags = page->flags & ~((1 << PG_uptodate) | (1 << PG_error) | (1 << PG_dirty)); filemap.c: page->flags = flags | (1 << PG_locked) | (1 << PG_referenced); page_io.c: set_bit(PG_decr_after, &page->flags); vmscan.c: set_bit(PG_referenced, &page->flags); Here's the suspect piece of code (filemap.c::__add_to_page_cache()): flags = page->flags & ~((1 << PG_uptodate) | (1 << PG_error) | (1 << PG_dirty)); page->flags = flags | (1 << PG_locked) | (1 << PG_referenced); get_page(page); So here we play with a number of page flags _before_ taking the page or locking it. It's probably safe because of the circumstances under which we are called, but somehow it just doesn't look clean to me ;) regards, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-22 18:29 ` Rik van Riel @ 2000-04-22 19:58 ` Linus Torvalds 0 siblings, 0 replies; 47+ messages in thread From: Linus Torvalds @ 2000-04-22 19:58 UTC (permalink / raw) To: riel; +Cc: Andrea Arcangeli, Kanoj Sarcar, Ben LaHaise, linux-mm On Sat, 22 Apr 2000, Rik van Riel wrote: > > It all depends on the source code. If we're holding the page > lock anyway in places where we play with the other flags, that's > probably the best strategy, but if we're updating the page flags > in a lot of places without holding the page lock, then it's > probably better to just do everything with the current atomic > bitops. I suspect that we hold the page lock already. And it just seems wrong that we could ever alter some of the flags (like dirty or uptodate) without holding the page lock. So the logic is more than just "bitfield coherency": it's also a higher-level coherency guarantee. > Btw, here's a result from 2.3.99-pre6-3 ... lines 3 and > 4 are extremely suspect... > > [riel@duckman mm]$ grep 'page->flags' *.c > filemap.c: if (test_and_clear_bit(PG_referenced, &page->flags)) > filemap.c: set_bit(PG_referenced, &page->flags); > filemap.c: flags = page->flags & ~((1 << PG_uptodate) | (1 << PG_error) | (1 << PG_dirty)); > filemap.c: page->flags = flags | (1 << PG_locked) | (1 << PG_referenced); Both 3 and 4 are from the same sequence: it's the initialization code for the page. They are run before the page is added to the page cache, so atomicity is not even an issue, because nobody can get at the page before it is on any of the page lists. Think of it as an anonymous page that is just about to get truly instantiated into the page cache. It's also one of the few times where we truly touch many bits. In most other cases we touch just one or two at a time. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels 2000-04-11 16:22 ` Andrea Arcangeli 2000-04-11 17:40 ` Rik van Riel @ 2000-04-11 18:26 ` Kanoj Sarcar 1 sibling, 0 replies; 47+ messages in thread From: Kanoj Sarcar @ 2000-04-11 18:26 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm > > On Mon, 10 Apr 2000, Kanoj Sarcar wrote: > > >While forking, a parent might copy a swap handle into the child, but we > > That's a bug in fork. Simply have fork check whether the swap handle is SWAPOK > before increasing the swap count. If it's SWAPOK then > swap_duplicate() succeeds; otherwise do the swapin using the swap-cache-based > locking, as swapin now does in my current tree (check whether the pte has changed > before going on to increase the swap count, undo the swapcache insertion in > that case, and serialize with swapoff and swap_out with the page_cache_lock). I still can't see how this will help. You still might end up having an unreachable mm, since the task is not on the active list, and swapoff unconditionally decrements the swap count after printing a warning message. Later on, the process also tries decrementing the count, or accessing the swap handle ... > > >The same problem exists in exit_mmap. In this case, one of the routines inside > >exit_mmap() can very plausibly go to sleep. Eg: file_unmap. > > exit_mmap can sleep there. But it must not hide the mm, as said in > an earlier email. It has to zap_page_range and then unlink the vmas all bit > by bit, serializing with the vmlist_modify_lock. Sure, write the patch. It seems to me just getting the mmap_sem once should be enough, instead of multiple acquire/releases of the vmlist_modify_lock, if for no other reason than performance. But I will agree to any solution that fixes races. Correctness first, then performance ... > > >> swap_out() can't grab the mmap_sem for obvious reasons, so if you only > > > >Why not? Of course, not with tasklist_lock held (Hehehe, I am not that > >stupid :-)). But other mechanisms are possible. > > Lock recursion -> deadlock. > > userspace runs > page fault > down(&current->mm->mmap_sem); > try_to_free_pages(); > swap_out(); > down(&current->mm->mmap_sem); <- you deadlocked I am not sure what you mean. Maybe we should just stop talking about this till I get a chance to post the patch; then you can show me the holes in it. I am not sure if I have been able to communicate clearly what I want to do, and how ... Kanoj > > We are serializing swap_out/do_wp_page with the page_cache_lock (in > swapout the page_cache_lock is implied by the vmlist_access_lock). > > In the same way I'm serializing swapoff with swapin, using swap-cache-based > locking and pagetable checks with the page_cache_lock acquired, and > protecting swapoff with the vmlist_access_lock() that implies the > page_cache_lock. > > Using the page_cache_lock and rechecking the page table looks like the right way to > go to me. It has no design problems that end in lock recursion. > > >Actually, let me put out the patch; for different reasons, IMO, it is the > >right long term solution ... > > The patch is welcome indeed. However relying on the mmap_sem looks like the > wrong way to me. > > Andrea > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
  2000-04-08 21:47           ` Kanoj Sarcar
  2000-04-08 23:10             ` Andrea Arcangeli
@ 2000-04-10 19:10             ` Stephen C. Tweedie
  1 sibling, 0 replies; 47+ messages in thread
From: Stephen C. Tweedie @ 2000-04-10 19:10 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: Andrea Arcangeli, Ben LaHaise, riel, Linus Torvalds, linux-mm,
	Stephen Tweedie

Hi,

On Sat, Apr 08, 2000 at 02:47:29PM -0700, Kanoj Sarcar wrote:
>
> You have answered your own question in a later email. I quote you:
> "Are you using read_swap_cache from any
> swapin event? The problem is swapin can't use read_swap_cache because with
> read_swap_cache we would never know if we're doing I/O on an inactive swap
> entry"

Right. The way the swap synchronisation always worked was that we must
have a swap cache entry before _any_ IO, so the page lock bit on that
swap cache page could also serve as an IO lock on the swap entry.
(Actually, it didn't _always_ work that way, but that's the basic
mechanism with the current swap cache.)

That relied on the swapper being able to do an atomic operation to
search for a page cache page, and to create a new page and lock it if
it wasn't already there. That was made atomic only by use of the big
lock. If you don't have all page cache activity using that lock, then
yes, you'll need the page cache lock while you set all of this up.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see:
http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 47+ messages in thread
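The atomic "find it, or create it already locked" operation Stephen
describes can be sketched in userspace roughly as follows; the hash
table, struct, and helper names are all invented for illustration and
are not the 2.3 swap cache code. The two key points are that the
allocation (which may sleep) happens with the lock dropped and the
lookup redone afterwards, and that a new page is locked before it
becomes visible, so its lock can double as the IO lock on the swap
entry:

	#include <pthread.h>
	#include <stdlib.h>

	#define TABLE_SIZE 64

	struct spage {
		unsigned long entry;
		pthread_mutex_t lock;  /* stands in for PG_locked, the IO lock */
		struct spage *next;
		int refcount;
	};

	static pthread_mutex_t pagecache_lock = PTHREAD_MUTEX_INITIALIZER;
	static struct spage *swap_hash[TABLE_SIZE];

	static struct spage *lookup_nolock(unsigned long entry)
	{
		struct spage *p = swap_hash[entry % TABLE_SIZE];
		while (p && p->entry != entry)
			p = p->next;
		return p;
	}

	/* Atomically find an existing swap-cache page, or insert a new one
	 * that is already locked. */
	static struct spage *find_or_create_locked(unsigned long entry)
	{
		struct spage *p, *new = NULL;

	repeat:
		pthread_mutex_lock(&pagecache_lock);
		p = lookup_nolock(entry);
		if (p) {
			p->refcount++;
			pthread_mutex_unlock(&pagecache_lock);
			free(new); /* lost the race: someone inserted first */
			return p;  /* caller waits on p->lock if IO is in flight */
		}
		if (!new) {
			/* Allocation may sleep, so do it without the lock
			 * held, then redo the lookup from scratch. */
			pthread_mutex_unlock(&pagecache_lock);
			new = calloc(1, sizeof(*new));
			if (!new)
				return NULL;
			pthread_mutex_init(&new->lock, NULL);
			goto repeat;
		}
		new->entry = entry;
		new->refcount = 1;
		pthread_mutex_lock(&new->lock); /* locked before it is visible */
		new->next = swap_hash[entry % TABLE_SIZE];
		swap_hash[entry % TABLE_SIZE] = new;
		pthread_mutex_unlock(&pagecache_lock);
		return new; /* caller starts the IO and unlocks on completion */
	}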
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
  2000-04-07 20:12     ` Kanoj Sarcar
  2000-04-07 23:26       ` Andrea Arcangeli
  2000-04-07 23:54       ` Andrea Arcangeli
@ 2000-04-08  0:04       ` Andrea Arcangeli
  2 siblings, 0 replies; 47+ messages in thread
From: Andrea Arcangeli @ 2000-04-08 0:04 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Ben LaHaise, riel, Linus Torvalds, linux-mm

On Fri, 7 Apr 2000, Kanoj Sarcar wrote:

>are unneeded when you consider the kernel_lock is already held in most
>of those paths.[..]

Good point. However, I'm not thinking, and I'm not going to think, with
the big kernel lock in mind in the paths where we hold the big kernel
lock only incidentally, because somebody _else_ still needs it (as with
acquire_swap_page/get_swap_page/swap_free).

The setting of SWP_USED in swapoff has to be done inside the critical
section protected by the swaplist lock. That was at least a conceptual
bug, even if it couldn't trigger, since swap_out and swapoff both hold
the big kernel lock.

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see:
http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 47+ messages in thread
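The SWP_USED point is the usual rule that a flag tested under a lock
must also be written under it. A schematic sketch, with made-up
structure and function names, of what "set it inside the swaplist-lock
critical section" means:

	#include <pthread.h>

	#define SWP_USED 1

	struct swap_info_sketch {
		int flags;
		struct swap_info_sketch *next;
	};

	static pthread_mutex_t swaplist_lock = PTHREAD_MUTEX_INITIALIZER;
	static struct swap_info_sketch *swap_list_head;

	/* The failure path of swapoff re-activates the device. Doing the
	 * flag store inside the locked section means a reader that tests
	 * SWP_USED under the same lock can never see the flag without the
	 * relink, or the relink without the flag. */
	static void reactivate_swap_device(struct swap_info_sketch *p)
	{
		pthread_mutex_lock(&swaplist_lock);
		p->flags = SWP_USED;      /* flag and relink: one atomic step */
		p->next = swap_list_head; /* put the device back on the list */
		swap_list_head = p;
		pthread_mutex_unlock(&swaplist_lock);
	}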
[parent not found: <yttem7xstk2.fsf@vexeta.dc.fi.udc.es>]
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
       [not found] <yttem7xstk2.fsf@vexeta.dc.fi.udc.es>
@ 2000-04-23  0:52 ` Andrea Arcangeli
  0 siblings, 0 replies; 47+ messages in thread
From: Andrea Arcangeli @ 2000-04-23 0:52 UTC (permalink / raw)
  To: Juan J. Quintela
  Cc: Linus Torvalds, riel, Kanoj Sarcar, Ben LaHaise, linux-mm

On 22 Apr 2000, Juan J. Quintela wrote:

>swap. Without this line, my systems always BUG running this
>application.
>
>I tested also to put something like:
>
>	if (!PageLocked(page))
>		BUG();
>
>but with that the kernel doesn't boot. It BUGs.

That's normal: a newly allocated page is not locked (you should BUG if
the page was locked instead). Before hashing the page, the page is not
visible and we can do whatever we want with it. It's also not locked,
since it doesn't need to be locked in the first place. Locking a page
makes sense only as far as the page can be shared by more than one
user. The point of setting the locked bit there is to give visibility
to the page only _after_ we set the page lock on it, to make sure
nobody will play with it before we have finished. It doesn't matter how
we set the page locked, as long as we do it before hashing the page.

The thing I'm not 100% sure about from Linus's previous email (not
about add_to_page_cache, but about not needing the lock on the bus to
change the other bitflags while we're holding the page lock) is whether
we can be sure the other atomic trylock can't run from under us and
invalidate our nonatomic clearbit. If the other atomic trylock does
keep us away while it's running, I fully agree we don't need the lock
on the bus to change the other bitflags while we're holding the page
lock.

I want to try a little userspace simulation to make sure we're safe
that way. I want to make sure we don't also need the lock on the bus to
avoid messing with the other trylock that tries to run from under us.

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see:
http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 47+ messages in thread
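The userspace simulation Andrea has in mind might look something like
the sketch below (the bit numbers, names, and the use of GCC's __sync
builtins are this sketch's choices, not anything from the thread). One
thread performs locked set/clear operations on one bit while the main
loop does a plain, non-atomic read-modify-write of another bit in the
same word. On a multiprocessor the plain write-back can erase a
concurrent locked update; the sketch demonstrates that generic
lost-update hazard, while whether the page-lock convention actually
excludes the concurrent trylock is the narrower question the thread
leaves open:

	#include <pthread.h>
	#include <stdio.h>

	#define PG_uptodate 1	/* changed with a plain, non-atomic RMW  */
	#define PG_arch     7	/* changed only with locked (atomic) ops */

	static volatile unsigned long flags;
	static volatile int stop;
	static long lost;

	static void *atomic_bit_thread(void *arg)
	{
		(void)arg;
		while (!stop) {
			__sync_fetch_and_or(&flags, 1UL << PG_arch);
			if (!(flags & (1UL << PG_arch)))
				lost++; /* a plain RMW wrote back a stale word */
			__sync_fetch_and_and(&flags, ~(1UL << PG_arch));
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t;
		long i;

		pthread_create(&t, NULL, atomic_bit_thread, NULL);
		for (i = 0; i < 50 * 1000 * 1000; i++) {
			unsigned long v = flags;          /* plain load ...  */
			flags = v ^ (1UL << PG_uptodate); /* ... plain store */
		}
		stop = 1;
		pthread_join(t, NULL);
		printf("locked updates lost to the plain RMW: %ld\n", lost);
		return 0;
	}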
[parent not found: <yttk8ho26s8.fsf@vexeta.dc.fi.udc.es>]
* Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels
       [not found] <yttk8ho26s8.fsf@vexeta.dc.fi.udc.es>
@ 2000-04-23 16:07 ` Andrea Arcangeli
  0 siblings, 0 replies; 47+ messages in thread
From: Andrea Arcangeli @ 2000-04-23 16:07 UTC (permalink / raw)
  To: Juan J. Quintela
  Cc: Linus Torvalds, riel, Kanoj Sarcar, Ben LaHaise, linux-mm

On 23 Apr 2000, Juan J. Quintela wrote:

>[..] The page is
>never locked when we enter there. [..]

That's how it's designed to work. Please check my last email to
linux-mm for an explanation of why it's the correct behaviour.

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see:
http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 47+ messages in thread
end of thread
Thread overview: 47+ messages
2000-04-03 22:22 PG_swap_entry bug in recent kernels Ben LaHaise
2000-04-04 15:06 ` Andrea Arcangeli
2000-04-04 15:46 ` Rik van Riel
2000-04-04 16:50 ` Andrea Arcangeli
2000-04-04 17:06 ` Ben LaHaise
2000-04-04 18:03 ` Andrea Arcangeli
2000-04-06 22:11 ` [patch] take 2 " Ben LaHaise
2000-04-07 10:45 ` Andrea Arcangeli
2000-04-07 11:29 ` Rik van Riel
2000-04-07 12:00 ` Andrea Arcangeli
2000-04-07 12:54 ` Rik van Riel
2000-04-07 13:14 ` Andrea Arcangeli
2000-04-07 20:12 ` Kanoj Sarcar
2000-04-07 23:26 ` Andrea Arcangeli
2000-04-08 0:11 ` Kanoj Sarcar
2000-04-08 0:37 ` Kanoj Sarcar
2000-04-08 13:20 ` Andrea Arcangeli
2000-04-08 21:39 ` Kanoj Sarcar
2000-04-08 23:02 ` Andrea Arcangeli
2000-04-08 23:18 ` Kanoj Sarcar
2000-04-08 23:58 ` Andrea Arcangeli
2000-04-08 13:30 ` Andrea Arcangeli
2000-04-08 17:39 ` Andrea Arcangeli
2000-04-07 23:54 ` Andrea Arcangeli
2000-04-08 0:15 ` Kanoj Sarcar
2000-04-08 13:14 ` Andrea Arcangeli
2000-04-08 21:47 ` Kanoj Sarcar
2000-04-08 23:10 ` Andrea Arcangeli
2000-04-08 23:21 ` Kanoj Sarcar
2000-04-08 23:39 ` Andrea Arcangeli
2000-04-09 0:40 ` Kanoj Sarcar
2000-04-10 8:55 ` andrea
2000-04-11 2:45 ` Kanoj Sarcar
2000-04-11 16:22 ` Andrea Arcangeli
2000-04-11 17:40 ` Rik van Riel
2000-04-11 18:20 ` Kanoj Sarcar
2000-04-21 18:23 ` Andrea Arcangeli
2000-04-21 21:00 ` Rik van Riel
2000-04-22 1:12 ` Andrea Arcangeli
2000-04-22 1:51 ` Linus Torvalds
2000-04-22 18:29 ` Rik van Riel
2000-04-22 19:58 ` Linus Torvalds
2000-04-11 18:26 ` Kanoj Sarcar
2000-04-10 19:10 ` Stephen C. Tweedie
2000-04-08 0:04 ` Andrea Arcangeli
[not found] <yttem7xstk2.fsf@vexeta.dc.fi.udc.es>
2000-04-23 0:52 ` Andrea Arcangeli
[not found] <yttk8ho26s8.fsf@vexeta.dc.fi.udc.es>
2000-04-23 16:07 ` Andrea Arcangeli