filecache/swapcache questions

linux-mm.kvack.org archive mirror

* filecache/swapcache questions
@ 1999-06-15  7:16 Kanoj Sarcar
  1999-06-15  7:32 ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Kanoj Sarcar @ 1999-06-15  7:16 UTC (permalink / raw)
  To: linux-mm

Hi all.

I am trying to understand some of the swapcache/filecache code. I
have a few questions (I am sure I will have more soon), which I am
jotting down here in the hope that someone can answer them. It is
quite possible that I am reading the code wrong ...

Q1. Is it really needed to put all the swap pages in the swapper_inode
i_pages? 

Q2. shrink_mmap has code that reads:

                if (PageSwapCache(page)) {
                        if (referenced && swap_count(page->offset) != 1)
                                continue;
                        delete_from_swap_cache(page);
                        return 1;
                }

How will it be possible for a page to be in the swapcache, for its
reference count to be 1 (which has been checked just before), and
for its swap_count(page->offset) to also be 1? I can see this being
possible only if an unmap/exit path might lazily leave an anonymous
page in the swap cache, but I don't believe that happens. Ipc/shm 
pages are not candidates here, since they temporarily raise the page
reference count while swapping.

Q3. Is there some mechanism to detect io errors for swap cache pages
similar to what the PG_uptodate bit provides for filemap pages?

Thanks.

Kanoj
kanoj@engr.sgi.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/


* Re: filecache/swapcache questions
  1999-06-15  7:16 filecache/swapcache questions Kanoj Sarcar
@ 1999-06-15  7:32 ` Rik van Riel
  1999-06-15 15:51   ` Kanoj Sarcar
  1999-06-17 23:33   ` Stephen C. Tweedie
  0 siblings, 2 replies; 23+ messages in thread
From: Rik van Riel @ 1999-06-15  7:32 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: linux-mm

On Tue, 15 Jun 1999, Kanoj Sarcar wrote:

> Q1. Is it really needed to put all the swap pages in the swapper_inode
> i_pages?

Yes, see below.

> How will it be possible for a page to be in the swapcache, for its
> reference count to be 1 (which has been checked just before), and for
> its swap_count(page->offset) to also be 1? I can see this being
> possible only if an unmap/exit path might lazily leave an anonymous
> page in the swap cache, but I don't believe that happens.

It does happen. We use a 'two-stage' reclamation process instead
of page aging. It seems to work wonderfully -- nice page aging
properties without the overhead. Plus, it automatically balances
swap and cache memory since the same reclamation routine passes
over both types of pages.


Rik -- Open Source: you deserve to be in control of your data.
+-------------------------------------------------------------------+
| Le Reseau netwerksystemen BV:               http://www.reseau.nl/ |
| Linux Memory Management site:   http://www.linux.eu.org/Linux-MM/ |
| Nederlandse Linux documentatie:          http://www.nl.linux.org/ |
+-------------------------------------------------------------------+


* Re: filecache/swapcache questions
  1999-06-15  7:32 ` Rik van Riel
@ 1999-06-15 15:51   ` Kanoj Sarcar
  1999-06-15 20:24     ` Rik van Riel
  1999-06-16 20:37     ` Andrea Arcangeli
  1999-06-17 23:33   ` Stephen C. Tweedie
  1 sibling, 2 replies; 23+ messages in thread
From: Kanoj Sarcar @ 1999-06-15 15:51 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, sct

> 
> On Tue, 15 Jun 1999, Kanoj Sarcar wrote:

Hmm, I am either misunderstanding your explanation, or I couldn't
make the crux of my questions clear in the first posting.

> 
> > Q1. Is it really needed to put all the swap pages in the swapper_inode
> > i_pages?
> 
> Yes, see below.

I understand that it is beneficial for performance reasons to have
a list of swapped pages which are clean wrt their disk copies in 
the swapcache, which is implemented as a file cache on swapper_inode.
What I am trying to find out is if it is enough to put these pages
in the hash queue for swapper_inode, without really also putting
them in the inode queue for swapper_inode. It's not like we ever 
"truncate" swapper_inode, that we will need to go thru its i_pages
list ...

> 
> > How will it be possible for a page to be in the swapcache, for its
> > reference count to be 1 (which has been checked just before), and for
> > its swap_count(page->offset) to also be 1? I can see this being
> > possible only if an unmap/exit path might lazily leave an anonymous
> > page in the swap cache, but I don't believe that happens.
> 
> It does happen. We use a 'two-stage' reclamation process instead
> of page aging. It seems to work wonderfully -- nice page aging
> properties without the overhead. Plus, it automatically balances
> swap and cache memory since the same reclamation routine passes
> over both types of pages.
>

I still can't see how this can happen. Note that try_to_swap_out
either does a get_swap_page/swap_duplicate on the swaphandle, which
gets the swap_count up to 2, or if it sees a page already in the
swapcache, it just does a swap_duplicate. Either way, if the only 
reference on the physical page is from the swapcache, there will be 
at least one more reference on the swap page other than due to the 
swapcache. What am I missing?

Thanks.

Kanoj
kanoj@engr.sgi.com

PS: Q4: who uses rw_swap_page_nolock, and what is shmfs? Note that
rw_swap_page_nolock is the only caller that passes in non PageSwapCache
pages into rw_swap_page_base(), which otherwise could assume that
all pages passed into it are PageSwapCache, which would eliminate
the need for a separate PG_swap_unlock_after bit.

* Re: filecache/swapcache questions
  1999-06-15 15:51   ` Kanoj Sarcar
@ 1999-06-15 20:24     ` Rik van Riel
  1999-06-15 21:02       ` Kanoj Sarcar
  1999-06-16 20:37     ` Andrea Arcangeli
  1 sibling, 1 reply; 23+ messages in thread
From: Rik van Riel @ 1999-06-15 20:24 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: linux-mm, sct

On Tue, 15 Jun 1999, Kanoj Sarcar wrote:

> I still can't see how this can happen. Note that try_to_swap_out
> either does a get_swap_page/swap_duplicate on the swaphandle,
> which gets the swap_count up to 2, or if it sees a page already in
> the swapcache, it just does a swap_duplicate. Either way, if the
> only reference on the physical page is from the swapcache, there
> will be at least one more reference on the swap page other than
> due to the swapcache. What am I missing?

When the swap I/O (if needed) finishes, the page count is
decreased by one.

Rik -- Open Source: you deserve to be in control of your data.
+-------------------------------------------------------------------+
| Le Reseau netwerksystemen BV:               http://www.reseau.nl/ |
| Linux Memory Management site:   http://www.linux.eu.org/Linux-MM/ |
| Nederlandse Linux documentatie:          http://www.nl.linux.org/ |
+-------------------------------------------------------------------+


* Re: filecache/swapcache questions
  1999-06-15 20:24     ` Rik van Riel
@ 1999-06-15 21:02       ` Kanoj Sarcar
  0 siblings, 0 replies; 23+ messages in thread
From: Kanoj Sarcar @ 1999-06-15 21:02 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, sct

> 
> On Tue, 15 Jun 1999, Kanoj Sarcar wrote:
> 
> > I still can't see how this can happen. Note that try_to_swap_out
> > either does a get_swap_page/swap_duplicate on the swaphandle,
> > which gets the swap_count up to 2, or if it sees a page already in
> > the swapcache, it just does a swap_duplicate. Either way, if the
> > only reference on the physical page is from the swapcache, there
> > will be at least one more reference on the swap page other than
> > due to the swapcache. What am I missing?
> 
> When the swap I/O (if needed) finishes, the page count is
> decreased by one.
>

Never mind, I was being blind before. This is why shrink_mmap 
has code that reads:

                if (PageSwapCache(page)) {
                        if (referenced && swap_count(page->offset) != 1)
                                continue;
                        delete_from_swap_cache(page);
                        return 1;
                }
 
Say a process is just about to execute exit()/munmap(), and kswapd 
steals a page from it, updating the pte with the swaphandle.
zap_pte_range -> free_pte will just free the swaphandle, possibly
leaving the page with a refcount of 1 (from the swapcache) and
a swappage count of 1 (from the swapcache again). shrink_mmap
recognizes the page/swaphandle will not be used by anyone and
frees these up.
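
The exit()/munmap() sequence above can be sketched as a toy model in C.
This is only an illustration, not kernel code: struct toy_page and the
toy_* functions are hypothetical names, and the bookkeeping is reduced to
the two counters discussed in this thread (the page reference count and
the swap count of the swap handle).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: one physical page and the swap entry that backs it. */
struct toy_page {
    int  page_count;    /* references to the physical page */
    int  swap_count;    /* references to the swap entry */
    bool in_swap_cache; /* stands in for PageSwapCache(page) */
    bool referenced;    /* stands in for the "referenced" flag */
};

/* kswapd steals the page: the pte now holds the swap handle. */
void toy_swap_out(struct toy_page *p)
{
    assert(p->page_count == 1 && !p->in_swap_cache);
    p->in_swap_cache = true;
    p->page_count += 1; /* swap cache takes a page reference */
    p->swap_count = 2;  /* one for the swap cache, one for the pte */
    p->page_count -= 1; /* the process no longer maps the page */
}

/* exit()/munmap(): the swap handle held by the pte is freed. */
void toy_free_pte(struct toy_page *p)
{
    assert(p->swap_count > 0);
    p->swap_count -= 1;
}

/* Mirrors the quoted test; true means "delete from swap cache". */
bool toy_shrink_would_free(const struct toy_page *p)
{
    if (p->page_count != 1 || !p->in_swap_cache)
        return false;
    if (p->referenced && p->swap_count != 1)
        return false; /* the "continue" branch */
    return true;
}
```

In this model a referenced page is skipped while the pte still holds the
swap handle (swap count 2), and becomes freeable only after
toy_free_pte() drops the count to 1.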

Thanks for the pointers, Rik.

Kanoj

* Re: filecache/swapcache questions
  1999-06-15 15:51   ` Kanoj Sarcar
  1999-06-15 20:24     ` Rik van Riel
@ 1999-06-16 20:37     ` Andrea Arcangeli
  1 sibling, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 1999-06-16 20:37 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Rik van Riel, linux-mm, sct

On Tue, 15 Jun 1999, Kanoj Sarcar wrote:

>What I am trying to find out is if it is enough to put these pages
>in the hash queue for swapper_inode, without really also putting
>them in the inode queue for swapper_inode. It's not like we ever 
>"truncate" swapper_inode, that we will need to go thru its i_pages
>list ...

Yes, putting them on the swapper inode queue as well is useless. It is
done this way only because the code uses a common interface.

>PS: Q4: who uses rw_swap_page_nolock, and what is shmfs? Note that
>rw_swap_page_nolock is the only caller that passes in non PageSwapCache
>pages into rw_swap_page_base(), which otherwise could assume that
>all pages passed into it are PageSwapCache, which would eliminate
>the need for a separate PG_swap_unlock_after bit.

Please look at:

	ftp://ftp.suse.com/pub/people/andrea/kernel-patches/2.2.10_andrea-VM5.gz

Andrea Arcangeli


* Re: filecache/swapcache questions
  1999-06-15  7:32 ` Rik van Riel
  1999-06-15 15:51   ` Kanoj Sarcar
@ 1999-06-17 23:33   ` Stephen C. Tweedie
  1999-06-18  0:20     ` Kanoj Sarcar
  1 sibling, 1 reply; 23+ messages in thread
From: Stephen C. Tweedie @ 1999-06-17 23:33 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Kanoj Sarcar, linux-mm

Hi,

On Tue, 15 Jun 1999 09:32:19 +0200 (CEST), Rik van Riel
<riel@nl.linux.org> said:

>> How will it be possible for a page to be in the swapcache, for its
>> reference count to be 1 (which has been checked just before), and for
>> its swap_count(page->offset) to also be 1? I can see this being
>> possible only if an unmap/exit path might lazily leave an anonymous
>> page in the swap cache, but I don't believe that happens.

> It does happen. We use a 'two-stage' reclamation process instead
> of page aging. It seems to work wonderfully -- nice page aging
> properties without the overhead. 

Much more than that: if we take a write fault to a page which is shared
on swap by two processes, then we bring it into cache and take a
copy-on-write, leaving one copy in the swap cache (reference one: it is
_only_ in use by the swap cache now), and the other copy being referenced
by the faulting process.

--Stephen

* Re: filecache/swapcache questions
  1999-06-17 23:33   ` Stephen C. Tweedie
@ 1999-06-18  0:20     ` Kanoj Sarcar
  1999-06-18 17:00       ` Stephen C. Tweedie
  0 siblings, 1 reply; 23+ messages in thread
From: Kanoj Sarcar @ 1999-06-18  0:20 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: riel, linux-mm

> 
> Hi,
> 
> On Tue, 15 Jun 1999 09:32:19 +0200 (CEST), Rik van Riel
> <riel@nl.linux.org> said:
> 
> >> How will it be possible for a page to be in the swapcache, for its
> >> reference count to be 1 (which has been checked just before), and for
> >> its swap_count(page->offset) to also be 1? I can see this being
> >> possible only if an unmap/exit path might lazily leave an anonymous
> >> page in the swap cache, but I don't believe that happens.
> 
> > It does happen. We use a 'two-stage' reclamation process instead
> > of page aging. It seems to work wonderfully -- nice page aging
> > properties without the overhead. 
> 
> Much more than that: if we take a write fault to a page which is shared
> on swap by two processes, then we bring it into cache and take a
> copy-on-write, leaving one copy in the swap cache (reference one: it is
> _only_ in use by the swap cache now), and the other copy being referenced
> by the faulting process.
> 
> --Stephen
> --

Interesting scenario ... unfortunately, I am getting confused.
I am trying to lay out the steps in your example here:

Step 1:  P1 and P2 sharing a page which is not in core, is out on
swap at swap handle X, swap_count(X) = 2 (P1 + P2)

Step 2: P1 writes to page.

        Step 2a: swap_in reads in the page into core into page A, 
page_count(A) = 2 (swapcache + P1), A.offset = X,  
swap_count (X)= 2 (P2 + swapcache)

        Step 2b: P1 incurs do_wp_page on the page, gets a new page. 
The old page A ends up with a page_count = 1 (swapcache), and
swap_count (X) stays at 2. 

So, what am I missing, since your example does not end up with 
page_count = 1 and swap_count(page offset/swaphandle) = 1?
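
The counts in the steps above can be replayed with a toy model in C. It
is only a sketch of the arithmetic, not kernel code: struct toy_counts
and the toy_step* functions are hypothetical names, and each step simply
applies the reference changes listed above.

```c
#include <assert.h>

/* Toy model only: counts for page A and swap handle X. */
struct toy_counts {
    int page_count; /* references to page A (0 while not in core) */
    int swap_count; /* references to swap handle X */
};

/* Step 1: P1 and P2 share the page, which lives only on swap. */
struct toy_counts toy_step1(void)
{
    struct toy_counts c = { 0, 2 }; /* swap_count: P1 + P2 */
    return c;
}

/* Step 2a: swap-in on behalf of P1. */
void toy_step2a(struct toy_counts *c)
{
    assert(c->page_count == 0);
    c->page_count = 2;  /* swap cache + P1 */
    c->swap_count += 1; /* the swap cache now references X */
    c->swap_count -= 1; /* P1's pte no longer holds X */
}

/* Step 2b: write fault, P1 moves to a private copy. */
void toy_step2b(struct toy_counts *c)
{
    assert(c->page_count >= 1);
    c->page_count -= 1; /* P1 drops its reference on page A */
}
```

After both steps the model ends with a page count of 1 but a swap count
of 2 (P2 + swapcache), which is why this scenario does not satisfy the
swap_count == 1 test in shrink_mmap.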

I did give an alternative scenario involving an exiting process,
do you believe that one?

While I have your attention, I think I found a bug in the
sys_swapoff algorithm ... basically, it needs to also look 
at swap_lockmap. Say an exiting process fired off some async
swap-ins just before it exited, and a bunch of these are in
flight (swap_lockmaps are set, as are swap_map, from swapcache).
The swap device gets deleted (with a printk warning message due
to non-zero swap_map count). Finally, the old async swap-ins
start terminating, invoking swap_after_unlock_page. Interesting
things could happen, depending on whether the swap id has been
reallocated or not ... Is there any protection against this
scenario?

Thanks.

Kanoj
kanoj@engr.sgi.com

* Re: filecache/swapcache questions
  1999-06-18  0:20     ` Kanoj Sarcar
@ 1999-06-18 17:00       ` Stephen C. Tweedie
  1999-06-18 17:03         ` Kanoj Sarcar
  0 siblings, 1 reply; 23+ messages in thread
From: Stephen C. Tweedie @ 1999-06-18 17:00 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, riel, linux-mm

Hi,

On Thu, 17 Jun 1999 17:20:10 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> Interesting scenario ... unfortunately, I am getting confused.
> I am trying to lay out the steps in your example here:
 
> Step 1:  P1 and P2 sharing a page which is not in core, is out on
> swap at swap handle X, swap_count(X) = 2 (P1 + P2)

> Step 2: P1 writes to page.

>         Step 2a: swap_in reads in the page into core into page A, 
> page_count(A) = 2 (swapcache + P1), A.offset = X,  
> swap_count (X)= 2 (P2 + swapcache)

Yes.  Exactly.

> So, what am I missing, since your example does not end up with 
> page_count = 1 and swap_count(page offset/swaphandle) = 1?

> I did give an alternative scenario involving an exiting process,
> do you believe that one?

Yes --- I'd missed the fact that you wanted swap_count to be one as well
as page count.

> While I have your attention, I think I found a bug in the
> sys_swapoff algorithm ... basically, it needs to also look 
> at swap_lockmap. Say an exiting process fired off some async
> swap-ins just before it exited, and a bunch of these are in
> flight (swap_lockmaps are set, as are swap_map, from swapcache).
> The swap device gets deleted (with a printk warning message due
> to non-zero swap_map count). Finally, the old async swap-ins
> start terminating, invoking swap_after_unlock_page. Interesting
> things could happen, depending on whether the swap id has been
> reallocated or not ... Is there any protection against this
> scenario?

Yes --- try_to_unuse calls read_swap_cache() with wait==1, so we always
wait for the IO to complete before swapoff can complete.  At least,
that's the theory. :)

--Stephen

* Re: filecache/swapcache questions
  1999-06-18 17:00       ` Stephen C. Tweedie
@ 1999-06-18 17:03         ` Kanoj Sarcar
  0 siblings, 0 replies; 23+ messages in thread
From: Kanoj Sarcar @ 1999-06-18 17:03 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: riel, linux-mm

> > While I have your attention, I think I found a bug in the
> > sys_swapoff algorithm ... basically, it needs to also look 
> > at swap_lockmap. Say an exiting process fired off some async
> > swap-ins just before it exited, and a bunch of these are in
> > flight (swap_lockmaps are set, as are swap_map, from swapcache).
> > The swap device gets deleted (with a printk warning message due
> > to non-zero swap_map count). Finally, the old async swap-ins
> > start terminating, invoking swap_after_unlock_page. Interesting
> > things could happen, depending on whether the swap id has been
> > reallocated or not ... Is there any protection against this
> > scenario?
> 
> Yes --- try_to_unuse calls read_swap_cache() with wait==1, so we always
> wait for the IO to complete before swapoff can complete.  At least,
> that's the theory. :)
>

I just figured that one out all by myself :-) Duuh ...

Note to myself : read the code, stupid, before spouting off ...

Thanks, Stephen.

Kanoj 

* Re: filecache/swapcache questions
  1999-06-28 22:36             ` Stephen C. Tweedie
@ 1999-06-28 23:24               ` Kanoj Sarcar
  0 siblings, 0 replies; 23+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 23:24 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-mm

> 
> Hi,
> 
> On Mon, 21 Jun 1999 11:46:27 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
> 
> >> Look no further than swap_in(), which knows that there is no pte (so
> >> swapout concurrency is not a problem) and it holds the mmap lock (so
> >> there are no concurrent swap_ins on the page).  It reads in the page and
> >> unconditionally sets up the pte to point to it, assuming that nobody
> >> else can conceivably set the pte while we do the swap ourselves.
> 
> > Hmm, am I being fooled by the comment in swap_in?
> 
> > /*
> >  * The tests may look silly, but it essentially makes sure that
> >  * no other process did a swap-in on us just as we were waiting.
> >  *
> 
> afaik only swapoff can trigger that.  Concurrent swap-in on the same
> entry can occur into the page cache, but not into the page tables
> because those are protected by the semaphore.
> 
> --Stephen
> 

Right ... I was trying to counter your argument that swapoff needs
to hold the mmap_sem to protect ptes (except for the fork/exit/swapin 
races) by pointing out that pte updates are already protected by 
kernel_lock.

Kanoj

* Re: filecache/swapcache questions
  1999-06-21 18:46           ` Kanoj Sarcar
  1999-06-21 23:44             ` Kanoj Sarcar
@ 1999-06-28 22:36             ` Stephen C. Tweedie
  1999-06-28 23:24               ` Kanoj Sarcar
  1 sibling, 1 reply; 23+ messages in thread
From: Stephen C. Tweedie @ 1999-06-28 22:36 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, linux-mm

Hi,

On Mon, 21 Jun 1999 11:46:27 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

>> Look no further than swap_in(), which knows that there is no pte (so
>> swapout concurrency is not a problem) and it holds the mmap lock (so
>> there are no concurrent swap_ins on the page).  It reads in the page and
>> unconditionally sets up the pte to point to it, assuming that nobody
>> else can conceivably set the pte while we do the swap ourselves.

> Hmm, am I being fooled by the comment in swap_in?

> /*
>  * The tests may look silly, but it essentially makes sure that
>  * no other process did a swap-in on us just as we were waiting.
>  *

afaik only swapoff can trigger that.  Concurrent swap-in on the same
entry can occur into the page cache, but not into the page tables
because those are protected by the semaphore.

--Stephen

* Re: filecache/swapcache questions
  1999-06-24 23:55                 ` Kanoj Sarcar
@ 1999-06-25  0:26                   ` Andrea Arcangeli
  0 siblings, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 1999-06-25  0:26 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: torvalds, sct, linux-mm

On Thu, 24 Jun 1999, Kanoj Sarcar wrote:

>The scenario that you lay out is not possible, as both Stephen and I
>pointed out earlier in this thread. swapoff uses read_swap_cache,
>so if a process has started a swapin, swapoff will wait for that io
>to complete. Note that swapoff can not proceed until the read-in is

Sorry, I forgot to specify where the faulting task was sleeping.

I wasn't talking about the case where the faulting task was sleeping on
I/O with the swap-cache page just allocated and hashed in the page cache.
If the task is sleeping waiting for I/O then I completely agree with you
that swapoff will block in lookup_swap_cache because it will see the
swap-cache page locked down by the faulting task.

In my case the faulting task was sleeping in _GFP_ (maybe swapping out
some stuff in sync mode). If you look at read_swap_cache_async you'll
notice that the task can go to sleep in GFP while holding an additional
reference into the swap space (see swap_duplicate). While the task was
sleeping, swapoff was allowed to allocate a new page, add that page to
the swap cache, start I/O on it, and finally remap the pte to the new
page. Then swapoff continued and noticed that there was still an
additional reference to the swap entry even though nobody was mapping
the swapped-out page anymore (the additional reference belonged to the
program sleeping in GFP).
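
The counting problem in this scenario can be sketched with a toy model in
C. This is an illustration under simplifying assumptions, not kernel
code: struct toy_swap_entry and the toy_* functions are hypothetical
names, and the swap cache's own reference is left out so that only the
pte references and the sleeping task's extra reference are counted.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: one swap entry and the ptes that point at it. */
struct toy_swap_entry {
    int count;       /* references to the swap entry */
    int mapped_ptes; /* ptes currently holding the swap handle */
};

/* The faulting task takes an extra reference (as swap_duplicate
 * would) and then goes to sleep in the page allocator. */
void toy_fault_takes_ref(struct toy_swap_entry *e)
{
    e->count += 1;
}

/* swapoff pages the entry in and rewrites every pte that held the
 * handle, dropping one reference per pte. Returns true if references
 * remain that no pte accounts for. */
bool toy_swapoff_unuse(struct toy_swap_entry *e)
{
    assert(e->count >= e->mapped_ptes);
    e->count -= e->mapped_ptes;
    e->mapped_ptes = 0;
    return e->count > 0;
}
```

With one pte and no sleeping task the count reaches 0 after
toy_swapoff_unuse(). With the sleeping task's extra reference it stays
at 1 although no pte refers to the entry any more, which is the leftover
count described above.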

>The swap lockmap deletion in 2.3.8 is not complete. I hope you will
>be taking in Andrea's "shm pages in swapcache" changes (although I

I'll send the shm patch to Linus in the next few days (but I bet nobody
will trigger the race in the meantime, especially since database people
keep their shm memory non-swappable).

Andrea Arcangeli


* Re: filecache/swapcache questions
  1999-06-24 22:23               ` Andrea Arcangeli
@ 1999-06-24 23:55                 ` Kanoj Sarcar
  1999-06-25  0:26                   ` Andrea Arcangeli
  0 siblings, 1 reply; 23+ messages in thread
From: Kanoj Sarcar @ 1999-06-24 23:55 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: torvalds, sct, linux-mm

> 
> On Mon, 21 Jun 1999, Kanoj Sarcar wrote:
> 
> >And continuing on with the problems with swapoff ...
> 
> I have not yet thought about the races you are talking about in the thread.
> 
> But I think I have seen another potential problem related to swapoff in
> the last few days. Suppose you run swapoff -a while a program is handling
> a swap-in fault. The process is sleeping in read_swap_cache_async() after
> having increased the swap count (this is the only problem). While the
> task is sleeping, swapoff will swap in the page and map the swapped-in
> page into the pte of the process. Then swapoff continues and sees that
> the swap count is still > 0 (1 in the example) even though the page has
> been swapped in for all tasks in the system. Swapoff gets confused and
> sets the swap count to 0 by hand (and in doing so slightly corrupts the
> state of the VM). I think I reproduced the above scenario while stress
> testing 2.3.8 + my VM changes (finally "stable" except for the
> buffer-beyond-end-of-device problem), but if the problem I have seen is
> real then it will apply to 2.2.x as well.
> 
> Andrea Arcangeli
> 

Andrea, 

The scenario that you lay out is not possible, as both Stephen and I
pointed out earlier in this thread. swapoff uses read_swap_cache,
so if a process has started a swapin, swapoff will wait for that io
to complete. Note that swapoff can not proceed until the read-in is
complete (at which point the swapcount is decremented by 
PG_swap_unlock_after logic). So, it is not possible for swapoff
to see swap count > 0. At least in theory ...

As to why you might be seeing the problem, this might be due
to fork/exit races with swapoff (which I pointed out in this thread), 
which I hope to have a fix for sometime soon (although it looks ugly). 
Also, see below.

Linus,

The swap lockmap deletion in 2.3.8 is not complete. I hope you will
be taking in Andrea's "shm pages in swapcache" changes (although I
haven't reviewed it, so I can't attest to its goodness). One problem
in 2.3.8 is that a shm page could be getting swapped out, and a swapoff
could actually read the contents of the swaphandle into a new page,
*before* the swapout completed (this was prevented in 2.3.7 in
rw_swap_page_base() by swap lockmap checking), since shm pages are 
not in the swap cache (thus swapoff would have no way of synchronizing
with the swapout completing). This could lead to shm data getting
corrupted, and also to swapoff manually setting swapcount to 0 while
shm swapout completion decrements swapcount as well.

Or maybe I am just confused ....

Kanoj
kanoj@engr.sgi.com

* Re: filecache/swapcache questions
  1999-06-21 23:44             ` Kanoj Sarcar
@ 1999-06-24 22:23               ` Andrea Arcangeli
  1999-06-24 23:55                 ` Kanoj Sarcar
  0 siblings, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 1999-06-24 22:23 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: sct, linux-mm

On Mon, 21 Jun 1999, Kanoj Sarcar wrote:

>And continuing on with the problems with swapoff ...

I have not yet thought about the races you are talking about in the thread.

But I think I have seen another potential problem related to swapoff in
the last few days. Suppose you run swapoff -a while a program is handling
a swap-in fault. The process is sleeping in read_swap_cache_async() after
having increased the swap count (this is the only problem). While the
task is sleeping, swapoff will swap in the page and map the swapped-in
page into the pte of the process. Then swapoff continues and sees that
the swap count is still > 0 (1 in the example) even though the page has
been swapped in for all tasks in the system. Swapoff gets confused and
sets the swap count to 0 by hand (and in doing so slightly corrupts the
state of the VM). I think I reproduced the above scenario while stress
testing 2.3.8 + my VM changes (finally "stable" except for the
buffer-beyond-end-of-device problem), but if the problem I have seen is
real then it will apply to 2.2.x as well.

Andrea Arcangeli


* Re: filecache/swapcache questions
  1999-06-21 18:46           ` Kanoj Sarcar
@ 1999-06-21 23:44             ` Kanoj Sarcar
  1999-06-24 22:23               ` Andrea Arcangeli
  1999-06-28 22:36             ` Stephen C. Tweedie
  1 sibling, 1 reply; 23+ messages in thread
From: Kanoj Sarcar @ 1999-06-21 23:44 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: sct, linux-mm

And continuing on with the problems with swapoff ...

While forking, we copy swap handles from the parent into the child
in copy_page_range. There are of course sleep points in dup_mmap
(kmem_cache_alloc would be one, vm_ops->open could be another). 

A swapoff coming in at this point might scan the process list, not
find the nascent child, and just delete the device, leaving the
child referencing the old swap handles.

Regardless of our current discussions about why the mmap_sem 
is needed in swapoff to protect ptes, it seems that grabbing it
in swapoff could trivially solve this fork race ... and some code
changes in exit_mmap could also fix the exit race ...
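
The fork race can be sketched with a toy model in C. It is only an
illustration, not kernel code: struct toy_task and toy_swapoff_scan()
are hypothetical names, and "on_task_list" stands in for whether the
process-list scan in swapoff can find the task at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: a task that may hold a swap handle in a pte. */
struct toy_task {
    bool on_task_list;      /* visible to the swapoff scan? */
    bool holds_swap_handle; /* a pte still refers to the swap device */
};

/* One swapoff pass: every visible task gets its swap handle replaced
 * by an in-core page. Returns the number of handles that survive the
 * pass because their owner was not visible. */
int toy_swapoff_scan(struct toy_task *tasks, size_t n)
{
    int leftover = 0;

    for (size_t i = 0; i < n; i++) {
        if (tasks[i].on_task_list)
            tasks[i].holds_swap_handle = false; /* paged back in */
        else if (tasks[i].holds_swap_handle)
            leftover++; /* nascent child missed by the scan */
    }
    return leftover;
}
```

A child that has already copied the parent's swap handles but is not yet
visible to the scan is left with a stale handle, while the same scan
leaves nothing behind once the child is visible.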

Kanoj
kanoj@engr.sgi.com


* Re: filecache/swapcache questions
  1999-06-21 17:49         ` Stephen C. Tweedie
@ 1999-06-21 18:46           ` Kanoj Sarcar
  1999-06-21 23:44             ` Kanoj Sarcar
  1999-06-28 22:36             ` Stephen C. Tweedie
  0 siblings, 2 replies; 23+ messages in thread
From: Kanoj Sarcar @ 1999-06-21 18:46 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-mm

> 
> Hi, 
> 
> On Mon, 21 Jun 1999 10:36:37 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
> 
> > But doesn't my previous logic work in this case too? Namely
> > that kernel_lock is held when any code looks at or changes
> > a pte, so if swapoff holds the kernel_lock and never goes to 
> > sleep, things should work?
> 
> No, because the swapoff could still take place while a normal swapin is
> already in progress.
> 
> > Maybe if you can jot down a quick scenario where a problem occurs when
> > swapoff does not take mmap_sem, it would be easier for me to spot
> > which concurrency issue I am missing ...
> 
> Look no further than swap_in(), which knows that there is no pte (so
> swapout concurrency is not a problem) and it holds the mmap lock (so
> there are no concurrent swap_ins on the page).  It reads in the page and
> unconditionally sets up the pte to point to it, assuming that nobody
> else can conceivably set the pte while we do the swap ourselves.
> 
> --Stephen
> 

Hmm, am I being fooled by the comment in swap_in?

/*
 * The tests may look silly, but it essentially makes sure that
 * no other process did a swap-in on us just as we were waiting.
 *

Also, swap_in seems to be revalidating the pte if it goes to
sleep:

        if (pte_val(*page_table) != entry) {
                if (page_map)
                        free_page_and_swap_cache(page_address(page_map));
                return;
        }

All this while holding kernel_lock ...

So, I am still mystified about why swapoff would need the mmap_sem.
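
A minimal userspace sketch of the recheck quoted above, for readers
following along. The names pte_sim_t, fault_ctx and install_if_unchanged
are hypothetical stand-ins, not kernel identifiers, and the sketch models
only the "compare the slot against the entry seen before sleeping" step.
It says nothing about which lock makes that comparison safe.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a page table entry value (not a kernel type). */
typedef unsigned long pte_sim_t;

/* State a fault handler would carry across a possible sleep. */
struct fault_ctx {
    pte_sim_t *page_table; /* slot that held the swap entry */
    pte_sim_t  entry;      /* swap entry observed before sleeping */
};

/*
 * Install new_pte only if the slot still holds the entry we saw before
 * sleeping. Returns true when the mapping was installed, false when
 * something else changed the slot in the meantime (the caller would
 * then drop the page it read in).
 */
bool install_if_unchanged(struct fault_ctx *ctx, pte_sim_t new_pte)
{
    if (*ctx->page_table != ctx->entry)
        return false;
    *ctx->page_table = new_pte;
    return true;
}
```

The check only helps if every path that sleeps goes through it. A path
that sets the slot unconditionally after sleeping would not be covered.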

Kanoj


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: filecache/swapcache questions
  1999-06-21 17:36       ` Kanoj Sarcar
@ 1999-06-21 17:49         ` Stephen C. Tweedie
  1999-06-21 18:46           ` Kanoj Sarcar
  0 siblings, 1 reply; 23+ messages in thread
From: Stephen C. Tweedie @ 1999-06-21 17:49 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, linux-mm

Hi, 

On Mon, 21 Jun 1999 10:36:37 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> But doesn't my previous logic work in this case too? Namely
> that kernel_lock is held when any code looks at or changes
> a pte, so if swapoff holds the kernel_lock and never goes to 
> sleep, things should work?

No, because the swapoff could still take place while a normal swapin is
already in progress.

> Maybe if you can jot down a quick scenario where a problem occurs when
> swapoff does not take mmap_sem, it would be easier for me to spot
> which concurrency issue I am missing ...

Look no further than swap_in(), which knows that there is no pte (so
swapout concurrency is not a problem) and it holds the mmap lock (so
there are no concurrent swap_ins on the page).  It reads in the page and
unconditionally sets up the pte to point to it, assuming that nobody
else can conceivably set the pte while we do the swap ourselves.

--Stephen

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: filecache/swapcache questions
  1999-06-21 16:57     ` Stephen C. Tweedie
@ 1999-06-21 17:36       ` Kanoj Sarcar
  1999-06-21 17:49         ` Stephen C. Tweedie
  0 siblings, 1 reply; 23+ messages in thread
From: Kanoj Sarcar @ 1999-06-21 17:36 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-mm

> 
> Hi,
> 
> On Mon, 21 Jun 1999 09:46:19 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
> 
> > I don't agree with you about swapoff needing the mmap_sem. In my
> > thinking, mmap_sem is needed to preserve the vma list, *if* you 
> > go to sleep while scanning the list. Updates to the vma fields/
> > chain are protected by kernel_lock and mmap_sem. 
> 
> No.  mmap_sem protects both the vma list and the page tables.  Page
> faults hold the mmap semaphore both to protect the vma list and to
> protect against concurrent pageins to the same page.  
> 
> The swapper is currently exempt from the mmap_sem, so the paging code
> needs to check whether the current pte has disappeared if it ever
> blocks, but it assumes that we never have concurrent pagein occurring
> (think threads).  swapoff currently breaks that assumption.
>

But doesn't my previous logic work in this case too? Namely
that kernel_lock is held when any code looks at or changes
a pte, so if swapoff holds the kernel_lock and never goes to 
sleep, things should work?

Maybe if you can jot down a quick scenario where a problem occurs
when swapoff does not take mmap_sem, it would be easier for me
to spot which concurrency issue I am missing ...

Thanks.

Kanoj
kanoj@engr.sgi.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: filecache/swapcache questions
  1999-06-21 16:46   ` Kanoj Sarcar
@ 1999-06-21 16:57     ` Stephen C. Tweedie
  1999-06-21 17:36       ` Kanoj Sarcar
  0 siblings, 1 reply; 23+ messages in thread
From: Stephen C. Tweedie @ 1999-06-21 16:57 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, linux-mm

Hi,

On Mon, 21 Jun 1999 09:46:19 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> I don't agree with you about swapoff needing the mmap_sem. In my
> thinking, mmap_sem is needed to preserve the vma list, *if* you 
> go to sleep while scanning the list. Updates to the vma fields/
> chain are protected by kernel_lock and mmap_sem. 

No.  mmap_sem protects both the vma list and the page tables.  Page
faults hold the mmap semaphore both to protect the vma list and to
protect against concurrent pageins to the same page.  

The swapper is currently exempt from the mmap_sem, so the paging code
needs to check whether the current pte has disappeared if it ever
blocks, but it assumes that we never have concurrent pagein occurring
(think threads).  swapoff currently breaks that assumption.

--Stephen

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: filecache/swapcache questions
  1999-06-21 11:25 ` Stephen C. Tweedie
@ 1999-06-21 16:46   ` Kanoj Sarcar
  1999-06-21 16:57     ` Stephen C. Tweedie
  0 siblings, 1 reply; 23+ messages in thread
From: Kanoj Sarcar @ 1999-06-21 16:46 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-mm

> 
> Hi,
> 
> On Sun, 20 Jun 1999 22:29:14 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
> 
> > Imagine a process exiting, executing exit_mmap. exit_mmap
> > cleans out the vma list from the mm, i.e. sets mm->mmap = 0.
> > Then, it invokes vm_ops->unmap, say on a MAP_SHARED file
> > vma, which starts file io, that puts the process to sleep.
> 
> > Now, a sys_swapoff comes in ... this will not be able to
> > retrieve the swap handles from the former process (since
> > the vma's are invisible), so it may end up deleting the 
> > device with a warning message about non 0 swap_map count.
> 
> > The exiting process then invokes a bunch of swap_free()s
> > via zap_page_range, whereas the swap id might already have
> > been reassigned.
> 
> Agreed.
> 
> > If there's no protection against this, a possible fix would 
> > be for exit_mmap not to clean the vma list, rather delete a
> > vma at a time from the list.
> 
> Looking at this, we have other problems: the forced swapin caused by
> sys_swapoff() doesn't down() the mmap semaphore.  That is very bad
> indeed.  We need to fix it.  If we fix it, then we can fix exit_mmap()
> at the same time by taking the mmap semaphore while we do the
> unmap/close operations.
> 
> --Stephen
> 

I don't agree with you about swapoff needing the mmap_sem. In my
thinking, mmap_sem is needed to preserve the vma list, *if* you 
go to sleep while scanning the list. Updates to the vma fields/
chain are protected by kernel_lock and mmap_sem. If you are scanning
the vma list, and are guaranteed not to sleep, why would you need
to grab mmap_sem, if you already have the kernel_lock, like 
swapoff does?

Yes, but I agree we can play it safe and grab the lock ... that
might make it easier to synchronize with exit_mmap. Let me think
about this and post a possible patch.

Thanks.

Kanoj
kanoj@engr.sgi.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: filecache/swapcache questions
  1999-06-21  5:29 Kanoj Sarcar
@ 1999-06-21 11:25 ` Stephen C. Tweedie
  1999-06-21 16:46   ` Kanoj Sarcar
  0 siblings, 1 reply; 23+ messages in thread
From: Stephen C. Tweedie @ 1999-06-21 11:25 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: sct, linux-mm

Hi,

On Sun, 20 Jun 1999 22:29:14 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> Imagine a process exiting, executing exit_mmap. exit_mmap
> cleans out the vma list from the mm, i.e. sets mm->mmap = 0.
> Then, it invokes vm_ops->unmap, say on a MAP_SHARED file
> vma, which starts file io, that puts the process to sleep.

> Now, a sys_swapoff comes in ... this will not be able to
> retrieve the swap handles from the former process (since
> the vma's are invisible), so it may end up deleting the 
> device with a warning message about non 0 swap_map count.

> The exiting process then invokes a bunch of swap_free()s
> via zap_page_range, whereas the swap id might already have
> been reassigned.

Agreed.

> If there's no protection against this, a possible fix would 
> be for exit_mmap not to clean the vma list, rather delete a
> vma at a time from the list.

Looking at this, we have other problems: the forced swapin caused by
sys_swapoff() doesn't down() the mmap semaphore.  That is very bad
indeed.  We need to fix it.  If we fix it, then we can fix exit_mmap()
at the same time by taking the mmap semaphore while we do the
unmap/close operations.

--Stephen

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: filecache/swapcache questions
@ 1999-06-21  5:29 Kanoj Sarcar
  1999-06-21 11:25 ` Stephen C. Tweedie
  0 siblings, 1 reply; 23+ messages in thread
From: Kanoj Sarcar @ 1999-06-21  5:29 UTC (permalink / raw)
  To: sct; +Cc: linux-mm

Okay, let's see if I am being stupid again ...

Imagine a process exiting, executing exit_mmap. exit_mmap
cleans out the vma list from the mm, i.e. sets mm->mmap = 0.
Then, it invokes vm_ops->unmap, say on a MAP_SHARED file
vma, which starts file io, that puts the process to sleep.

Now, a sys_swapoff comes in ... this will not be able to
retrieve the swap handles from the former process (since
the vma's are invisible), so it may end up deleting the 
device with a warning message about non 0 swap_map count.

The exiting process then invokes a bunch of swap_free()s
via zap_page_range, whereas the swap id might already have
been reassigned.

If there's no protection against this, a possible fix would 
be for exit_mmap not to clean the vma list, rather delete a
vma at a time from the list.
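
To make the difference concrete, here is a small userspace sketch under
loud assumptions: vma_sim, mm_sim, visible_handles, detach_all and
detach_one are hypothetical names, not kernel code. It only models what
a scanner walking mm->mmap can still see under each teardown style. How
the vma currently being torn down is kept visible is left open.

```c
#include <stddef.h>

/* Hypothetical stand-in for a vma: just a count of swap handles it maps. */
struct vma_sim {
    int swap_handles;
    struct vma_sim *next;
};

/* Hypothetical stand-in for an mm: only the head of the vma list. */
struct mm_sim {
    struct vma_sim *mmap;
};

/* What a swapoff-style scan can reach: handles in vmas still on the list. */
int visible_handles(const struct mm_sim *mm)
{
    int total = 0;
    for (const struct vma_sim *v = mm->mmap; v != NULL; v = v->next)
        total += v->swap_handles;
    return total;
}

/* Behaviour described above: detach the whole list before any teardown. */
struct vma_sim *detach_all(struct mm_sim *mm)
{
    struct vma_sim *head = mm->mmap;
    mm->mmap = NULL; /* every vma is now invisible to a scanner */
    return head;
}

/* Proposed alternative: unlink only the first vma, leave the rest linked. */
struct vma_sim *detach_one(struct mm_sim *mm)
{
    struct vma_sim *v = mm->mmap;
    if (v != NULL) {
        mm->mmap = v->next;
        v->next = NULL;
    }
    return v;
}
```

With three vmas holding 2, 3 and 5 handles, detach_all leaves a scanner
with nothing to find, while detach_one still leaves the 3 + 5 handles of
the remaining vmas reachable if the exiting process sleeps.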

So, what is the call to swap_free doing in filemap_sync_pte?
When will this call ever be executed?

Thanks.

Kanoj
kanoj@engr.sgi.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~1999-06-28 23:24 UTC | newest]

Thread overview: 23+ messages
1999-06-15  7:16 filecache/swapcache questions Kanoj Sarcar
1999-06-15  7:32 ` Rik van Riel
1999-06-15 15:51   ` Kanoj Sarcar
1999-06-15 20:24     ` Rik van Riel
1999-06-15 21:02       ` Kanoj Sarcar
1999-06-16 20:37     ` Andrea Arcangeli
1999-06-17 23:33   ` Stephen C. Tweedie
1999-06-18  0:20     ` Kanoj Sarcar
1999-06-18 17:00       ` Stephen C. Tweedie
1999-06-18 17:03         ` Kanoj Sarcar
1999-06-21  5:29 Kanoj Sarcar
1999-06-21 11:25 ` Stephen C. Tweedie
1999-06-21 16:46   ` Kanoj Sarcar
1999-06-21 16:57     ` Stephen C. Tweedie
1999-06-21 17:36       ` Kanoj Sarcar
1999-06-21 17:49         ` Stephen C. Tweedie
1999-06-21 18:46           ` Kanoj Sarcar
1999-06-21 23:44             ` Kanoj Sarcar
1999-06-24 22:23               ` Andrea Arcangeli
1999-06-24 23:55                 ` Kanoj Sarcar
1999-06-25  0:26                   ` Andrea Arcangeli
1999-06-28 22:36             ` Stephen C. Tweedie
1999-06-28 23:24               ` Kanoj Sarcar
