Re: filecache/swapcache questions

linux-mm.kvack.org archive mirror

* Re: filecache/swapcache questions
@ 1999-06-21  5:29 Kanoj Sarcar
  1999-06-21 11:25 ` Stephen C. Tweedie
  0 siblings, 1 reply; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-21  5:29 UTC (permalink / raw)
  To: sct; +Cc: linux-mm

Okay, let's see if I am being stupid again ...

Imagine a process exiting, executing exit_mmap. exit_mmap
cleans out the vma list from the mm, i.e. sets mm->mmap = 0.
Then, it invokes vm_ops->unmap, say on a MAP_SHARED file
vma, which starts file io, that puts the process to sleep.

Now, a sys_swapoff comes in ... this will not be able to
retrieve the swap handles from the former process (since
the vma's are invisible), so it may end up deleting the 
device with a warning message about non 0 swap_map count.

The exiting process then invokes a bunch of swap_free()s
via zap_page_range, whereas the swap id might already have
been reassigned.

If there's no protection against this, a possible fix would
be for exit_mmap not to clear the vma list up front, but rather
to delete one vma at a time from the list.

So, what is the call to swap_free doing in filemap_sync_pte?
When will this call ever be executed?

Thanks.

Kanoj
kanoj@engr.sgi.com
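
A minimal user-space sketch of the visibility problem described above, assuming a toy singly linked vma list. All names (toy_vma, toy_task_mm, toy_visible_refs, toy_detach_all, toy_pop_one) are hypothetical and are not kernel interfaces. It only shows that a scanner walking mm->mmap cannot see swap references held by vmas that were detached in one go, whereas unlinking one vma at a time leaves the remaining ones visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: a vma that holds some swap references. */
struct toy_vma {
    struct toy_vma *next;
    int swap_refs;
};

struct toy_task_mm {
    struct toy_vma *mmap;   /* head of the vma list */
};

/* What a swapoff-style scan can reach by walking the list. */
int toy_visible_refs(const struct toy_task_mm *mm)
{
    int total = 0;
    const struct toy_vma *v;

    assert(mm != NULL);
    for (v = mm->mmap; v != NULL; v = v->next)
        total += v->swap_refs;
    return total;
}

/* Detach the whole list at once, as described for exit_mmap. */
struct toy_vma *toy_detach_all(struct toy_task_mm *mm)
{
    struct toy_vma *head = mm->mmap;

    mm->mmap = NULL;        /* every vma is now invisible to a scan */
    return head;
}

/* Alternative: unlink only the first vma, leave the rest reachable. */
struct toy_vma *toy_pop_one(struct toy_task_mm *mm)
{
    struct toy_vma *v = mm->mmap;

    if (v != NULL) {
        mm->mmap = v->next;
        v->next = NULL;
    }
    return v;
}
```

In this model the vma currently being torn down is still invisible in both variants, so the one-at-a-time approach narrows the window rather than closing it.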
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/


* Re: filecache/swapcache questions
  1999-06-21  5:29 filecache/swapcache questions Kanoj Sarcar
@ 1999-06-21 11:25 ` Stephen C. Tweedie
  1999-06-21 16:46   ` Kanoj Sarcar
  0 siblings, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-21 11:25 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: sct, linux-mm

Hi,

On Sun, 20 Jun 1999 22:29:14 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> Imagine a process exiting, executing exit_mmap. exit_mmap
> cleans out the vma list from the mm, i.e. sets mm->mmap = 0.
> Then, it invokes vm_ops->unmap, say on a MAP_SHARED file
> vma, which starts file io, that puts the process to sleep.

> Now, a sys_swapoff comes in ... this will not be able to
> retrieve the swap handles from the former process (since
> the vma's are invisible), so it may end up deleting the 
> device with a warning message about non 0 swap_map count.

> The exiting process then invokes a bunch of swap_free()s
> via zap_page_range, whereas the swap id might already have
> been reassigned.

Agreed.

> If there's no protection against this, a possible fix would
> be for exit_mmap not to clear the vma list up front, but rather
> to delete one vma at a time from the list.

Looking at this, we have other problems: the forced swapin caused by
sys_swapoff() doesn't down() the mmap semaphore.  That is very bad
indeed.  We need to fix it.  If we fix it, then we can fix exit_mmap()
at the same time by taking the mmap semaphore while we do the
unmap/close operations.

--Stephen

* Re: filecache/swapcache questions
  1999-06-21 11:25 ` Stephen C. Tweedie
@ 1999-06-21 16:46   ` Kanoj Sarcar
  1999-06-21 16:57     ` Stephen C. Tweedie
  0 siblings, 1 reply; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-21 16:46 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-mm

> 
> Hi,
> 
> On Sun, 20 Jun 1999 22:29:14 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
> 
> > Imagine a process exiting, executing exit_mmap. exit_mmap
> > cleans out the vma list from the mm, i.e. sets mm->mmap = 0.
> > Then, it invokes vm_ops->unmap, say on a MAP_SHARED file
> > vma, which starts file io, that puts the process to sleep.
> 
> > Now, a sys_swapoff comes in ... this will not be able to
> > retrieve the swap handles from the former process (since
> > the vma's are invisible), so it may end up deleting the 
> > device with a warning message about non 0 swap_map count.
> 
> > The exiting process then invokes a bunch of swap_free()s
> > via zap_page_range, whereas the swap id might already have
> > been reassigned.
> 
> Agreed.
> 
> > If there's no protection against this, a possible fix would
> > be for exit_mmap not to clear the vma list up front, but rather
> > to delete one vma at a time from the list.
> 
> Looking at this, we have other problems: the forced swapin caused by
> sys_swapoff() doesn't down() the mmap semaphore.  That is very bad
> indeed.  We need to fix it.  If we fix it, then we can fix exit_mmap()
> at the same time by taking the mmap semaphore while we do the
> unmap/close operations.
> 
> --Stephen
> 

I don't agree with you about swapoff needing the mmap_sem. In my
thinking, mmap_sem is needed to preserve the vma list, *if* you 
go to sleep while scanning the list. Updates to the vma fields/
chain are protected by kernel_lock and mmap_sem. If you are scanning
the vma list, and are guaranteed not to sleep, why would you need
to grab mmap_sem, if you already have the kernel_lock, like 
swapoff does?

Yes, but I agree we can play it safe and grab the lock ... that
might make it easier to synchronize with exit_mmap. Let me think
about this and post a possible patch.

Thanks.

Kanoj
kanoj@engr.sgi.com

* Re: filecache/swapcache questions
  1999-06-21 16:46   ` Kanoj Sarcar
@ 1999-06-21 16:57     ` Stephen C. Tweedie
  1999-06-21 17:36       ` Kanoj Sarcar
  0 siblings, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-21 16:57 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, linux-mm

Hi,

On Mon, 21 Jun 1999 09:46:19 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> I don't agree with you about swapoff needing the mmap_sem. In my
> thinking, mmap_sem is needed to preserve the vma list, *if* you 
> go to sleep while scanning the list. Updates to the vma fields/
> chain are protected by kernel_lock and mmap_sem. 

No.  mmap_sem protects both the vma list and the page tables.  Page
faults hold the mmap semaphore both to protect the vma list and to
protect against concurrent pageins to the same page.

The swapper is currently exempt from the mmap_sem, so the paging code
needs to check whether the current pte has disappeared if it ever
blocks, but it assumes that we never have concurrent pagein occurring
(think threads).  swapoff currently breaks that assumption.

--Stephen

* Re: filecache/swapcache questions
  1999-06-21 16:57     ` Stephen C. Tweedie
@ 1999-06-21 17:36       ` Kanoj Sarcar
  1999-06-21 17:49         ` Stephen C. Tweedie
  0 siblings, 1 reply; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-21 17:36 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-mm

> 
> Hi,
> 
> On Mon, 21 Jun 1999 09:46:19 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
> 
> > I don't agree with you about swapoff needing the mmap_sem. In my
> > thinking, mmap_sem is needed to preserve the vma list, *if* you 
> > go to sleep while scanning the list. Updates to the vma fields/
> > chain are protected by kernel_lock and mmap_sem. 
> 
> No.  mmap_sem protects both the vma list and the page tables.  Page
> faults hold the mmap semaphore both to protect the vma list and to
> protect against concurrent pageins to the same page.
> 
> The swapper is currently exempt from the mmap_sem, so the paging code
> needs to check whether the current pte has disappeared if it ever
> blocks, but it assumes that we never have concurrent pagein occurring
> (think threads).  swapoff currently breaks that assumption.
>

But doesn't my previous logic work in this case too? Namely
that kernel_lock is held when any code looks at or changes
a pte, so if swapoff holds the kernel_lock and never goes to 
sleep, things should work?

Maybe if you can jot down a quick scenario where a problem occurs
when swapoff does not take mmap_sem, it would be easier for me
to spot which concurrency issue I am missing ...

Thanks.

Kanoj
kanoj@engr.sgi.com

* Re: filecache/swapcache questions
  1999-06-21 17:36       ` Kanoj Sarcar
@ 1999-06-21 17:49         ` Stephen C. Tweedie
  1999-06-21 18:46           ` Kanoj Sarcar
  0 siblings, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-21 17:49 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, linux-mm

Hi, 

On Mon, 21 Jun 1999 10:36:37 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> But doesn't my previous logic work in this case too? Namely
> that kernel_lock is held when any code looks at or changes
> a pte, so if swapoff holds the kernel_lock and never goes to 
> sleep, things should work?

No, because the swapoff could still take place while a normal swapin is
already in progress.

> Maybe if you can jot down a quick scenario where a problem occurs when
> swapoff does not take mmap_sem, it would be easier for me to spot
> which concurrency issue I am missing ...

Look no further than swap_in(), which knows that there is no pte (so
swapout concurrency is not a problem) and it holds the mmap lock (so
there are no concurrent swap_ins on the page).  It reads in the page and
unconditionally sets up the pte to point to it, assuming that nobody
else can conceivably set the pte while we do the swap ourselves.

--Stephen

* Re: filecache/swapcache questions
  1999-06-21 17:49         ` Stephen C. Tweedie
@ 1999-06-21 18:46           ` Kanoj Sarcar
  1999-06-21 23:44             ` Kanoj Sarcar
  1999-06-28 22:36             ` filecache/swapcache questions Stephen C. Tweedie
  0 siblings, 2 replies; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-21 18:46 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-mm

> 
> Hi, 
> 
> On Mon, 21 Jun 1999 10:36:37 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
> 
> > But doesn't my previous logic work in this case too? Namely
> > that kernel_lock is held when any code looks at or changes
> > a pte, so if swapoff holds the kernel_lock and never goes to 
> > sleep, things should work?
> 
> No, because the swapoff could still take place while a normal swapin is
> already in progress.
> 
> > Maybe if you can jot down a quick scenario where a problem occurs when
> > swapoff does not take mmap_sem, it would be easier for me to spot
> > which concurrency issue I am missing ...
> 
> Look no further than swap_in(), which knows that there is no pte (so
> swapout concurrency is not a problem) and it holds the mmap lock (so
> there are no concurrent swap_ins on the page).  It reads in the page and
> unconditionally sets up the pte to point to it, assuming that nobody
> else can conceivably set the pte while we do the swap ourselves.
> 
> --Stephen
> 

Hmm, am I being fooled by the comment in swap_in?

/*
 * The tests may look silly, but it essentially makes sure that
 * no other process did a swap-in on us just as we were waiting.
 *

Also, swap_in seems to be revalidating the pte if it goes to
sleep:

        if (pte_val(*page_table) != entry) {
                if (page_map)
                        free_page_and_swap_cache(page_address(page_map));
                return;
        }

All this while holding kernel_lock ...

So, I am still mystified about why swapoff would need the mmap_sem.

Kanoj
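
The revalidation check quoted above can be modelled in user space. This is a minimal sketch, assuming a toy pte that is just an unsigned long; every name in it (toy_pte_t, toy_swap_in, toy_swapoff_wins, TOY_PRESENT_BIT) is hypothetical. The hook parameter stands in for whatever runs while the faulting task is blocked on the read.

```c
#include <assert.h>
#include <stddef.h>

/* Toy page table entry: either a swap entry or a "present" value. */
typedef unsigned long toy_pte_t;

#define TOY_PRESENT_BIT 0x1UL   /* low bit set means the page is mapped */

/* Stands in for anything that runs while the faulting task sleeps. */
typedef void (*toy_sleep_hook)(toy_pte_t *pte);

/* Returns 1 if this call installed the mapping, 0 if it backed off. */
int toy_swap_in(toy_pte_t *pte, toy_pte_t entry,
                unsigned long new_page, toy_sleep_hook during_io)
{
    assert(pte != NULL);

    /* The read from swap may block; the pte can change meanwhile. */
    if (during_io != NULL)
        during_io(pte);

    /* Revalidate: if the pte no longer holds the swap entry, someone
     * else finished the swap-in and our copy must be dropped. */
    if (*pte != entry)
        return 0;

    *pte = new_page | TOY_PRESENT_BIT;
    return 1;
}

/* Simulates swapoff mapping the page while the faulting task sleeps. */
void toy_swapoff_wins(toy_pte_t *pte)
{
    *pte = 0x2000UL | TOY_PRESENT_BIT;
}
```

Dropping the comparison turns the second store into an unconditional overwrite of whatever swapoff installed, which is the case the check guards against in this model.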


* Re: filecache/swapcache questions
  1999-06-21 18:46           ` Kanoj Sarcar
@ 1999-06-21 23:44             ` Kanoj Sarcar
  1999-06-24 22:23               ` Andrea Arcangeli
  1999-06-28 22:36             ` filecache/swapcache questions Stephen C. Tweedie
  1 sibling, 1 reply; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-21 23:44 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: sct, linux-mm

And continuing on with the problems with swapoff ...

While forking, we copy swap handles from the parent into the child
in copy_page_range. There are of course sleep points in dup_mmap
(kmem_cache_alloc would be one, vm_ops->open could be another). 

A swapoff coming in at this point might scan the process list, not
find the nascent child, and just delete the device, leaving the
child referencing the old swap handles.

Regardless of our current discussions about why the mmap_sem
is needed in swapoff to protect ptes, it seems that grabbing it
in swapoff could trivially solve this fork race ... and some code
changes in exit_mmap could also fix the exit race ...

Kanoj
kanoj@engr.sgi.com


* Re: filecache/swapcache questions
  1999-06-21 23:44             ` Kanoj Sarcar
@ 1999-06-24 22:23               ` Andrea Arcangeli
  1999-06-24 23:55                 ` Kanoj Sarcar
  0 siblings, 1 reply; 60+ messages in thread
From: Andrea Arcangeli @ 1999-06-24 22:23 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: sct, linux-mm

On Mon, 21 Jun 1999, Kanoj Sarcar wrote:

>And continuing on with the problems with swapoff ...

I have not yet thought about the races you are talking about in the thread.

But I think I have seen another potential problem related to swapoff in the
last few days. Consider running swapoff -a while a program is
faulting in a swapin exception. The process is sleeping in
read_swap_cache_async() after having increased the swap count (this is the
only problem). While the task is sleeping, swapoff will swap in the page and
map the swapped-in page into the pte of the process.
Then swapoff continues and sees that the swap count is still >
0 (1 in the example) even though the page has been swapped in for all tasks in
the system. Swapoff gets confused and sets the swap count to 0 by hand (and
in doing so it slightly corrupts the state of the VM). I think I reproduced
the above scenario stress testing 2.3.8 + my VM changes (finally "stable"
except for the buffer beyond end of the device problem), but if the problem
I have seen is real then it will apply to 2.2.x as well.

Andrea Arcangeli


* Re: filecache/swapcache questions
  1999-06-24 22:23               ` Andrea Arcangeli
@ 1999-06-24 23:55                 ` Kanoj Sarcar
  1999-06-25  0:26                   ` Andrea Arcangeli
  0 siblings, 1 reply; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-24 23:55 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: torvalds, sct, linux-mm

> 
> On Mon, 21 Jun 1999, Kanoj Sarcar wrote:
> 
> >And continuing on with the problems with swapoff ...
> 
> I have not yet thought about the races you are talking about in the thread.
> 
> But I think I have seen another potential problem related to swapoff in the
> last few days. Consider running swapoff -a while a program is
> faulting in a swapin exception. The process is sleeping in
> read_swap_cache_async() after having increased the swap count (this is the
> only problem). While the task is sleeping, swapoff will swap in the page and
> map the swapped-in page into the pte of the process.
> Then swapoff continues and sees that the swap count is still >
> 0 (1 in the example) even though the page has been swapped in for all tasks in
> the system. Swapoff gets confused and sets the swap count to 0 by hand (and
> in doing so it slightly corrupts the state of the VM). I think I reproduced
> the above scenario stress testing 2.3.8 + my VM changes (finally "stable"
> except for the buffer beyond end of the device problem), but if the problem
> I have seen is real then it will apply to 2.2.x as well.
> 
> Andrea Arcangeli
> 

Andrea, 

The scenario that you lay out is not possible, as both Stephen and I
pointed out earlier in this thread. swapoff uses read_swap_cache,
so if a process has started a swapin, swapoff will wait for that io
to complete. Note that swapoff can not proceed until the read-in is
complete (at which point the swapcount is decremented by 
PG_swap_unlock_after logic). So, it is not possible for swapoff
to see swap count > 0. At least in theory ...

As to why you might be seeing the problem, this might be due
to fork/exit races with swapoff (which I pointed out in this thread), 
which I hope to have a fix for sometime soon (although it looks ugly). 
Also, see below.

Linus,

The swap lockmap deletion in 2.3.8 is not complete. I hope you will
be taking in Andrea's "shm pages in swapcache" changes (although I
haven't reviewed it, so I can't attest to its goodness). One problem
in 2.3.8 is that a shm page could be getting swapped out, and a swapoff
could actually read the contents of the swaphandle into a new page,
*before* the swapout completed (this was prevented in 2.3.7 in
rw_swap_page_base() by swap lockmap checking), since shm pages are 
not in the swap cache (thus swapoff would have no way of synchronizing
with the swapout completing). This could lead to shm data getting
corrupted. And also lead to swapoff manually setting swapcount to 0,
with shm swapout termination also decrementing swapcount.

Or maybe I am just confused ....

Kanoj
kanoj@engr.sgi.com

* Re: filecache/swapcache questions
  1999-06-24 23:55                 ` Kanoj Sarcar
@ 1999-06-25  0:26                   ` Andrea Arcangeli
  1999-06-28  1:48                     ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Kanoj Sarcar
  0 siblings, 1 reply; 60+ messages in thread
From: Andrea Arcangeli @ 1999-06-25  0:26 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: torvalds, sct, linux-mm

On Thu, 24 Jun 1999, Kanoj Sarcar wrote:

>The scenario that you lay out is not possible, as both Stephen and I
>pointed out earlier in this thread. swapoff uses read_swap_cache,
>so if a process has started a swapin, swapoff will wait for that io
>to complete. Note that swapoff can not proceed until the read-in is

Sorry, I forgot to specify where the faulting task was sleeping.

I wasn't talking about the case where the faulting task was sleeping on
I/O with the swap-cache page already allocated and hashed into the page cache. If
the task is sleeping waiting for I/O then I completely agree with you that
swapoff will block in lookup_swap_cache because it will see the swap-cache
page locked by the faulting task.

In my case the faulting task was sleeping in _GFP_ (maybe swapping out
some stuff in sync mode). And if you look at read_swap_cache_async you'll
notice that the task can go to sleep in GFP while holding an additional
reference into the swap space (see swap_duplicate). While the task was
sleeping, swapoff was allowed to allocate a new page in the meantime, then
to add that new page to the swap cache and start I/O on it, and
finally to remap the pte to the new page. Then swapoff continued,
noticing that there was an additional reference in the swap cache even though
nobody was mapping the swapped-out page anymore (the additional reference
belonged to the program sleeping in GFP).

>The swap lockmap deletion in 2.3.8 is not complete. I hope you will
>be taking in Andrea's "shm pages in swapcache" changes (although I

I'll send the shm patch to Linus in the next few days (but I bet nobody will
trigger the race in the meantime, also considering that database people
keep their shm memory unswappable).

Andrea Arcangeli


* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-25  0:26                   ` Andrea Arcangeli
@ 1999-06-28  1:48                     ` Kanoj Sarcar
  1999-06-28 10:35                       ` Andrea Arcangeli
                                         ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28  1:48 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: torvalds, sct, linux-mm, Kanoj Sarcar

Linus/Andrea/Stephen,

This is the patch that tries to cure the swapoff races with processes
forking, exiting, and (readahead) swapping by faulting. 

Basically, all these operations are synchronized by the process
mmap_sem. Unfortunately, swapoff has to visit all processes, during
which it must hold tasklist_lock, a spinlock. Hence, it can not take
the mmap_sem, a sleeping mutex. So, the patch links up all active
mm's in a list that swapoff can visit (with minor restructuring, 
kswapd can also use this, although it can not hold mmap_sem).
Addition/deletions to the list are protected by a sleeping 
mutex, hence swapoff can grab the individual mmap_sems, while
preventing changes to the list. Effectively, process creation
and destruction are locked out if swapoff is running.

To do this, the lock ordering is mm_sem -> mmap_sem. To 
prevent deadlocks, care must be taken that a process invoking
delete/insert_mmlist does not have its own mmap_sem held. For
this, the do_fork path needs to change so as not to acquire
mmap_sem early, rather only when it is really needed. This does
not open up a resource-ordering problem between kernel_lock and
mmap_sem, since the kernel_lock is a monitor lock that is released
at schedule time, so no deadlocks are possible.

I have just done basic sanity testing on this, I am hoping Andrea 
can run his swapoff stress tests to see whether this patch helps
cure the problem he was seeing.

Thanks.

Kanoj
kanoj@engr.sgi.com


--- /usr/tmp/p_rdiff_a009HP/exec.c	Sun Jun 27 16:51:58 1999
+++ fs/exec.c	Sun Jun 27 15:14:43 1999
@@ -399,6 +399,7 @@
 	up(&mm->mmap_sem);
 	mm_release();
 	mmput(old_mm);
+	insert_mmlist(mm);
 	return 0;
 
 	/*
--- /usr/tmp/p_rdiff_a009HP/sched.h	Sun Jun 27 16:52:01 1999
+++ include/linux/sched.h	Fri Jun 25 17:22:56 1999
@@ -170,6 +170,8 @@
 	atomic_t count;
 	int map_count;				/* number of VMAs */
 	struct semaphore mmap_sem;
+	struct mm_struct *prev;			/* list of allocated mms */
+	struct mm_struct *next;			/* list of allocated mms */
 	unsigned long context;
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
@@ -191,7 +193,7 @@
 		swapper_pg_dir, 			\
 		ATOMIC_INIT(1), 1,			\
 		__MUTEX_INITIALIZER(name.mmap_sem),	\
-		0,					\
+		&init_mm, &init_mm, 0,			\
 		0, 0, 0, 0,				\
 		0, 0, 0, 				\
 		0, 0, 0, 0,				\
@@ -611,6 +613,7 @@
 /*
  * Routines for handling mm_structs
  */
+extern struct semaphore mm_sem;
 extern struct mm_struct * mm_alloc(void);
 static inline void mmget(struct mm_struct * mm)
 {
@@ -619,6 +622,22 @@
 extern void mmput(struct mm_struct *);
 /* Remove the current tasks stale references to the old mm_struct */
 extern void mm_release(void);
+static inline void insert_mmlist(struct mm_struct * mm)
+{
+	down(&mm_sem);
+	mm->prev = &init_mm;
+	mm->next = init_mm.next;
+	init_mm.next->prev = mm;
+	init_mm.next = mm;
+	up(&mm_sem);
+}
+static inline void delete_mmlist(struct mm_struct * mm)
+{
+	down(&mm_sem);
+	mm->next->prev = mm->prev;
+	mm->prev->next = mm->next;
+	up(&mm_sem);
+}
 
 extern int  copy_thread(int, unsigned long, unsigned long, struct task_struct *, struct pt_regs *);
 extern void flush_thread(void);
--- /usr/tmp/p_rdiff_a009HP/fork.c	Sun Jun 27 16:52:04 1999
+++ kernel/fork.c	Sun Jun 27 15:28:34 1999
@@ -351,6 +351,7 @@
 		release_segments(mm);
 		exit_mmap(mm);
 		free_page_tables(mm);
+		delete_mmlist(mm);
 		kmem_cache_free(mm_cachep, mm);
 	}
 }
@@ -383,7 +384,11 @@
 	retval = new_page_tables(tsk);
 	if (retval)
 		goto free_mm;
+	insert_mmlist(mm);
+
+	down(&current->mm->mmap_sem);
 	retval = dup_mmap(mm);
+	up(&current->mm->mmap_sem);
 	if (retval)
 		goto free_pt;
 	up(&mm->mmap_sem);
@@ -549,7 +554,6 @@
 
 	*p = *current;
 
-	down(&current->mm->mmap_sem);
 	lock_kernel();
 
 	retval = -EAGAIN;
@@ -676,7 +680,6 @@
 	++total_forks;
 bad_fork:
 	unlock_kernel();
-	up(&current->mm->mmap_sem);
 fork_out:
 	if ((clone_flags & CLONE_VFORK) && (retval > 0)) 
 		down(&sem);
--- /usr/tmp/p_rdiff_a009HP/mmap.c	Sun Jun 27 16:52:07 1999
+++ mm/mmap.c	Sun Jun 27 15:20:08 1999
@@ -39,6 +39,8 @@
 /* SLAB cache for vm_area_struct's. */
 kmem_cache_t *vm_area_cachep;
 
+struct semaphore mm_sem;
+
 int sysctl_overcommit_memory;
 
 /* Check that a process has enough memory to allocate a
@@ -812,6 +814,7 @@
 {
 	struct vm_area_struct * mpnt;
 
+	down(&mm->mmap_sem);
 	mpnt = mm->mmap;
 	mm->mmap = mm->mmap_avl = mm->mmap_cache = NULL;
 	mm->rss = 0;
@@ -843,6 +846,7 @@
 		printk("exit_mmap: map count is %d\n", mm->map_count);
 
 	clear_page_tables(mm, 0, USER_PTRS_PER_PGD);
+	up(&mm->mmap_sem);
 }
 
 /* Insert vm structure into process list sorted by address
@@ -957,6 +961,7 @@
 
 void __init vma_init(void)
 {
+	init_MUTEX(&mm_sem);
 	vm_area_cachep = kmem_cache_create("vm_area_struct",
 					   sizeof(struct vm_area_struct),
 					   0, SLAB_HWCACHE_ALIGN,
--- /usr/tmp/p_rdiff_a009HP/page_alloc.c	Sun Jun 27 16:52:09 1999
+++ mm/page_alloc.c	Sun Jun 27 15:39:58 1999
@@ -385,10 +385,9 @@
 }
 
 /*
- * The tests may look silly, but it essentially makes sure that
- * no other process did a swap-in on us just as we were waiting.
+ * Concurrent swap-in via swapoff is interlocked out.
  *
- * Also, don't bother to add to the swap cache if this page-in
+ * Don't bother to add to the swap cache if this page-in
  * was due to a write access.
  */
 void swap_in(struct task_struct * tsk, struct vm_area_struct * vma,
@@ -400,11 +399,6 @@
 	if (!page_map) {
 		swapin_readahead(entry);
 		page_map = read_swap_cache(entry);
-	}
-	if (pte_val(*page_table) != entry) {
-		if (page_map)
-			free_page_and_swap_cache(page_address(page_map));
-		return;
 	}
 	if (!page_map) {
 		set_pte(page_table, BAD_PAGE);
--- /usr/tmp/p_rdiff_a009HP/swapfile.c	Sun Jun 27 16:52:12 1999
+++ mm/swapfile.c	Sun Jun 27 15:27:49 1999
@@ -259,20 +259,20 @@
 	}
 }
 
-static void unuse_process(struct mm_struct * mm, unsigned long entry, 
+static void unuse_mm(struct mm_struct * mm, unsigned long entry, 
 			unsigned long page)
 {
 	struct vm_area_struct* vma;
 
 	/*
-	 * Go through process' page directory.
+	 * Go through address space page directory.
 	 */
-	if (!mm || mm == &init_mm)
-		return;
+	down(&mm->mmap_sem);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		pgd_t * pgd = pgd_offset(mm, vma->vm_start);
 		unuse_vma(vma, pgd, entry, page);
 	}
+	up(&mm->mmap_sem);
 	return;
 }
 
@@ -283,8 +283,8 @@
  */
 static int try_to_unuse(unsigned int type)
 {
+	struct mm_struct * mm;
 	struct swap_info_struct * si = &swap_info[type];
-	struct task_struct *p;
 	struct page *page_map;
 	unsigned long entry, page;
 	int i;
@@ -316,10 +316,13 @@
   			return -ENOMEM;
 		}
 		page = page_address(page_map);
-		read_lock(&tasklist_lock);
-		for_each_task(p)
-			unuse_process(p->mm, entry, page);
-		read_unlock(&tasklist_lock);
+		down(&mm_sem);
+		mm = init_mm.next;
+		while (mm != &init_mm) {
+			unuse_mm(mm, entry, page);
+			mm = mm->next;
+		}
+		up(&mm_sem);
 		shm_unuse(entry, page);
 		/* Now get rid of the extra reference to the temporary
                    page we've been using. */
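
The list handling in the patch above can be modelled in user space. This is a minimal sketch, assuming a circular doubly linked list with a sentinel playing the role of init_mm; the names (toy_mm, toy_init_mm, toy_insert_mmlist, toy_delete_mmlist, toy_visit_all) are hypothetical. Locking is left out: in the patch, insert and delete run under mm_sem, and the walk holds mm_sem while taking each mmap_sem in turn.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mm_struct: only the list links. */
struct toy_mm {
    struct toy_mm *prev;
    struct toy_mm *next;
    int visited;            /* how many times a walk has seen this entry */
};

/* Sentinel in the role of init_mm: an empty list points at itself. */
struct toy_mm toy_init_mm = { &toy_init_mm, &toy_init_mm, 0 };

/* Insert right after the sentinel (done under mm_sem in the patch). */
void toy_insert_mmlist(struct toy_mm *mm)
{
    assert(mm != NULL && mm != &toy_init_mm);
    mm->prev = &toy_init_mm;
    mm->next = toy_init_mm.next;
    toy_init_mm.next->prev = mm;
    toy_init_mm.next = mm;
}

/* Unlink an entry; its neighbours are joined to each other. */
void toy_delete_mmlist(struct toy_mm *mm)
{
    assert(mm != NULL && mm != &toy_init_mm);
    mm->next->prev = mm->prev;
    mm->prev->next = mm->next;
    mm->prev = mm->next = mm;   /* leave the removed entry self-linked */
}

/* Visit every entry once and return how many were seen. */
int toy_visit_all(void)
{
    int n = 0;
    struct toy_mm *mm = toy_init_mm.next;

    while (mm != &toy_init_mm) {
        mm->visited++;
        n++;
        mm = mm->next;      /* advance, or the walk never terminates */
    }
    return n;
}
```

Because the sentinel is never removed, insert and delete need no special case for an empty list, and the walk stops when it arrives back at the sentinel.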

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28  1:48                     ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Kanoj Sarcar
@ 1999-06-28 10:35                       ` Andrea Arcangeli
  1999-06-28 17:11                         ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar
  1999-06-28 16:32                       ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie
  1999-06-28 19:39                       ` Chuck Lever
  2 siblings, 1 reply; 60+ messages in thread
From: Andrea Arcangeli @ 1999-06-28 10:35 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: torvalds, sct, linux-mm

On Sun, 27 Jun 1999, Kanoj Sarcar wrote:

>This is the patch that tries to cure the swapoff races with processes
>forking, exiting, and (readahead) swapping by faulting. 

For the record: at least the read_swap_cache_async race I pointed out can
be fixed without grabbing the mmap semaphore. I agree that grabbing the
semaphore would fix the race though.

Here is the alternate fix:

Index: mm/swap_state.c
===================================================================
RCS file: /var/cvs/linux/mm/swap_state.c,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 swap_state.c
--- mm/swap_state.c	1999/06/14 15:30:09	1.1.1.3
+++ mm/swap_state.c	1999/06/28 10:15:15
@@ -125,7 +125,7 @@
 		"swap_duplicate: entry %08lx, offset exceeds max\n", entry);
 	goto out;
 bad_unused:
-	printk(KERN_ERR
+	printk(KERN_WARNING
 		"swap_duplicate at %8p: entry %08lx, unused page\n", 
 	       __builtin_return_address(0), entry);
 	goto out;
@@ -291,20 +291,15 @@
 	       entry, wait ? ", wait" : "");
 #endif
 	/*
-	 * Make sure the swap entry is still in use.
-	 */
-	if (!swap_duplicate(entry))	/* Account for the swap cache */
-		goto out;
-	/*
 	 * Look for the page in the swap cache.
 	 */
 	found_page = lookup_swap_cache(entry);
 	if (found_page)
-		goto out_free_swap;
+		goto out;
 
 	new_page_addr = __get_free_page(GFP_USER);
 	if (!new_page_addr)
-		goto out_free_swap;	/* Out of memory */
+		goto out;	/* Out of memory */
 	new_page = mem_map + MAP_NR(new_page_addr);
 
 	/*
@@ -313,6 +308,11 @@
 	found_page = lookup_swap_cache(entry);
 	if (found_page)
 		goto out_free_page;
+	/*
+	 * Make sure the swap entry is still in use.
+	 */
+	if (!swap_duplicate(entry))	/* Account for the swap cache */
+		goto out_free_page;
 	/* 
 	 * Add it to the swap cache and read its contents.
 	 */
@@ -330,8 +330,6 @@
 
 out_free_page:
 	__free_page(new_page);
-out_free_swap:
-	swap_free(entry);
 out:
 	return found_page;
 }



NOTE: this will cause swap_duplicate to generate some warning messages, but
everything will work fine, because the swapin code already checks
whether the pte has changed (swapped in by swapoff) before looking at whether
read_swap_cache returned a NULL pointer (the shm.c swap-cache code also
checks whether the pte has changed before going OOM).

But probably the right thing to do is to grab the mm semaphore in swapoff
as you did, since we don't risk a deadlock there :).

Comments?

Andrea


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 10:35                       ` Andrea Arcangeli
@ 1999-06-28 17:11                         ` Kanoj Sarcar
  0 siblings, 0 replies; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 17:11 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: torvalds, sct, linux-mm

> Here is the alternate fix:
> 
> Index: mm/swap_state.c
> ===================================================================
> RCS file: /var/cvs/linux/mm/swap_state.c,v
> retrieving revision 1.1.1.3
> diff -u -r1.1.1.3 swap_state.c
> --- mm/swap_state.c	1999/06/14 15:30:09	1.1.1.3
> +++ mm/swap_state.c	1999/06/28 10:15:15
> @@ -125,7 +125,7 @@
>  		"swap_duplicate: entry %08lx, offset exceeds max\n", entry);
>  	goto out;
>  bad_unused:
> -	printk(KERN_ERR
> +	printk(KERN_WARNING
>  		"swap_duplicate at %8p: entry %08lx, unused page\n", 
>  	       __builtin_return_address(0), entry);
>  	goto out;
> @@ -291,20 +291,15 @@
>  	       entry, wait ? ", wait" : "");
>  #endif
>  	/*
> -	 * Make sure the swap entry is still in use.
> -	 */
> -	if (!swap_duplicate(entry))	/* Account for the swap cache */
> -		goto out;
> -	/*
>  	 * Look for the page in the swap cache.
>  	 */
>  	found_page = lookup_swap_cache(entry);
>  	if (found_page)
> -		goto out_free_swap;
> +		goto out;
>  
>  	new_page_addr = __get_free_page(GFP_USER);
>  	if (!new_page_addr)
> -		goto out_free_swap;	/* Out of memory */
> +		goto out;	/* Out of memory */
>  	new_page = mem_map + MAP_NR(new_page_addr);
>  
>  	/*
> @@ -313,6 +308,11 @@
>  	found_page = lookup_swap_cache(entry);
>  	if (found_page)
>  		goto out_free_page;
> +	/*
> +	 * Make sure the swap entry is still in use.
> +	 */
> +	if (!swap_duplicate(entry))	/* Account for the swap cache */
> +		goto out_free_page;
>  	/* 
>  	 * Add it to the swap cache and read its contents.
>  	 */
> @@ -330,8 +330,6 @@
>  
>  out_free_page:
>  	__free_page(new_page);
> -out_free_swap:
> -	swap_free(entry);
>  out:
>  	return found_page;
>  }
> 
> 
> 
> NOTE: this will cause swap_duplicate to generate some warning message but

Or not, depending on whether the swap id has already been allocated 
to a newly added swap device. In that case, the worst we will do is
read ahead some unneeded swap pages. Not too bad ... I thought 
about this solution, and figured that a better idea is probably to
give up on read-ahead in swapin_readahead() if the faulting pte
has already been swapped in.

Anyway, it seems that to prevent fork/exit races, grabbing mmap_sem is
needed in the swapoff path. Best to use that synchronization, since
the fault path also grabs mmap_sem.

Thanks.

Kanoj

> everything will still work fine, precisely because the swapin code checks
> whether the pte has changed (swapped in by swapoff) before looking at whether
> read_swap_cache returned a NULL pointer. (The shm.c swap-cache code also
> checks whether the pte has changed before going oom.)
> 
> But probably the right thing to do is to grab the mm semaphore in swapoff
> as you did, since we don't risk a deadlock there :).
> 
> Comments?
> 
> Andrea
> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28  1:48                     ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Kanoj Sarcar
  1999-06-28 10:35                       ` Andrea Arcangeli
@ 1999-06-28 16:32                       ` Stephen C. Tweedie
  1999-06-28 17:25                         ` Kanoj Sarcar
  1999-06-28 19:39                       ` Chuck Lever
  2 siblings, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-28 16:32 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Andrea Arcangeli, torvalds, sct, linux-mm

Hi,

On Sun, 27 Jun 1999 18:48:47 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> Linus/Andrea/Stephen,
> This is the patch that tries to cure the swapoff races with processes
> forking, exiting, and (readahead) swapping by faulting. 

> Basically, all these operations are synchronized by the process
> mmap_sem. Unfortunately, swapoff has to visit all processes, during
> which it must hold tasklist_lock, a spinlock. Hence, it can not take
> the mmap_sem, a sleeping mutex. 

But it can atomic_inc(&mm->count) to pin the mm, drop the task lock and
take the mm semaphore, and mmput() once it has finished.
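
A rough userspace sketch of that ordering, for readers following along. Everything here is a model and an assumption, not kernel code: `struct mm_model`, `visit_mm_pinned()` and the boolean "locks" are hypothetical stand-ins for mm->count, the task list spinlock and the mm semaphore.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's mm_struct. */
struct mm_model {
	atomic_int count;	/* reference count, playing the role of mm->count */
	bool mmap_sem_held;	/* models the sleeping mm semaphore */
};

/*
 * Visit one mm from the swapoff path.  On entry the (modelled) task list
 * spinlock must be held.  Returns the reference count seen while the mm
 * was pinned, or -1 if the caller did not hold the task list lock.
 */
int visit_mm_pinned(struct mm_model *mm, bool *tasklist_locked)
{
	int pinned_count;

	if (!*tasklist_locked)
		return -1;

	/* 1. Pin the mm so it cannot be freed once the lock is dropped. */
	pinned_count = atomic_fetch_add(&mm->count, 1) + 1;

	/* 2. Drop the spinlock; only now is it legal to sleep. */
	*tasklist_locked = false;

	/* 3. Take the sleeping semaphore and do the per-mm swapoff work. */
	mm->mmap_sem_held = true;
	/* ... scan this mm's page tables for entries on the device ... */
	mm->mmap_sem_held = false;

	/* 4. Drop the pin; the kernel would call mmput() here. */
	atomic_fetch_sub(&mm->count, 1);

	return pinned_count;
}
```

The point of the ordering is that the reference is taken before the spinlock is released, so the mm stays valid across the sleep even if its owner exits in the meantime.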

> So, the patch links up all active mm's in a list that swapoff can
> visit

There shouldn't be any need for a new data structure.  A bit of extra work
in swapoff should be all that is needed, and that avoids adding any
extra code at all on the hot paths.

Adding extra locks is the sort of thing other unixes do to solve
problems like this: we don't want to fall into that trap on Linux. :)

--Stephen

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 16:32                       ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie
@ 1999-06-28 17:25                         ` Kanoj Sarcar
  1999-06-28 20:40                           ` Stephen C. Tweedie
  0 siblings, 1 reply; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 17:25 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: andrea, torvalds, linux-mm

> 
> Hi,
> 
> On Sun, 27 Jun 1999 18:48:47 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
> 
> > Linus/Andrea/Stephen,
> > This is the patch that tries to cure the swapoff races with processes
> > forking, exiting, and (readahead) swapping by faulting. 
> 
> > Basically, all these operations are synchronized by the process
> > mmap_sem. Unfortunately, swapoff has to visit all processes, during
> > which it must hold tasklist_lock, a spinlock. Hence, it can not take
> > the mmap_sem, a sleeping mutex. 
> 
> But it can atomic_inc(&mm->count) to pin the mm, drop the task lock and
> take the mm semaphore, and mmput() once it has finished.
>

Hmm, hadn't thought about that one. Of course, as soon as you drop 
the task_lock, in theory, you have to resume your search from the
beginning of the task list, since the list might have changed while
you dropped the task_lock (assume for a moment that the vm code does
not know how the task list is managed). That prevents any forward
progress by swapoff. 

I did think of other ways to maintain a hold on the process,
preventing it from forking or exiting, but my judgement was that they
would be more heavyweight than my current solution.

> > So, the patch links up all active mm's in a list that swapoff can
> > visit
> 
> There shouldn't be need for a new data structure.  A bit of extra work
> in swapoff should be all that is needed, and that avoids adding any
> extra code at all on the hot paths.
> 
> Adding extra locks is the sort of thing other unixes do to solve
> problems like this: we don't want to fall into that trap on Linux. :)
> 

Agreed ... if you can come up with a reasonably simple and lightweight
solution without using locks.

Thanks.

Kanoj
kanoj@engr.sgi.com
> --Stephen
> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 17:25                         ` Kanoj Sarcar
@ 1999-06-28 20:40                           ` Stephen C. Tweedie
  1999-06-28 21:11                             ` Kanoj Sarcar
  0 siblings, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-28 20:40 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, andrea, torvalds, linux-mm

Hi,

On Mon, 28 Jun 1999 10:25:45 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

>> But it can atomic_inc(&mm->count) to pin the mm, drop the task lock and
>> take the mm semaphore, and mmput() once it has finished.

> Hmm, hadn't thought about that one. Of course, as soon as you drop 
> the task_lock, in theory, you have to resume your search from the
> beginning of the task list, since the list might have changed while
> you dropped the task_lock (assume for a moment that the vm code does
> not know how the task list is managed). That prevents any forward
> progress by swapoff. 

Then keep a fencepost of the highest pid you have completed so far,
and with the lock held, look for the lowest pid greater than that
one.  If you don't make any progress on the mm, bump up the fencepost
pid by one.

It will work.  It's a little extra overhead, but it confines all of
the cost to the swapoff path.  The pid scan isn't going to be nearly
as expensive as the rest of the vm scanning we are already forced to
do in swapoff.
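
The fencepost scan described above might look roughly like the following. This is only a sketch under stated assumptions: `struct task`, `next_pid_after()` and `sweep_once()` are hypothetical names, the task list is modelled as a plain array, and the per-mm work is left as a comment.

```c
#include <stddef.h>

/* Hypothetical stand-in for one entry on the kernel's task list. */
struct task {
	int pid;
};

/*
 * Return the lowest pid strictly greater than `fence`, or -1 if there is
 * none.  In the kernel this scan would run with the task list lock held.
 */
int next_pid_after(const struct task *tasks, size_t n, int fence)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		int pid = tasks[i].pid;

		if (pid > fence && (best == -1 || pid < best))
			best = pid;
	}
	return best;
}

/*
 * One sweep over the task list in ascending pid order.  Each pid visited
 * is recorded in visited[] (up to `cap` entries); the return value is the
 * number of tasks visited.
 */
size_t sweep_once(const struct task *tasks, size_t n, int *visited, size_t cap)
{
	int fence = -1;		/* nothing completed yet */
	size_t count = 0;

	for (;;) {
		int pid = next_pid_after(tasks, n, fence);

		if (pid < 0)
			break;
		/* Here: pin the mm, drop the lock, do the work, unpin. */
		if (count < cap)
			visited[count] = pid;
		count++;
		fence = pid;	/* the task list may change before the next scan */
	}
	return count;
}
```

Because the fence always advances to the pid just handled, the scan never needs to remember a position in the list itself, which is what allows the lock to be dropped between tasks.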

--Stephen



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 20:40                           ` Stephen C. Tweedie
@ 1999-06-28 21:11                             ` Kanoj Sarcar
  1999-06-28 22:12                               ` Stephen C. Tweedie
  0 siblings, 1 reply; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 21:11 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: andrea, torvalds, linux-mm

> 
> Hi,
> 
> On Mon, 28 Jun 1999 10:25:45 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
> 
> >> But it can atomic_inc(&mm->count) to pin the mm, drop the task lock and
> >> take the mm semaphore, and mmput() once it has finished.
> 
> > Hmm, hadn't thought about that one. Of course, as soon as you drop 
> > the task_lock, in theory, you have to resume your search from the
> > beginning of the task list, since the list might have changed while
> > you dropped the task_lock (assume for a moment that the vm code does
> > not know how the task list is managed). That prevents any forward
> > progress by swapoff. 
> 
> Then keep a fencepost of the highest pid you have completed so far,
> and with the lock held, look for the lowest pid greater than that
> one.  If you don't make any progress on the mm, bump up the fencepost
> pid by one.

If I understand right, here is an example. Let's say I believe I 
have scanned up to pid 10. You are suggesting that, after having scanned
pid 10, I hold on to task_lock and look for the min pid > 10. Say
that is pid 12. The problem is, while I was scanning pid 10, maybe
pid 5 got reallocated, and pid 5 is now a new process (probably a 
child of pid 20). Note that, as I mentioned, it is good design for
the vm code not to assume how the task list is managed or how pids
are allocated (yes, I have thought of having a swapoff generation 
number stored in each task structure too ...)

> 
> It will work.  It's a little extra overhead, but it confines all of
> the cost to the swapoff path.  The pid scan isn't going to be nearly
> as expensive as the rest of the vm scanning we are already forced to
> do in swapoff.

I would love to confine the complexity in the swapoff path, except
I can't come up with a solution. In any case, I think I was not 
clear about what the cost is in my fix. It is adding 2 chain fields
in the mm structure, adding and deleting to this chain at mm alloc/free
time, and the up/down cost on the mutex. Note that the up/down cost
is minimal (one atomic inc/dec) when no swapoff is going on, since the
kernel_lock also protects the chain. The mutex only becomes contended
when there is a swapoff in progress. 
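
For concreteness, the "2 chain fields" and the insert/delete at mm alloc/free time could be modelled as below. This is a userspace sketch, not the patch itself: `struct mm_entry` and the `mmlist_*` helpers are hypothetical names, and the mutex that would guard the list is omitted.

```c
#include <stddef.h>

/* Hypothetical model of an mm carrying the two extra chain fields. */
struct mm_entry {
	struct mm_entry *next;
	struct mm_entry *prev;
};

/* Initialise an empty circular list: the head points at itself. */
void mmlist_init(struct mm_entry *head)
{
	head->next = head;
	head->prev = head;
}

/* Link a newly allocated mm in right after the head. */
void mmlist_insert(struct mm_entry *head, struct mm_entry *mm)
{
	mm->next = head->next;
	mm->prev = head;
	head->next->prev = mm;
	head->next = mm;
}

/* Unlink an mm that is about to be freed. */
void mmlist_delete(struct mm_entry *mm)
{
	mm->prev->next = mm->next;
	mm->next->prev = mm->prev;
	mm->next = mm;
	mm->prev = mm;
}

/* Walk the list the way a swapoff-style scan would; returns its length. */
size_t mmlist_count(const struct mm_entry *head)
{
	const struct mm_entry *p;
	size_t n = 0;

	for (p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

Insert and delete are a handful of pointer writes each, so in this model the per-mm cost outside of swapoff is dominated by whatever lock protects the list.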

Thanks.

Kanoj
kanoj@engr.sgi.com

Ps - All this discussion does not seem to be making it on to the
linux-mm web page ...

> 
> --Stephen
> 
> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 21:11                             ` Kanoj Sarcar
@ 1999-06-28 22:12                               ` Stephen C. Tweedie
  1999-06-28 23:43                                 ` Kanoj Sarcar
  0 siblings, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-28 22:12 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, andrea, torvalds, linux-mm

Hi,

On Mon, 28 Jun 1999 14:11:18 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> If I understand right, here is an example. Let's say I believe I 
> have scanned up to pid 10. You are suggesting, after having scanned
> pid 10, hold on to task_lock, and look for the min pid > 10. Say
> that is pid 12. Problem is, while I was scanning pid 10, maybe
> pid 5 got reallocated, and pid 5 is a new process (probably a 
> child of pid 20). 

Fine --- repeat the whole thing until we have no swap entries left.  We
can still guarantee to make progress without extra locking for normal
swapping. 

>> It will work.  It's a little extra overhead, but it confines all of
>> the cost to the swapoff path.  The pid scan isn't going to be nearly
>> as expensive as the rest of the vm scanning we are already forced to
>> do in swapoff.

> I would love to confine the complexity in the swapoff path, except
> I can't come up with a solution. In any case, I think I was not 
> clear about what the cost is in my fix. It is adding 2 chain fields
> in the mm structure, adding and deleting to this chain at mm alloc/free
> time, and the up/down cost on the mutex. 

But it's not necessary.  Other OSes may add a lock here, a lock there
every time it happens to make a non-performance-critical path easier,
but in the long term that sort of thinking just bloats the fast paths.

> Note that the up/down cost is minimal (one atomic inc/dec) when no
> swapoff is going on

On SMP, the cache traffic produced by such locks is not minimal.  You
can measure the performance hit of every single cache miss that results.

--Stephen

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 22:12                               ` Stephen C. Tweedie
@ 1999-06-28 23:43                                 ` Kanoj Sarcar
  1999-06-29 11:44                                   ` Stephen C. Tweedie
  0 siblings, 1 reply; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 23:43 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: andrea, torvalds, linux-mm

> 
> Hi,
> 
> On Mon, 28 Jun 1999 14:11:18 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
> 
> > If I understand right, here is an example. Let's say I believe I 
> > have scanned up to pid 10. You are suggesting, after having scanned
> > pid 10, hold on to task_lock, and look for the min pid > 10. Say
> > that is pid 12. Problem is, while I was scanning pid 10, maybe
> > pid 5 got reallocated, and pid 5 is a new process (probably a 
> > child of pid 20). 
> 
> Fine --- repeat the whole thing until we have no swap entries left.  We
> can still guarantee to make progress without extra locking for normal
> swapping. 
>

This will almost always work, except theoretically, you still can
not guarantee forward progress, unless you can stop forks() from
happening. That is, given a high enough rate of forking, swapoff
is never going to terminate. 

Kanoj

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 23:43                                 ` Kanoj Sarcar
@ 1999-06-29 11:44                                   ` Stephen C. Tweedie
  1999-06-29 22:01                                     ` Kanoj Sarcar
  0 siblings, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-29 11:44 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, andrea, torvalds, linux-mm

Hi,

On Mon, 28 Jun 1999 16:43:59 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> This will almost always work, except theoretically, you still can
> not guarantee forward progress, unless you can stop forks() from
> happening. That is, given a high enough rate of forking, swapoff
> is never going to terminate. 

Then repeat until it converges, ie. until you have no swap entries left.
No big deal.  Unless the swapoff sweep and the fork are running over pid
space at exactly the same rate forever (which we do not have to worry
about!), you will make progress.

--Stephen

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-29 11:44                                   ` Stephen C. Tweedie
@ 1999-06-29 22:01                                     ` Kanoj Sarcar
  1999-06-30 17:28                                       ` Stephen C. Tweedie
  0 siblings, 1 reply; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-29 22:01 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: andrea, torvalds, linux-mm

> 
> Hi,
> 
> On Mon, 28 Jun 1999 16:43:59 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
> 
> > This will almost always work, except theoretically, you still can
> > not guarantee forward progress, unless you can stop forks() from
> > happening. That is, given a high enough rate of forking, swapoff
> > is never going to terminate. 
> 
> Then repeat until it converges, ie. until you have no swap entries left.
> No big deal.  Unless the swapoff sweep and the fork are running over pid
> space at exactly the same rate forever (which we do not have to worry
> about!), you will make progress.
>

Stephen,

Seeing that both of us devoted so much time to discussing this,
I felt compelled to look at what is involved in doing what you 
are suggesting. 

To know whether there are any more references left to be eliminated
on a swap page, we can not tolerate a SWAP_MAP_MAX concept; else we
can never determine whether there are processes still referencing the
swap page. Removing SWAP_MAP_MAX is a good thing in itself. The 
swap_map[] array needs to be declared as an array of elements of the 
same size as the page->count field, ie an atomic_t (since there can be
no more references to the swap page than there can be on the physical
page).
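
To illustrate why a capped count gets in the way, here is a toy model. It assumes that "a SWAP_MAP_MAX concept" means a count that sticks once it reaches the cap; the cap value of 3 and the `model_swap_*` names are made up for the example and are not the kernel's.

```c
/* Hypothetical cap, kept tiny so the effect is easy to see. */
#define MODEL_SWAP_MAP_MAX 3

/* Take one more reference; the count sticks at the cap. */
unsigned short model_swap_dup(unsigned short count)
{
	if (count >= MODEL_SWAP_MAP_MAX)
		return MODEL_SWAP_MAP_MAX;
	return (unsigned short)(count + 1);
}

/*
 * Drop one reference.  A count sitting at the cap no longer says how many
 * references really exist, so it cannot safely be decremented.
 */
unsigned short model_swap_free(unsigned short count)
{
	if (count == 0)
		return 0;
	if (count >= MODEL_SWAP_MAP_MAX)
		return MODEL_SWAP_MAP_MAX;
	return (unsigned short)(count - 1);
}
```

In this model, five dups followed by five frees leave the count stuck at the cap, so a loop that waits for the count to reach zero would never finish for that entry.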

Also, I am not sure why you say that fork can not keep ahead of
the swapoff sweep forever. Are you saying it is okay not to guarantee
forward progress of swapoff while a program that keeps on forking 
(and the children exit almost immediately) is running? Then there's
the complexity of clone(CLONE_PID), which creates task structures 
with the same pid, so the pid fencepost algorithm would need to
handle that too ...

Let me know what you think of these two issues, then I can try
to create a patch that does this ... 

Kanoj
kanoj@engr.sgi.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-29 22:01                                     ` Kanoj Sarcar
@ 1999-06-30 17:28                                       ` Stephen C. Tweedie
  1999-06-30 18:05                                         ` Kanoj Sarcar
  0 siblings, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-30 17:28 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, andrea, torvalds, linux-mm

Hi,

On Tue, 29 Jun 1999 15:01:24 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> To know whether there are any more references left to be eliminated
> on a swap page, we can not tolerate a SWAP_MAP_MAX concept; else we
> can never determine whether there are processes still referencing the
> swap page. Removing SWAP_MAP_MAX is a good thing in itself. The 
> swap_map[] array needs to be declared as an array of elements of the 
> same size as the page->count field, ie an atomic_t (since there can be
> no more references to the swap page than there can be on the physical
> page).

Yes there can...

> Also, I am not sure why you say that fork can not keep ahead of
> the swapoff sweep forever. 

Hmm, maybe..

> Are you saying it is okay not to guarantee forward progress of swapoff
> while a program that keeps on forking (and the children exit almost
> immediately) is running? 

There are a lot of things which don't make forward progress in such a
situation already.  Put a lock on dup_mm() if it worries you that much.

> Then there's the complexity of clone(CLONE_PID), which creates task
> structures with the same pid, so the pid fencepost algorithm would
> need to handle that too ...

Sure.  I never said that I had a complete solution: I just don't believe
that a new mm lock on all the faulting paths is necessary for a complete
solution.  

--Stephen

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-30 17:28                                       ` Stephen C. Tweedie
@ 1999-06-30 18:05                                         ` Kanoj Sarcar
  0 siblings, 0 replies; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-30 18:05 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: andrea, torvalds, linux-mm

> 
> Hi,
> 
> On Tue, 29 Jun 1999 15:01:24 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
> 
> > To know whether there are any more references left to be eliminated
> > on a swap page, we can not tolerate a SWAP_MAP_MAX concept; else we
> > can never determine whether there are processes still referencing the
> > swap page. Removing SWAP_MAP_MAX is a good thing in itself. The 
> > swap_map[] array needs to be declared as an array of elements of the 
> > same size as the page->count field, ie an atomic_t (since there can be
> > no more references to the swap page than there can be on the physical
> > page).
> 
> Yes there can...

I don't know how, but if this is true, and we do not have a
theoretical upper bound on the swap_count, then we will have to 
preserve SWAP_MAP_MAX ... which will render your proposal 
unachievable ...

> 
> > Also, I am not sure why you say that fork can not keep ahead of
> > the swapoff sweep forever. 
> 
> Hmm, maybe..
> 
> > Are you saying it is okay not to guarantee forward progress of swapoff
> > while a program that keeps on forking (and the children exit almost
> > immediately) is running? 
> 
> There are a lot of things which don't make forward progress in such a
> situation already.  Put a lock on dup_mm() if it worries you that much.

That's basically what my solution does ... adds in a lock point
in copy_mm.

> 
> > Then there's the complexity of clone(CLONE_PID), which creates task
> > structures with the same pid, so the pid fencepost algorithm would
> > need to handle that too ...
> 
> Sure.  I never said that I had a complete solution: I just don't believe
> that a new mm lock on all the faulting paths is necessary for a complete
> solution.  

Hmmm, did you look at my solution in detail ... no locks are taken
on the page fault paths, other than mmap_sem, which the current code
already takes ...

Thanks.

Kanoj
kanoj@engr.sgi.com
> 
> --Stephen
> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28  1:48                     ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Kanoj Sarcar
  1999-06-28 10:35                       ` Andrea Arcangeli
  1999-06-28 16:32                       ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie
@ 1999-06-28 19:39                       ` Chuck Lever
  1999-06-28 19:55                         ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar
  1999-06-28 20:45                         ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie
  2 siblings, 2 replies; 60+ messages in thread
From: Chuck Lever @ 1999-06-28 19:39 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Andrea Arcangeli, torvalds, sct, linux-mm

On Sun, 27 Jun 1999, Kanoj Sarcar wrote:
> Basically, all these operations are synchronized by the process
> mmap_sem. Unfortunately, swapoff has to visit all processes, during
> which it must hold tasklist_lock, a spinlock. Hence, it can not take
> the mmap_sem, a sleeping mutex. So, the patch links up all active
> mm's in a list that swapoff can visit (with minor restructuring, 
> kswapd can also use this, although it can not hold mmap_sem).
> Addition/deletions to the list are protected by a sleeping 
> mutex, hence swapoff can grab the individual mmap_sems, while
> preventing changes to the list. Effectively, process creation
> and destruction are locked out if swapoff is running.
> 
> To do this, the lock ordering is mm_sem -> mmap_sem. To 
> prevent deadlocks, care must be taken that a process invoking
> delete/insert_mmlist does not have its own mmap_sem held. For
> this, the do_fork path needs to change so as not to acquire
> mmap_sem early, rather only when it is really needed. This does
> not open up a resource-ordering problem between kernel_lock and
> mmap_sem, since the kernel_lock is a monitor lock that is released
> at schedule time, so no deadlocks are possible.

i'm already working on a patch that will allow kswapd to grab the mmap_sem
for the task that is about to be swapped.  this takes a slightly different
approach, since i'm focusing on kswapd and not on swapoff.  essentially
the patch does two things:

1)  it separates the logic of try_to_free_pages() and kswapd.  kswapd now
does the swapping, while try_to_free_pages() only does the shrink_mmap()
phase.

2)  after kswapd has chosen a process to swap, it drops the kernel lock
and grabs the mmap_sem for the thing it's about to swap.  it picks up the
kernel lock at appropriate points lower in the code.

i think it simplifies things a lot; there is no longer a concern about a
process deadlocking when re-acquiring its own semaphore.  and, swapping
and page-fault handling for a given object can be serialized via the
object's mmap_sem.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 19:39                       ` Chuck Lever
@ 1999-06-28 19:55                         ` Kanoj Sarcar
  1999-06-28 20:33                           ` Chuck Lever
  1999-06-28 22:09                           ` Stephen C. Tweedie
  1999-06-28 20:45                         ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie
  1 sibling, 2 replies; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 19:55 UTC (permalink / raw)
  To: Chuck Lever; +Cc: andrea, torvalds, sct, linux-mm

> 
> i'm already working on a patch that will allow kswapd to grab the mmap_sem
> for the task that is about to be swapped.  this takes a slightly different
> approach, since i'm focusing on kswapd and not on swapoff.  essentially
> the patch does two things:

So, I would think some (if not mine) swapoff fix is still needed ...

> 
> 1)  it separates the logic of try_to_free_pages() and kswapd.  kswapd now
> does the swapping, while try_to_free_pages() only does the shrink_mmap()
> phase.
> 
> 2)  after kswapd has chosen a process to swap, it drops the kernel lock
> and grabs the mmap_sem for the thing it's about to swap.  it picks up the
> kernel lock at appropriate points lower in the code.
>

Agreed this would be a nice thing to be able to do ... 
Other than the deadlock problem, there's another issue involved, I 
think. Processes can go to sleep (inside drivers/fs for example while
mmaping/munmaping/faulting) holding their mmap_sem, so any solution 
should be able to guarantee that (at least one of) the memory free'ers 
do not go to sleep indefinitely (or for some time that is up to driver/fs
code to determine).

Kanoj
kanoj@engr.sgi.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 19:55                         ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar
@ 1999-06-28 20:33                           ` Chuck Lever
  1999-06-28 20:51                             ` Kanoj Sarcar
  1999-06-28 22:09                           ` Stephen C. Tweedie
  1 sibling, 1 reply; 60+ messages in thread
From: Chuck Lever @ 1999-06-28 20:33 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: andrea, torvalds, sct, linux-mm

On Mon, 28 Jun 1999, Kanoj Sarcar wrote:
> > i'm already working on a patch that will allow kswapd to grab the mmap_sem
> > for the task that is about to be swapped.  this takes a slightly different
> > approach, since i'm focusing on kswapd and not on swapoff.  essentially
> > the patch does two things:
> 
> So, I would think some (if not mine) swapoff fix is still needed ...

oh absolutely!  i was thinking that my patch might help make your work
simpler, that's all.  once i've tested it a little more, i'll post it to
the list.

> Other than the deadlock problem, there's another issue involved, I 
> think. Processes can go to sleep (inside drivers/fs for example while
> mmaping/munmaping/faulting) holding their mmap_sem, so any solution 
> should be able to guarantee that (at least one of) the memory free'ers 
> do not go to sleep indefinitely (or for some time that is upto driver/fs
> code to determine).

or perhaps the kernel could start more than one kswapd (one per swap
partition?).  with my patch, regular processes never wait for swap out
I/O, only kswapd does.

if you're concerned about bounding the latency of VM operations in order
to provide some RT guarantees, then i'd imagine, based on what i've read
on this list, that Linus might want to keep things simple more than he'd
want to clutter the memory freeing logic... but if there's a simple way to
"guarantee" a low latency then it would be worth the trouble.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 20:33                           ` Chuck Lever
@ 1999-06-28 20:51                             ` Kanoj Sarcar
  1999-06-28 21:32                               ` Chuck Lever
  1999-06-28 22:08                               ` Stephen C. Tweedie
  0 siblings, 2 replies; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 20:51 UTC (permalink / raw)
  To: Chuck Lever; +Cc: andrea, torvalds, sct, linux-mm

> 
> > Other than the deadlock problem, there's another issue involved, I 
> > think. Processes can go to sleep (inside drivers/fs for example while
> > mmaping/munmaping/faulting) holding their mmap_sem, so any solution 
> > should be able to guarantee that (at least one of) the memory free'ers 
> > do not go to sleep indefinitely (or for some time that is upto driver/fs
> > code to determine).
> 
> or perhaps the kernel could start more than one kswapd (one per swap
> partition?).  with my patch, regular processes never wait for swap out
> I/O, only kswapd does.
> 
> if you're concerned about bounding the latency of VM operations in order
> to provide some RT guarantees, then i'd imagine, based on what i've read
> on this list, that Linus might want to keep things simple more than he'd
> want to clutter the memory freeing logic... but if there's a simple way to
> "guarantee" a low latency then it would be worth the trouble.

Oh no, I was not talking about exotic stuff like RT ... I was 
simply pointing out that to prevent deadlocks, and guarantee forward
progress, you have to show that despite what underlying fs/driver
code does, at least one memory freer is free to do its job. Else,
under low memory conditions, no memory freer can free up memory, so
the system is effectively hung. If you have to wait for mmap_sem, 
you can not easily do that (unless you are willing to do a trylock 
for mmap_sem, ie give up on a process and continue scanning for others). 
This is partly why after thinking about it, I did not attempt to do 
this myself. 
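
A minimal userspace sketch of the trylock idea mentioned above, not
kernel code. The names fake_mm, mm_trylock and scan_for_victims are
hypothetical, and a C11 atomic_flag stands in for mmap_sem. The scanner
never blocks: if it cannot take an mm's lock it gives up on that mm and
keeps scanning the others.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an mm: a lock plus reclaimable pages. */
struct fake_mm {
    atomic_flag mmap_busy;   /* plays the role of mmap_sem */
    int reclaimable;         /* pages the scanner could push out */
};

static void fake_mm_init(struct fake_mm *mm, int pages)
{
    atomic_flag_clear(&mm->mmap_busy);
    mm->reclaimable = pages;
}

/* Returns true if the lock was taken; never waits. */
static bool mm_trylock(struct fake_mm *mm)
{
    return !atomic_flag_test_and_set(&mm->mmap_busy);
}

static void mm_unlock(struct fake_mm *mm)
{
    atomic_flag_clear(&mm->mmap_busy);
}

/*
 * Walk every mm once. If an mm is busy (its holder may be asleep in
 * fs/driver code), skip it instead of sleeping on it.
 * Returns pages reclaimed; *skipped counts the busy mms.
 */
static int scan_for_victims(struct fake_mm *mms, size_t n, int *skipped)
{
    int freed = 0;

    assert(mms != NULL && skipped != NULL);
    *skipped = 0;
    for (size_t i = 0; i < n; i++) {
        if (!mm_trylock(&mms[i])) {
            (*skipped)++;
            continue;
        }
        freed += mms[i].reclaimable;
        mms[i].reclaimable = 0;
        mm_unlock(&mms[i]);
    }
    return freed;
}
```

The cost of this design is fairness: a process that holds its lock for
a long time is simply never scanned, so the others pay for it.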

Note that while Stephen's 2.2 kpiod work was probably aimed at
fixing fs deadlocks, I think it also gave the nice property that
the chances that the "swapout" method goes to sleep were reduced.
Not to 0, since make_pio_request() itself requests memory ...
Things are probably much better in 2.3; I am not up to date with
2.3.7 and 2.3.8.

Kanoj

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 20:51                             ` Kanoj Sarcar
@ 1999-06-28 21:32                               ` Chuck Lever
  1999-06-28 21:38                                 ` Kanoj Sarcar
  1999-06-28 22:21                                 ` Stephen C. Tweedie
  1999-06-28 22:08                               ` Stephen C. Tweedie
  1 sibling, 2 replies; 60+ messages in thread
From: Chuck Lever @ 1999-06-28 21:32 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: andrea, torvalds, sct, linux-mm

On Mon, 28 Jun 1999, Kanoj Sarcar wrote:
> > or perhaps the kernel could start more than one kswapd (one per swap
> > partition?).  with my patch, regular processes never wait for swap out
> > I/O, only kswapd does.
> > 
> > if you're concerned about bounding the latency of VM operations in order
> > to provide some RT guarantees, then i'd imagine, based on what i've read
> > on this list, that Linus might want to keep things simple more than he'd
> > want to clutter the memory freeing logic... but if there's a simple way to
> > "guarantee" a low latency then it would be worth the trouble.
> 
> Oh no, I was not talking about exotic stuff like RT ... I was 
> simply pointing out that to prevent deadlocks, and guarantee forward
> progress, you have to show that despite what underlying fs/driver
> code does, at least one memory freer is free to do its job. Else,
> under low memory conditions, no memory freer can free up memory, so
> the system is effectively hung. If you have to wait for mmap_sem, 
> you can not easily do that (unless you are willing to do a trylock 
> for mmap_sem, ie give up on a process and continue scanning for others). 
> This is partly why after thinking about it, I did not attempt to do 
> this myself. 

(i also tried down_trylock, but discarded it.)

well, except that kswapd itself doesn't free any memory.  it simply copies
data from memory to disk.  shrink_mmap() actually does the freeing, and
can do this with minimal locking, and from within regular application
processes.  when a process calls shrink_mmap(), it will cause some pages
to be made available to GFP.

if you need evidence that shrink_mmap() will keep a system running without
swapping, just run 2.3.8 :) :)

come to think of it, i don't think there is a safety guarantee in this
mechanism to prevent a lock-up.  i'll have to think more about it.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 21:32                               ` Chuck Lever
@ 1999-06-28 21:38                                 ` Kanoj Sarcar
  1999-06-28 21:50                                   ` Chuck Lever
  1999-06-28 22:22                                   ` Stephen C. Tweedie
  1999-06-28 22:21                                 ` Stephen C. Tweedie
  1 sibling, 2 replies; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 21:38 UTC (permalink / raw)
  To: Chuck Lever; +Cc: andrea, torvalds, sct, linux-mm

> 
> (i also tried down_trylock, but discarded it.)
> 
> well, except that kswapd itself doesn't free any memory.  it simply copies
> data from memory to disk.  shrink_mmap() actually does the freeing, and
> can do this with minimal locking, and from within regular application
> processes.  when a process calls shrink_mmap(), it will cause some pages
> to be made available to GFP.
> 

The page is not really free for reallocation unless kswapd can
push out the contents to disk, right? Which means kswapd should
have as few sleep/memory-allocation points as possible ...

Kanoj
kanoj@engr.sgi.com

> if you need evidence that shrink_mmap() will keep a system running without
> swapping, just run 2.3.8 :) :)
> 
> come to think of it, i don't think there is a safety guarantee in this
> mechanism to prevent a lock-up.  i'll have to think more about it.
> 

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 21:38                                 ` Kanoj Sarcar
@ 1999-06-28 21:50                                   ` Chuck Lever
  1999-06-28 22:15                                     ` Kanoj Sarcar
  1999-06-28 22:22                                   ` Stephen C. Tweedie
  1 sibling, 1 reply; 60+ messages in thread
From: Chuck Lever @ 1999-06-28 21:50 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: linux-mm

On Mon, 28 Jun 1999, Kanoj Sarcar wrote:
> > well, except that kswapd itself doesn't free any memory.  it simply copies
> > data from memory to disk.  shrink_mmap() actually does the freeing, and
> > can do this with minimal locking, and from within regular application
> > processes.  when a process calls shrink_mmap(), it will cause some pages
> > to be made available to GFP.
> 
> The page is not really free for reallocation, unless kswapd can
> push out the contents to disk, right? Which means, kswapd should
> have as minimal sleep/memallocation points as possible ...

kswapd itself always uses a gfp_mask that includes GFP_IO, so nothing it
calls will ever wait.  the I/O it schedules is asynchronous, and when
complete, the buffer exit code in end_buffer_io_async will set the page
flags appropriately for shrink_mmap() to come by and steal it. also, the
buffer code will use pre-allocated buffers if gfp fails.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 21:50                                   ` Chuck Lever
@ 1999-06-28 22:15                                     ` Kanoj Sarcar
  1999-06-29 11:23                                       ` Stephen C. Tweedie
  0 siblings, 1 reply; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 22:15 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-mm

> 
> On Mon, 28 Jun 1999, Kanoj Sarcar wrote:
> > > well, except that kswapd itself doesn't free any memory.  it simply copies
> > > data from memory to disk.  shrink_mmap() actually does the freeing, and
> > > can do this with minimal locking, and from within regular application
> > > processes.  when a process calls shrink_mmap(), it will cause some pages
> > > to be made available to GFP.
> > 
> > The page is not really free for reallocation, unless kswapd can
> > push out the contents to disk, right? Which means, kswapd should
> > have as minimal sleep/memallocation points as possible ...
> 
> kswapd itself always uses a gfp_mask that includes GFP_IO, so nothing it
> calls will ever wait.  the I/O it schedules is asynchronous, and when
> complete, the buffer exit code in end_buffer_io_async will set the page
> flags appropriately for shrink_mmap() to come by and steal it. also, the
> buffer code will use pre-allocated buffers if gfp fails.
>

Which is why you must guarantee that kswapd can always run, and keep
as few blocking points as possible ...

Kanoj

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 22:15                                     ` Kanoj Sarcar
@ 1999-06-29 11:23                                       ` Stephen C. Tweedie
  1999-06-29 17:36                                         ` Kanoj Sarcar
  0 siblings, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-29 11:23 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Chuck Lever, linux-mm

Hi,

On Mon, 28 Jun 1999 15:15:29 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

>> kswapd itself always uses a gfp_mask that includes GFP_IO, so nothing it
>> calls will ever wait.  the I/O it schedules is asynchronous, and when
>> complete, the buffer exit code in end_buffer_io_async will set the page
>> flags appropriately for shrink_mmap() to come by and steal it. also, the
>> buffer code will use pre-allocated buffers if gfp fails.
>> 

> Which is why you must gurantee that kswapd can always run, and keep
> as few blocking points as possible ...

Look, we're just going round in circles here.

kswapd *can* always run.

kswapd never ever waits in its memory allocation calls.  In
get_free_pages(), we special case PF_MEMALLOC processes (such as kswapd)
and completely avoid trying to free pages in that case: rather, we rely
on the free page thresholds preserving a last-chance set of free pages
which are _only_ usable by such processes.
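
A toy model of that reserve, as a sketch only. FREEPAGES_MIN, struct
zone, struct task and alloc_page_sketch are hypothetical names, not the
real get_free_pages() code. Ordinary callers are refused once the free
count reaches the threshold (a real kernel would try to reclaim at that
point); a PF_MEMALLOC caller skips reclaim and may eat into the last
pages, so it never has to wait.

```c
#include <assert.h>
#include <stdbool.h>

#define FREEPAGES_MIN 4   /* hypothetical size of the last-chance reserve */

struct task { bool pf_memalloc; };   /* true for kswapd-like tasks */
struct zone { int nr_free; };

/*
 * Returns true if a page was handed out. Never sleeps.
 * - ordinary task: refused once nr_free <= FREEPAGES_MIN
 * - PF_MEMALLOC task: may allocate until nr_free reaches 0
 */
static bool alloc_page_sketch(struct zone *z, const struct task *t)
{
    assert(z->nr_free >= 0);
    if (z->nr_free == 0)
        return false;
    if (z->nr_free <= FREEPAGES_MIN && !t->pf_memalloc)
        return false;
    z->nr_free--;
    return true;
}
```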

kswapd can wait for IO, but the block device layers go to great lengths
to ensure that this can always proceed safely.  If the device layers
need an extra memory allocation to succeed, that again is protected by
PF_MEMALLOC.

kswapd never waits for long-term-held filesystem locks: that is what
kpiod is for.

This architecture is very robust.  Add an extra mmap semaphore lock to
the swapout path and you destroy it.

--Stephen

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-29 11:23                                       ` Stephen C. Tweedie
@ 1999-06-29 17:36                                         ` Kanoj Sarcar
  0 siblings, 0 replies; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-29 17:36 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: cel, linux-mm

> 
> Hi,
> 
> On Mon, 28 Jun 1999 15:15:29 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
> 
> >> kswapd itself always uses a gfp_mask that includes GFP_IO, so nothing it
> >> calls will ever wait.  the I/O it schedules is asynchronous, and when
> >> complete, the buffer exit code in end_buffer_io_async will set the page
> >> flags appropriately for shrink_mmap() to come by and steal it. also, the
> >> buffer code will use pre-allocated buffers if gfp fails.
> >> 
> 
> > Which is why you must gurantee that kswapd can always run, and keep
> > as few blocking points as possible ...
> 
> Look, we're just going round in circles here.
> 
> kswapd *can* always run.
>

Not if you are going to try grabbing mmap_sem in that path ... 

Anyway, I guess we have established that is a bad idea ...

Kanoj 

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 21:38                                 ` Kanoj Sarcar
  1999-06-28 21:50                                   ` Chuck Lever
@ 1999-06-28 22:22                                   ` Stephen C. Tweedie
  1 sibling, 0 replies; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-28 22:22 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Chuck Lever, andrea, torvalds, sct, linux-mm

Hi,

On Mon, 28 Jun 1999 14:38:43 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> The page is not really free for reallocation, unless kswapd can
> push out the contents to disk, right? Which means, kswapd should
> have as minimal sleep/memallocation points as possible ...

The kswapd process is marked with the PF_MEMALLOC process flag, so any
recursive memory allocations it attempts get satisfied without IO being
invoked.  kswapd does not sleep during memory allocation.

--Stephen

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 21:32                               ` Chuck Lever
  1999-06-28 21:38                                 ` Kanoj Sarcar
@ 1999-06-28 22:21                                 ` Stephen C. Tweedie
  1999-06-28 22:57                                   ` Andrea Arcangeli
  1999-06-29  1:00                                   ` Chuck Lever
  1 sibling, 2 replies; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-28 22:21 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Kanoj Sarcar, andrea, torvalds, sct, linux-mm

Hi,

On Mon, 28 Jun 1999 17:32:05 -0400 (EDT), Chuck Lever <cel@monkey.org>
said:

> well, except that kswapd itself doesn't free any memory.  

It has to.  That was why kswapd was initially written, to ensure that
interrupt memory requests (eg. busy router boxes) don't starve of
memory.  All of the benefits of kswapd came later.  In normal kernels
the try_to_swap_out doesn't free memory, true enough, but kswapd calls
shrink_mmap() too to make sure it does make real progress in freeing
memory.

> if you need evidence that shrink_mmap() will keep a system running without
> swapping, just run 2.3.8 :) :)

2.3.8 shows up slower on several benchmarks because of its reluctance to
swap.

--Stephen

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 22:21                                 ` Stephen C. Tweedie
@ 1999-06-28 22:57                                   ` Andrea Arcangeli
  1999-06-29  2:13                                     ` Chuck Lever
  1999-06-29  1:00                                   ` Chuck Lever
  1 sibling, 1 reply; 60+ messages in thread
From: Andrea Arcangeli @ 1999-06-28 22:57 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Chuck Lever, Kanoj Sarcar, torvalds, linux-mm

On Mon, 28 Jun 1999, Stephen C. Tweedie wrote:

>> if you need evidence that shrink_mmap() will keep a system running without
>> swapping, just run 2.3.8 :) :)
>
>2.3.8 shows up slower on several benchmarks because of its reluctance to
>swap.

The point here is whether you are swapping to your ramdisk or to my HD :).
On my HD (system and swap on the same IDE disk) you must _avoid_ swapping
at all costs if you care about performance. And btw, with the clock
algorithm nobody can ever be sure of getting a good swap/cache balance.
With the page-LRU code I have almost ready for 2.3.x (definitely stable
for 2.2.x) we'll instead be sure to swap out only when there isn't plenty
of recyclable cache.

Andrea


* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 22:57                                   ` Andrea Arcangeli
@ 1999-06-29  2:13                                     ` Chuck Lever
  1999-06-29 12:01                                       ` Stephen C. Tweedie
  0 siblings, 1 reply; 60+ messages in thread
From: Chuck Lever @ 1999-06-29  2:13 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Stephen C. Tweedie, Kanoj Sarcar, torvalds, linux-mm

On Tue, 29 Jun 1999, Andrea Arcangeli wrote:
> On Mon, 28 Jun 1999, Stephen C. Tweedie wrote:
> 
> >> if you need evidence that shrink_mmap() will keep a system running without
> >> swapping, just run 2.3.8 :) :)
> >
> >2.3.8 shows up slower on several benchmarks because of its reluctance to
> >swap.
> 
> Here the point is if you are swapping over your ramdisk or over my HD :).
> Over my HD (system+swap all in the same IDE disk) you must _avoid_ to swap
> at all costs if you care about performances.

i'm not so sure about that.  swapping out, if efficiently done, is a
series of asynchronous sequential writes.  the only performance that will
interfere with is heavily I/O-bound applications.  even so, if it gets
more pages out of an application's way, then shrink_mmap will be less
destructive to your working set, which is a *good* thing, and your caches
will perform better.

at least, that's the way i've seen it with the workloads i've been playing
with.  so, i believe that swapping (paging) is my friend, up to a point.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-29  2:13                                     ` Chuck Lever
@ 1999-06-29 12:01                                       ` Stephen C. Tweedie
  1999-06-29 12:32                                         ` Andrea Arcangeli
  0 siblings, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-29 12:01 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Andrea Arcangeli, Stephen C. Tweedie, Kanoj Sarcar, linux-mm

Hi,

On Mon, 28 Jun 1999 22:13:15 -0400 (EDT), Chuck Lever <cel@monkey.org>
said:

> On Tue, 29 Jun 1999, Andrea Arcangeli wrote:
>> 
>> Here the point is if you are swapping over your ramdisk or over my HD :).
>> Over my HD (system+swap all in the same IDE disk) you must _avoid_ to swap
>> at all costs if you care about performances.

> i'm not so sure about that.  swapping out, if efficiently done, is a
> series of asynchronous sequential writes.  the only performance that will
> interfere with is heavily I/O-bound applications.  even so, if it gets
> more pages out of an application's way, then shrink_mmap will be less
> destructive to your working set, which is a *good* thing, and your caches
> will perform better.

Absolutely.  The important thing is to do enough swapping to make sure
that unused data is not kicking around in memory.  Maybe you don't want
the swapper to be active during your kernel compile, but if you have
less than a GB of physical memory then you probably want it to at least
think about swapping unused stuff out as the compilation starts.

If you defer swapping too much, you just end up doing more paging IO
since you can fit less of your working set into cache.

--Stephen


* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-29 12:01                                       ` Stephen C. Tweedie
@ 1999-06-29 12:32                                         ` Andrea Arcangeli
  1999-06-30 15:59                                           ` Stephen C. Tweedie
  0 siblings, 1 reply; 60+ messages in thread
From: Andrea Arcangeli @ 1999-06-29 12:32 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Chuck Lever, Kanoj Sarcar, linux-mm

On Tue, 29 Jun 1999, Stephen C. Tweedie wrote:

>Absolutely.  The important thing is to do enough swapping to make sure
>that unused data is not kicking around in memory.  Maybe you don't want

I know that sometimes it is the right thing to do.

But think also of a different scenario. You have a machine that does
nothing but read 10 GB of data from a disk in a loop. The data is so
big, and you reference it in such a round-robin fashion, that you have
no chance of finding one bit of it in the page cache (but don't tell me
not to use an LRU algorithm :).
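
The effect is easy to reproduce with a toy LRU cache. This is only a
sketch: CACHE_SLOTS, lru_access and cyclic_scan_hits are hypothetical
names, and the cache holds page numbers rather than real pages. When
the cyclic scan covers even one page more than the cache holds, strict
LRU always evicts the page that will be needed next, so the hit count
stays at zero.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 4

/* Tiny LRU cache: slot[0] is most recently used, slot[used-1] least. */
struct lru {
    long slot[CACHE_SLOTS];
    size_t used;
};

/* Touch one page; returns 1 on a cache hit, 0 on a miss. */
static int lru_access(struct lru *c, long page)
{
    size_t i, pos = 0;
    int hit = 0;

    for (i = 0; i < c->used; i++) {
        if (c->slot[i] == page) {
            pos = i;
            hit = 1;
            break;
        }
    }
    if (!hit) {
        if (c->used < CACHE_SLOTS)
            c->used++;        /* room left: take a fresh slot */
        pos = c->used - 1;    /* otherwise overwrite the LRU entry */
    }
    /* Shift entries [0, pos) down by one and put the page in front. */
    for (i = pos; i > 0; i--)
        c->slot[i] = c->slot[i - 1];
    c->slot[0] = page;
    return hit;
}

/* Read pages 0..npages-1 in order, `loops` times; return total hits. */
static long cyclic_scan_hits(long npages, int loops)
{
    struct lru c = { .used = 0 };
    long hits = 0;

    assert(npages > 0 && loops > 0);
    for (int l = 0; l < loops; l++)
        for (long p = 0; p < npages; p++)
            hits += lru_access(&c, p);
    return hits;
}
```

With CACHE_SLOTS = 4, ten loops over 5 pages give no hits at all, while
ten loops over 4 pages miss only on the first pass.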

So what do you gain? You find most of your tasks swapped out: when you
click on netscape on the other desktop you find yourself stalled. Then
you change desktop, the program continues to read from disk in the
background, and you find yourself stalled again the next time. In this
case you gain _nothing_ from swapping out netscape.

So I think we should make the swapout level at least configurable.

Andrea


* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-29 12:32                                         ` Andrea Arcangeli
@ 1999-06-30 15:59                                           ` Stephen C. Tweedie
  0 siblings, 0 replies; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-30 15:59 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Stephen C. Tweedie, Chuck Lever, Kanoj Sarcar, linux-mm

Hi,

On Tue, 29 Jun 1999 14:32:41 +0200 (CEST), Andrea Arcangeli
<andrea@suse.de> said:

> On Tue, 29 Jun 1999, Stephen C. Tweedie wrote:
>> Absolutely.  The important thing is to do enough swapping to make sure
>> that unused data is not kicking around in memory.  Maybe you don't want

> I know that sometime is the right thing do to.

> But think also a difference scenario. You have a machine that only reads
> all the time from a disk 10giga of data in loop. 

Absolutely.  The find|grep workload, for example.  The point is that
this memory load is different from the load imposed by a kernel build.
If you are using file IO more, you need to be turning the cache over
more.  

The old buffer cache had this property, and it worked very well indeed.
The buffer cache would try to recycle itself in preference to growing,
so for file-intensive workloads we would naturally evict cached data in
preference to swapping, but for memory-intensive compute workloads we
would be more likely to swap unused VM pages out.

--Stephen

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 22:21                                 ` Stephen C. Tweedie
  1999-06-28 22:57                                   ` Andrea Arcangeli
@ 1999-06-29  1:00                                   ` Chuck Lever
  1 sibling, 0 replies; 60+ messages in thread
From: Chuck Lever @ 1999-06-29  1:00 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Kanoj Sarcar, andrea, torvalds, linux-mm

On Mon, 28 Jun 1999, Stephen C. Tweedie wrote:
> On Mon, 28 Jun 1999 17:32:05 -0400 (EDT), Chuck Lever <cel@monkey.org>
> said:
> 
> > well, except that kswapd itself doesn't free any memory.  
> 
> It has to.  That was why kswapd was initially written, to ensure that
> interrupt memory requests (eg. busy router boxes) don't starve of
> memory.  All of the benefits of kswapd came later.  In normal kernels
> the try_to_swap_out doesn't free memory, true enough, but kswapd calls
> shrink_mmap() too to make sure it does make real progress in freeing
> memory.

again, foot in mouth.  i meant kswapd doesn't free any memory *by simply
swapping*.  that's what i get for typing when i'm hungry.

> > if you need evidence that shrink_mmap() will keep a system running without
> > swapping, just run 2.3.8 :) :)
> 
> 2.3.8 shows up slower on several benchmarks because of its reluctance to
> swap.

right, agreed.  but it doesn't stall, it just slows down.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 20:51                             ` Kanoj Sarcar
  1999-06-28 21:32                               ` Chuck Lever
@ 1999-06-28 22:08                               ` Stephen C. Tweedie
  1999-06-28 22:59                                 ` Andrea Arcangeli
  1999-06-29  0:53                                 ` Chuck Lever
  1 sibling, 2 replies; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-28 22:08 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Chuck Lever, andrea, torvalds, sct, linux-mm

Hi,

On Mon, 28 Jun 1999 13:51:03 -0700 (PDT), kanoj@google.engr.sgi.com (Kanoj Sarcar) said:

>> or perhaps the kernel could start more than one kswapd (one per swap
>> partition?).  with my patch, regular processes never wait for swap out
>> I/O, only kswapd does.

This is a mistake: such blocking is one of the prime ways in which we
can limit the rate at which processes can consume memory.

> Oh no, I was not talking about exotic stuff like RT ... I was 
> simply pointing out that to prevent deadlocks, and guarantee forward
> progress, you have to show that despite what underlying fs/driver
> code does, at least one memory freer is free to do its job. 

Yep, which is why we have a separate kpiod right now: it guarantees that
potential recursive fs locking stalls get shifted from kswapd to a
separate thread to make sure that kswapd can always make progress.
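
The hand-off can be pictured as a request queue between the two
threads. What follows is a single-threaded sketch with hypothetical
names (pio_queue, pio_enqueue, pio_drain), not the kpiod code. It uses
a fixed-size ring so that queueing needs no memory allocation, which is
a simplification. The scanning side only enqueues and never blocks; the
draining side is the only place that would take filesystem locks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PIO_SLOTS 8

struct pio_queue {
    long page[PIO_SLOTS];
    size_t head, tail, count;
};

/* Scanner side: hand the page off without ever sleeping. */
static bool pio_enqueue(struct pio_queue *q, long page)
{
    if (q->count == PIO_SLOTS)
        return false;         /* full: caller moves on to another page */
    q->page[q->tail] = page;
    q->tail = (q->tail + 1) % PIO_SLOTS;
    q->count++;
    return true;
}

/*
 * Worker side: pull requests in FIFO order. A real worker would do the
 * (possibly blocking) write here; the sketch just records the page in
 * out[]. Returns the number of requests taken.
 */
static size_t pio_drain(struct pio_queue *q, long *out, size_t max)
{
    size_t done = 0;

    assert(out != NULL);
    while (q->count > 0 && done < max) {
        out[done++] = q->page[q->head];
        q->head = (q->head + 1) % PIO_SLOTS;
        q->count--;
    }
    return done;
}
```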

--Stephen

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 22:08                               ` Stephen C. Tweedie
@ 1999-06-28 22:59                                 ` Andrea Arcangeli
  1999-06-29  0:53                                 ` Chuck Lever
  1 sibling, 0 replies; 60+ messages in thread
From: Andrea Arcangeli @ 1999-06-28 22:59 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Kanoj Sarcar, Chuck Lever, torvalds, linux-mm

On Mon, 28 Jun 1999, Stephen C. Tweedie wrote:

>Hi,
>
>On Mon, 28 Jun 1999 13:51:03 -0700 (PDT), kanoj@google.engr.sgi.com (Kanoj Sarcar) said:
>
>>> or perhaps the kernel could start more than one kswapd (one per swap
>>> partition?).  with my patch, regular processes never wait for swap out
>>> I/O, only kswapd does.
>
>This is a mistake: such blocking is one of the prime ways in which we
>can limit the rate at which processes can consume memory.

Agreed.

Andrea


* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 22:08                               ` Stephen C. Tweedie
  1999-06-28 22:59                                 ` Andrea Arcangeli
@ 1999-06-29  0:53                                 ` Chuck Lever
  1999-06-29 11:14                                   ` Stephen C. Tweedie
  1 sibling, 1 reply; 60+ messages in thread
From: Chuck Lever @ 1999-06-29  0:53 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Kanoj Sarcar, andrea, torvalds, linux-mm

On Mon, 28 Jun 1999, Stephen C. Tweedie wrote:
> On Mon, 28 Jun 1999 13:51:03 -0700 (PDT), kanoj@google.engr.sgi.com (Kanoj Sarcar) said:
> 
> >> or perhaps the kernel could start more than one kswapd (one per swap
> >> partition?).  with my patch, regular processes never wait for swap out
> >> I/O, only kswapd does.
> 
> This is a mistake: such blocking is one of the prime ways in which we
> can limit the rate at which processes can consume memory.

whoops.  i'm sorry, i mis-typed.  i meant that regular processes never
*dispatch* I/O.  neither kswapd nor regular processes will wait.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-29  0:53                                 ` Chuck Lever
@ 1999-06-29 11:14                                   ` Stephen C. Tweedie
  0 siblings, 0 replies; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-29 11:14 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Stephen C. Tweedie, Kanoj Sarcar, andrea, torvalds, linux-mm

Hi,

On Mon, 28 Jun 1999 20:53:23 -0400 (EDT), Chuck Lever <cel@monkey.org>
said:

> whoops.  i'm sorry, i mis-typed.  i meant that regular processes never
> *dispatch* I/O.  neither kswapd nor regular processes will wait.

Sorry?  That's just the same problem, restated.  If a regular process
will never wait on a memory allocation then you have no way of
throttling the memory allocation rate to the rate at which you can
swap stuff out.  That will kill your machine stone dead very rapidly
under heavy memory load.

--Stephen
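
The throttling argument above can be illustrated with a toy model. The
sketch below is not kernel code: the structure, the function name
simulate() and all of the rates are illustrative assumptions. It only
shows that when the allocator never waits, demand outruns what the swap
device can clean, whereas a waiting allocator is held to the swap-out
rate.

```c
#include <assert.h>

/*
 * Toy model (hypothetical, not kernel code): on every tick the swapper
 * cleans `swap_rate` pages and the allocator wants `alloc_rate` pages.
 */
struct sim_result {
	long free_pages;    /* free pages left at the end of the run */
	long stalled_ticks; /* ticks on which the allocator had to wait */
	int  exhausted;     /* 1 if an allocation found no memory at all */
};

struct sim_result simulate(long free_pages, long alloc_rate,
			   long swap_rate, int ticks, int throttle)
{
	struct sim_result r = { free_pages, 0, 0 };

	assert(alloc_rate >= 0 && swap_rate >= 0 && ticks >= 0);

	for (int t = 0; t < ticks; t++) {
		/* The swapper frees what the disk can absorb this tick. */
		r.free_pages += swap_rate;

		if (r.free_pages >= alloc_rate) {
			/* Enough memory: the allocation succeeds at once. */
			r.free_pages -= alloc_rate;
		} else if (throttle) {
			/* The allocator waits and takes only what exists. */
			r.free_pages = 0;
			r.stalled_ticks++;
		} else {
			/* Nobody waits: demand has outrun supply. */
			r.free_pages = 0;
			r.exhausted = 1;
			break;
		}
	}
	return r;
}
```

With 100 free pages, 10 pages wanted per tick and 2 pages cleaned per
tick, the unthrottled run exhausts memory on the 13th tick, while the
throttled run simply spends the remaining ticks waiting.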


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 19:55                         ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar
  1999-06-28 20:33                           ` Chuck Lever
@ 1999-06-28 22:09                           ` Stephen C. Tweedie
  1 sibling, 0 replies; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-28 22:09 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Chuck Lever, andrea, torvalds, sct, linux-mm

Hi,

On Mon, 28 Jun 1999 12:55:23 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

> Agreed this would be a nice thing to be able to do ...  Other than the
> deadlock problem, there's another issue involved, I think. Processes
> can go to sleep (inside drivers/fs for example while
> mmaping/munmaping/faulting) holding their mmap_sem, so any solution
> should be able to guarantee that (at least one of) the memory free'ers
> do not go to sleep indefinitely (or for some time that is up to
> driver/fs code to determine).

Which is why we don't take the mm semaphore in swapout.

--Stephen

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 19:39                       ` Chuck Lever
  1999-06-28 19:55                         ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar
@ 1999-06-28 20:45                         ` Stephen C. Tweedie
  1999-06-28 21:14                           ` Chuck Lever
  1 sibling, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-28 20:45 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Kanoj Sarcar, Andrea Arcangeli, torvalds, sct, linux-mm

Hi,

On Mon, 28 Jun 1999 15:39:43 -0400 (EDT), Chuck Lever <cel@monkey.org>
said:

> i'm already working on a patch that will allow kswapd to grab the
> mmap_sem for the task that is about to be swapped.  this takes a
> slightly different approach, since i'm focusing on kswapd and not on
> swapoff.  

Don't, it will create a whole pile of new deadlock conditions.  Think
carefully about what happens when you take a page fault, lock the mm,
and then need to allocate a new page in memory to satisfy the fault.
You end up recursively calling try_to_free_page, and if that needs to
reacquire the mm semaphore then you are in major trouble.  The same
mechanism can also block kswapd from making progress.

We've looked at this before: the reason swapout doesn't take the
semaphore is because the deadlock cases are worse than just living
with the current unlocked behaviour.  There's also the fact that
swapping can deal with multiple mms at the same time: if you fork, you
can get two mms which share the same COW page in memory or on swap.
As a result, mm locking doesn't actually buy you enough extra
protection for data pages to be worth it.

--Stephen
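
The recursion described above can be sketched with a toy, non-recursive
semaphore. Everything here (toy_sem, toy_page_fault, toy_reclaim) is a
hypothetical stand-in, not the kernel's semaphore API. A trylock is used
so that the example terminates; a blocking down() in the same position
would sleep forever on a semaphore the task already holds.

```c
#include <assert.h>

/* Toy non-recursive semaphore (hypothetical, single-threaded model). */
struct toy_sem {
	int held;
	int owner;
};

enum { ACQUIRED = 0, WOULD_BLOCK = 1 };

int toy_down_trylock(struct toy_sem *s, int task)
{
	if (s->held)
		return WOULD_BLOCK;
	s->held = 1;
	s->owner = task;
	return ACQUIRED;
}

void toy_up(struct toy_sem *s)
{
	assert(s->held);
	s->held = 0;
}

/* Reclaim path that insists on taking the victim's mm semaphore. */
int toy_reclaim(struct toy_sem *victim_mm, int task)
{
	if (toy_down_trylock(victim_mm, task) == WOULD_BLOCK)
		return WOULD_BLOCK; /* a blocking down() would hang here */
	toy_up(victim_mm);
	return ACQUIRED;
}

/* Page fault: lock our own mm, then need memory and enter reclaim. */
int toy_page_fault(struct toy_sem *own_mm, struct toy_sem *victim_mm,
		   int task)
{
	int rc = toy_down_trylock(own_mm, task);

	assert(rc == ACQUIRED);
	rc = toy_reclaim(victim_mm, task);
	toy_up(own_mm);
	return rc;
}
```

When the reclaim victim is a different mm the fault completes; when the
victim happens to be the faulting task's own mm, the second acquisition
cannot succeed.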


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 20:45                         ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie
@ 1999-06-28 21:14                           ` Chuck Lever
  1999-06-28 21:25                             ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar
                                               ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Chuck Lever @ 1999-06-28 21:14 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Kanoj Sarcar, Andrea Arcangeli, torvalds, linux-mm

On Mon, 28 Jun 1999, Stephen C. Tweedie wrote:
> On Mon, 28 Jun 1999 15:39:43 -0400 (EDT), Chuck Lever <cel@monkey.org>
> said:
> > i'm already working on a patch that will allow kswapd to grab the
> > mmap_sem for the task that is about to be swapped.  this takes a
> > slightly different approach, since i'm focusing on kswapd and not on
> > swapoff.  
> 
> Don't, it will create a whole pile of new deadlock conditions.  Think
> carefully about what happens when you take a page fault, lock the mm,
> and then need to allocate a new page in memory to satisfy the fault.
> You end up recursively calling try_to_free_page, and if that needs to
> reacquire the mm semaphore then you are in major trouble.

that doesn't hurt because try_to_free_page() doesn't acquire anything but
the kernel lock in my patch.  it looks something like:

int try_to_free_pages(unsigned int gfp_mask)
{
	int priority = 6;
	int count = pager_daemon.swap_cluster;
 
 	wake_up_process(kswapd_process);

	lock_kernel();
	do {
		while (shrink_mmap(priority, gfp_mask)) {
			if (!--count)
				goto done;
		}

		shrink_dcache_memory(priority, gfp_mask);
	} while (--priority >= 0);
done:
	/* maybe slow this thread down while kswapd catches up */
	if (gfp_mask & __GFP_WAIT) {
		current->policy |= SCHED_YIELD;
		schedule();
	}
	unlock_kernel();
	return 1;
}

> The same mechanism can also block kswapd from making progress.

i'm re-using the mmap_sem, not the mm_sem.  only the mmap_sem for the
about-to-be-swapped object is acquired by kswapd.  is that unsafe?
or just silly?

> There's also the fact that
> swapping can deal with multiple mms at the same time: if you fork, you
> can get two mms which share the same COW page in memory or on swap.
> As a result, mm locking doesn't actually buy you enough extra
> protection for data pages to be worth it.

the eventual goal of my adventure is to drop the kernel lock while doing
the page COW in do_wp_page, since in 2.3.6+, the COW is again protected
because of race conditions with kswapd.  this "protection" serializes all
page faults behind a very expensive memory copy.  what other ways are
there to protect the COW operation while allowing some parallelism?  it
seems like this is worth a little complexity, IMO.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8
  1999-06-28 21:14                           ` Chuck Lever
@ 1999-06-28 21:25                             ` Kanoj Sarcar
  1999-06-28 22:15                             ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie
  1999-06-28 22:48                             ` Andrea Arcangeli
  2 siblings, 0 replies; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 21:25 UTC (permalink / raw)
  To: Chuck Lever; +Cc: sct, andrea, torvalds, linux-mm

> 
> the eventual goal of my adventure is to drop the kernel lock while doing
> the page COW in do_wp_page, since in 2.3.6+, the COW is again protected
> because of race conditions with kswapd.  this "protection" serializes all
> page faults behind a very expensive memory copy.  what other ways are
> there to protect the COW operation while allowing some parallelism?  it
> seems like this is worth a little complexity, IMO.
>

I have already commented on my reservations about holding mmap_sem
in kswapd/try_to_free_pages. 

Just thought I would point out that I have been thinking along the
lines of eliminating kernel_lock from the vm code (experimentally
initially under a CONFIG option), but I have yet to come up with
a complete design. My current ideas involve a per-mm spinning pte
lock and a sleeping vmalist mutex (which processes never go to sleep
holding). I am still struggling to understand whether a per-page
lock is needed or whether a swapcache lock will do.

In any case, if someone on the list is working on something
similar, maybe we can exchange notes offline. Of course, we cannot
perturb performance for kernels which do not have the
CONFIG option set. And Linus has to agree it's worthwhile doing
this work ....

Thanks.

Kanoj
kanoj@engr.sgi.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 21:14                           ` Chuck Lever
  1999-06-28 21:25                             ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar
@ 1999-06-28 22:15                             ` Stephen C. Tweedie
  1999-06-28 22:48                             ` Andrea Arcangeli
  2 siblings, 0 replies; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-28 22:15 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Stephen C. Tweedie, Kanoj Sarcar, Andrea Arcangeli, torvalds, linux-mm

Hi,

On Mon, 28 Jun 1999 17:14:17 -0400 (EDT), Chuck Lever <cel@monkey.org>
said:

> that doesn't hurt because try_to_free_page() doesn't acquire anything but
> the kernel lock in my patch.  

Removing swapout from try_to_free_page is fundamentally broken, since it
removes a critical rate limiter from the vm allocator paths.  Acquiring
it in kswapd is still a deadlock situation.

> the eventual goal of my adventure is to drop the kernel lock while doing
> the page COW in do_wp_page, since in 2.3.6+, the COW is again protected
> because of race conditions with kswapd.  

OK, but doing this by adding extra mm locks to the swapout path is
itself fraught with deadlocks, and doesn't get around the fact that
multiple different mm's can reference the same swap page so you don't
actually eliminate all of the races anyway by adding that locking.

--Stephen


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 21:14                           ` Chuck Lever
  1999-06-28 21:25                             ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar
  1999-06-28 22:15                             ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie
@ 1999-06-28 22:48                             ` Andrea Arcangeli
  1999-06-29  1:29                               ` Chuck Lever
                                                 ` (2 more replies)
  2 siblings, 3 replies; 60+ messages in thread
From: Andrea Arcangeli @ 1999-06-28 22:48 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Stephen C. Tweedie, Kanoj Sarcar, torvalds, linux-mm

On Mon, 28 Jun 1999, Chuck Lever wrote:

>that doesn't hurt because try_to_free_page() doesn't acquire anything but
>the kernel lock in my patch.  it looks something like:
>
>int try_to_free_pages(unsigned int gfp_mask)
>{
>	int priority = 6;
>	int count = pager_daemon.swap_cluster;
> 
> 	wake_up_process(kswapd_process);
>
>	lock_kernel();
>	do {
>		while (shrink_mmap(priority, gfp_mask)) {
>			if (!--count)
>				goto done;
>		}
>
>		shrink_dcache_memory(priority, gfp_mask);
>	} while (--priority >= 0);
>done:
>	/* maybe slow this thread down while kswapd catches up */
>	if (gfp_mask & __GFP_WAIT) {
>		current->policy |= SCHED_YIELD;
>		schedule();
>	}
>	unlock_kernel();
>	return 1;
>}

How do you get the information about "when" to start the swap activities?
Maybe you have a separate try_to_free_pages() that does the plain-current
try_to_free_pages() and you call it only from kswapd?

My guess is that you'll end up with zero cache and you'll have to page in
from disk like h*ell when you reach swap, resulting in really bad
interactive behaviour.

I think that being able to swapout from the process context is a very nice
feature because it causes the thrashing task to block. This may not look
very important with the current low_on_memory bit, but here I have a
per-task `trashing_memory' bitflag :).

Anyway we may re-implement recursive semaphores to avoid deadlocking into
the page fault path...

>the eventual goal of my adventure is to drop the kernel lock while doing
>the page COW in do_wp_page, since in 2.3.6+, the COW is again protected
>because of race conditions with kswapd.  this "protection" serializes all

I thought a bit about that as well. I also coded a maybe possible
solution. Look at this snapshot:

Index: linux/mm/memory.c
===================================================================
RCS file: /var/cvs/linux/mm/memory.c,v
retrieving revision 1.1.1.10
retrieving revision 1.1.2.39
diff -u -r1.1.1.10 -r1.1.2.39
--- linux/mm/memory.c	1999/06/28 15:10:09	1.1.1.10
+++ linux/mm/memory.c	1999/06/28 17:08:59	1.1.2.39
@@ -607,16 +618,23 @@
 	struct page * page;
 	
 	new_page = __get_free_page(GFP_USER);
-	/* Did swap_out() unmap the protected page while we slept? */
-	if (pte_val(*page_table) != pte_val(pte))
-		goto end_wp_page;
 	old_page = pte_page(pte);
 	if (MAP_NR(old_page) >= max_mapnr)
 		goto bad_wp_page;
 	tsk->min_flt++;
 	page = mem_map + MAP_NR(old_page);
-	
+
+	lock_page(page);
 	/*
+	 * We can release the big kernel lock here since
+	 * kswapd will see the page locked. -Andrea
+	 */
+	unlock_kernel();
+	/* Did swap_out() unmap the protected page while we slept? */
+	if (pte_val(*page_table) != pte_val(pte))
+		goto end_wp_page;
+
+	/*
 	 * We can avoid the copy if:
 	 * - we're the only user (count == 1)
 	 * - the only other user is the swap cache,
@@ -630,19 +648,15 @@
 			break;
 		if (swap_count(page->offset) != 1)
 			break;
+		lru_unmap_cache(page);
 		delete_from_swap_cache(page);
+		put_page_refcount(page);
 		/* FallThrough */
 	case 1:
 		flush_cache_page(vma, address);
 		set_pte(page_table, pte_mkdirty(pte_mkwrite(pte)));
 		flush_tlb_page(vma, address);
-end_wp_page:
-		/*
-		 * We can release the kernel lock now.. Now swap_out will see
-		 * a dirty page and so won't get confused and flush_tlb_page
-		 * won't SMP race. -Andrea
-		 */
-		unlock_kernel();
+		UnlockPage(page);
 
 		if (new_page)
 			free_page(new_page);
@@ -652,6 +666,7 @@
 	if (!new_page)
 		goto no_new_page;
 
+	lru_unmap_cache(page);
 	if (PageReserved(page))
 		++vma->vm_mm->rss;
 	copy_cow_page(old_page,new_page);
@@ -660,18 +675,26 @@
 	flush_cache_page(vma, address);
 	set_pte(page_table, pte_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot))));
 	flush_tlb_page(vma, address);
-	unlock_kernel();
+	UnlockPage(page);
 	__free_page(page);
 	return 1;
 
 bad_wp_page:
+	unlock_kernel();
 	printk("do_wp_page: bogus page at address %08lx (%08lx)\n",address,old_page);
 	send_sig(SIGKILL, tsk, 1);
-no_new_page:
-	unlock_kernel();
 	if (new_page)
 		free_page(new_page);
 	return 0;
+no_new_page:
+	UnlockPage(page);
+	oom(tsk);
+	return 0;
+end_wp_page:
+	UnlockPage(page);
+	if (new_page)
+		free_page(new_page);
+	return 1;
 }
 
 /*


It's only a partial snapshot, but it should show the picture. Basically I
am locking down the page with the big kernel lock held, then when I have
the page locked (I may sleep as well to lock it) I check if kswapd freed
the mapping or if I can go ahead without the big kernel lock. It basically
works but I have not had the time to test it carefully yet.

Andrea
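
The "lock the page, then re-check the pte" pattern described above can be
sketched in isolation. This is a single-threaded toy under stated
assumptions: toy_page, toy_pte and both functions are hypothetical names,
the real lock_page() may sleep, and nothing here models SMP ordering or
the swap cache.

```c
#include <assert.h>

struct toy_page { int locked; };
struct toy_pte  { unsigned long val; };

enum cow_result { COW_RETRY, COW_DONE };

/* Swapper side: leave locked pages alone, otherwise unmap the pte. */
int toy_try_to_swap_out(struct toy_page *page, struct toy_pte *pte)
{
	if (page->locked)
		return 0;	/* skipped */
	pte->val = 0;		/* unmapped */
	return 1;
}

/*
 * Fault side: take the page lock first, then check whether the pte
 * still holds the value the fault handler originally saw.
 */
enum cow_result toy_do_wp_page(struct toy_page *page, struct toy_pte *pte,
			       unsigned long seen, unsigned long new_val)
{
	assert(!page->locked);
	page->locked = 1;

	/* Did the swapper unmap the page before we got the lock? */
	if (pte->val != seen) {
		page->locked = 0;
		return COW_RETRY;
	}

	/* Safe to install the writable mapping: the swapper skips us. */
	pte->val = new_val;
	page->locked = 0;
	return COW_DONE;
}
```

The point of the ordering is that once the page is locked the re-check
result cannot be invalidated by the swapper, so the copy itself need not
run under a global lock.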


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 22:48                             ` Andrea Arcangeli
@ 1999-06-29  1:29                               ` Chuck Lever
  1999-06-29 11:58                                 ` Stephen C. Tweedie
  1999-06-29 12:09                                 ` Andrea Arcangeli
  1999-06-29 11:55                               ` Stephen C. Tweedie
  1999-06-29 20:08                               ` Andrea Arcangeli
  2 siblings, 2 replies; 60+ messages in thread
From: Chuck Lever @ 1999-06-29  1:29 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Stephen C. Tweedie, Kanoj Sarcar, torvalds, linux-mm

On Tue, 29 Jun 1999, Andrea Arcangeli wrote:
> On Mon, 28 Jun 1999, Chuck Lever wrote:
> >that doesn't hurt because try_to_free_page() doesn't acquire anything but
> >the kernel lock in my patch.  it looks something like:
> >
> >int try_to_free_pages(unsigned int gfp_mask)
> >{
> >	int priority = 6;
> >	int count = pager_daemon.swap_cluster;
> > 
> > 	wake_up_process(kswapd_process);
> >
> >	lock_kernel();
> >	do {
> >		while (shrink_mmap(priority, gfp_mask)) {
> >			if (!--count)
> >				goto done;
> >		}
> >
> >		shrink_dcache_memory(priority, gfp_mask);
> >	} while (--priority >= 0);
> >done:
> >	/* maybe slow this thread down while kswapd catches up */
> >	if (gfp_mask & __GFP_WAIT) {
> >		current->policy |= SCHED_YIELD;
> >		schedule();
> >	}
> >	unlock_kernel();
> >	return 1;
> >}
> 
> How do you get the information about "when" to start the swap activities?

try_to_free_pages() still wakes up kswapd whenever it is called.

> Maybe you have a separate try_to_free_pages() that does the plain-current
> try_to_free_pages() and you call it only from kswapd?

yes, that's exactly what i did.  what i can't figure out is why do the
shrink_mmap in both places?  seems like the shrink_mmap in kswapd is
overkill if it has just been awoken by try_to_free_pages.

> My guess is that you'll end up with zero cache and you'll have to page in
> from disk like h*ell when you reach swap, resulting in really bad
> interactive behaviour.

nope.  it appears to work as well as the old way, maybe even a little
faster.  i still need to do more testing, though.

> I think that being able to swapout from the process context is a very nice
> feature because it causes the thrashing task to block. This may not look
> very important with the current low_on_memory bit, but here I have a
> per-task `trashing_memory' bitflag :).

swapping out never blocks a thread, since the swap out I/O request is
always asynchronous.  line 162 of mm/vmscan.c ::

        /* OK, do a physical asynchronous write to swap.  */
        rw_swap_page(WRITE, entry, (char *) page, 0);

stephen also mentioned "rate controlling" a thrashing process, but since
nothing in swap_out spins or sleeps, how could a process be slowed except
by a little extra CPU time spent behind the global lock?  that will slow
everyone else down too, yes?

seems like try_to_free_pages ought to make a clear effort to recognize a
process that is growing quickly and slow it down by causing it to sleep.

> >the eventual goal of my adventure is to drop the kernel lock while doing
> >the page COW in do_wp_page, since in 2.3.6+, the COW is again protected
> >because of race conditions with kswapd.  this "protection" serializes all
> 
> It's only a partial snapshot, but it should show the picture. Basically I
> am locking down the page with the big kernel lock held, then when I have
> the page locked (I may sleep as well to lock it) I check if kswapd freed
> the mapping or if I can go ahead without the big kernel lock. It basically
> works but I have not had the time to test it carefully yet.

locking pages is probably the right answer, IMHO.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-29  1:29                               ` Chuck Lever
@ 1999-06-29 11:58                                 ` Stephen C. Tweedie
  1999-06-29 12:09                                 ` Andrea Arcangeli
  1 sibling, 0 replies; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-29 11:58 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Andrea Arcangeli, Stephen C. Tweedie, Kanoj Sarcar, linux-mm

Hi,

On Mon, 28 Jun 1999 21:29:07 -0400 (EDT), Chuck Lever <cel@monkey.org>
said:

> yes, that's exactly what i did.  what i can't figure out is why do the
> shrink_mmap in both places?  seems like the shrink_mmap in kswapd is
> overkill if it has just been awoken by try_to_free_pages.

It hasn't necessarily.  It may have been woken by networking activity.
If the memory requirements are being driven by interrupts, not
processes, then kswapd is the only chance for shrink_mmap to be called.

> stephen also mentioned "rate controlling" a thrashing process, but since
> nothing in swap_out spins or sleeps, how could a process be slowed except
> by a little extra CPU time spent behind the global lock?  that will slow
> everyone else down too, yes?

There are IO queue limits which will eventually stall the process.  The
ll_rw_block itself is one rate limiter.  We also have a test in
rw_swap_page_base:

	/* Don't allow too many pending pages in flight.. */
	if (atomic_read(&nr_async_pages) > pager_daemon.swap_cluster)
		wait = 1;

which causes the swapout to become synchronous once we have filled the
swapper queues.

> seems like try_to_free_pages ought to make a clear effort to recognize a
> process that is growing quickly and slow it down by causing it to sleep.

It does. 

--Stephen
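
The effect of the nr_async_pages test quoted above can be shown with a
small model. The comparison mirrors the quoted condition; everything
else (toy_swapper, toy_rw_swap_page, toy_io_complete and the value 32
for swap_cluster) is an assumption made for illustration, not the real
rw_swap_page_base().

```c
#include <assert.h>

/* Hypothetical stand-in for the swapper's in-flight accounting. */
struct toy_swapper {
	int nr_async_pages;	/* asynchronous writes still in flight */
	int swap_cluster;	/* limit on pending asynchronous pages */
};

/*
 * Issue one swap write.  Returns 1 if the caller had to wait for it
 * (synchronous), 0 if it was queued asynchronously.
 */
int toy_rw_swap_page(struct toy_swapper *s, int wait)
{
	assert(s->swap_cluster > 0);

	/* Don't allow too many pending pages in flight. */
	if (s->nr_async_pages > s->swap_cluster)
		wait = 1;

	if (!wait) {
		s->nr_async_pages++;
		return 0;
	}
	/* Synchronous write: the caller sleeps until the I/O is done. */
	return 1;
}

/* Called when one asynchronous write has reached the disk. */
void toy_io_complete(struct toy_swapper *s)
{
	assert(s->nr_async_pages > 0);
	s->nr_async_pages--;
}
```

Issuing 40 writes back to back with a limit of 32 queues the first 33
asynchronously and forces the remaining 7 to wait, which is where a
fast-allocating process starts to be slowed to disk speed.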


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-29  1:29                               ` Chuck Lever
  1999-06-29 11:58                                 ` Stephen C. Tweedie
@ 1999-06-29 12:09                                 ` Andrea Arcangeli
  1999-06-29 15:27                                   ` Chuck Lever
  1 sibling, 1 reply; 60+ messages in thread
From: Andrea Arcangeli @ 1999-06-29 12:09 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Stephen C. Tweedie, Kanoj Sarcar, torvalds, linux-mm

On Mon, 28 Jun 1999, Chuck Lever wrote:

>yes, that's exactly what i did.  what i can't figure out is why do the
>shrink_mmap in both places?  seems like the shrink_mmap in kswapd is
>overkill if it has just been awoken by try_to_free_pages.

If you remove the shrink_mmap from kswapd then you'll start swapping all
the time. shrink_mmap gives us the information about the state of
the VM. So if you run it then you know if you should start swapping or
not.

>faster.  i still need to do more testing, though.

I suggest you run some memory hog that rotates 20-30 MB of data through
swap to check interactive performance.

>swapping out never blocks a thread, since the swap out I/O request is
>always asynchronous.  line 162 of mm/vmscan.c ::
>
>        /* OK, do a physical asynchronous write to swap.  */
>        rw_swap_page(WRITE, entry, (char *) page, 0);

At some point you must stop, at worst when you run out of requests. The
rate at which you eat memory is far higher than the swapout speed.

Since running out of requests is too large a bound, we have a
nr_async_pages limit after which we do sync I/O (set to SWAP_CLUSTER_MAX
by default: 32 pages async, then sync I/O).

>stephen also mentioned "rate controlling" a thrashing process, but since
>nothing in swap_out spins or sleeps, how could a process be slowed except
>by a little extra CPU time spent behind the global lock?  that will slow
>everyone else down too, yes?

swapout stalls. It has to stall since memory is faster than disk.

>> It's only a partial snapshot, but it should show the picture. Basically I
>> am locking down the page with the big kernel lock held, then when I have
>> the page locked (I may sleep as well to lock it) I check if kswapd freed
>> the mapping or if I can go ahead without the big kernel lock. It basically
>> works but I have not had the time to test it carefully yet.
>
>locking pages is probably the right answer, IMHO.

Happy to hear that :).

Andrea


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-29 12:09                                 ` Andrea Arcangeli
@ 1999-06-29 15:27                                   ` Chuck Lever
  0 siblings, 0 replies; 60+ messages in thread
From: Chuck Lever @ 1999-06-29 15:27 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Stephen C. Tweedie, Kanoj Sarcar, torvalds, linux-mm

On Tue, 29 Jun 1999, Andrea Arcangeli wrote:
> On Mon, 28 Jun 1999, Chuck Lever wrote:
> >yes, that's exactly what i did.  what i can't figure out is why do the
> >shrink_mmap in both places?  seems like the shrink_mmap in kswapd is
> >overkill if it has just been awoken by try_to_free_pages.
> 
> If you remove the shrink_mmap from kswapd then you'll start swapping all
> the time.

yes, i discovered that rather quickly when i tried it. :)

> shrink_mmap gives us the information about the state of
> the VM. So if you run it then you know if you should start swapping or
> not.

but it also "destroys" that state while it's running.  it would be much
nicer, i think, if there was a way to ascertain the state cheaply, then
decide whether to shrink caches or swap, or both.  i think a better
decision could be made this way.  what do you think about separating
shrink_mmap's function into two separate pieces:  maintain state
information, and trim caches?

i've been studying a hard knee that occurs just as the system exhausts
memory and try_to_free_pages is invoked.  performance drops rather
dramatically.  while i was playing around with kswapd, i noticed that when
my system started to swap more during low-memory scenarios, it seemed to
perform better; the knee is "softened".

by switching back and forth between an "all swap all the time" model and
an "all shrink_mmap all the time" model, it was clear to me, at least for
my workload, that shrink_mmap is valuable up to a point, but swapping is
quite effective at increasing available memory because its heuristic for
choosing a memory-idle process is very good (based on watching subsequent 
swap-in numbers), and there is probably 10-12M of idle crap that can be
flushed if the system gets loaded down, that currently is left in RAM.

in my opinion, the kernel is using shrink_mmap too much and not swapping
enough.  but it isn't clear to me exactly how to rebalance the two, or how
to gather more information in do_try_to_free_pages to make a better
decision about how to get back some memory.

> I suggest you run some memory hog that rotates 20-30 MB of data through
> swap to check interactive performance.

i have a test that does roughly this -- diff two kernel source trees.

however, it's clear that breaking try_to_free_pages and kswapd into two
separate paths won't provide the locking gain i was after.  however,
unrelated to the above discussion, do_try_to_free_pages may hold onto the
kernel lock for a long time, so finding a safe place for shrink_mmap
and/or swap_out to release it occasionally would help.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 22:48                             ` Andrea Arcangeli
  1999-06-29  1:29                               ` Chuck Lever
@ 1999-06-29 11:55                               ` Stephen C. Tweedie
  1999-06-29 20:08                               ` Andrea Arcangeli
  2 siblings, 0 replies; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-29 11:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Chuck Lever, Stephen C. Tweedie, Kanoj Sarcar, torvalds, linux-mm

Hi,

On Tue, 29 Jun 1999 00:48:18 +0200 (CEST), Andrea Arcangeli
<andrea@suse.de> said:

> I thought a bit about that as well. I also coded a maybe possible
> solution. Look at this snapshot:

Much better: the synchronisation between the page fault and the swapper
is per-page, not per-mm, this way.  That way the swapper can afford
just to skip the one locked page rather than block for an mm lock.  My
only reservation is that it's a bit ugly to overload the "locked" bit
this way, but it's the only obvious test in try_to_swap_out that we can
use.  

Adding a new PG_Locked_PTE flag for the page, to indicate that somebody
is relying on this pte for COW operation and kswapd should skip it,
would be an alternative: it makes the intent much more clear (and keeps
PG_Locked purely for IO locking, which is really as it should be).

--Stephen
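
A separate flag of the kind suggested above might look roughly like this.
It is only a sketch: the bit positions, the TOY_ names and
toy_scan_page() are invented for illustration and do not correspond to
any flag in the kernel tree. The idea is that the swapper can tell "page
is under I/O" apart from "a COW fault is relying on this pte".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; the positions are arbitrary. */
#define TOY_PG_LOCKED		(1UL << 0)	/* page is under I/O */
#define TOY_PG_LOCKED_PTE	(1UL << 1)	/* COW fault holds the pte */

struct toy_page {
	unsigned long flags;
};

enum scan_action {
	SCAN_SWAP_OUT,	/* page is a candidate for swap-out */
	SCAN_SKIP_IO,	/* skip: I/O in progress */
	SCAN_SKIP_COW	/* skip: a COW fault is using the pte */
};

/* What a swapper scan would decide for one page in this model. */
enum scan_action toy_scan_page(const struct toy_page *page)
{
	assert(page != NULL);

	if (page->flags & TOY_PG_LOCKED_PTE)
		return SCAN_SKIP_COW;
	if (page->flags & TOY_PG_LOCKED)
		return SCAN_SKIP_IO;
	return SCAN_SWAP_OUT;
}
```

Either flag makes the scan skip the page without blocking; keeping them
distinct just records why, and leaves the I/O lock bit with a single
meaning.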

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races
  1999-06-28 22:48                             ` Andrea Arcangeli
  1999-06-29  1:29                               ` Chuck Lever
  1999-06-29 11:55                               ` Stephen C. Tweedie
@ 1999-06-29 20:08                               ` Andrea Arcangeli
  2 siblings, 0 replies; 60+ messages in thread
From: Andrea Arcangeli @ 1999-06-29 20:08 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Stephen C. Tweedie, Kanoj Sarcar, torvalds, linux-mm

On Tue, 29 Jun 1999, Andrea Arcangeli wrote:

For the record: the snapshot wasn't SMP safe.

> 	/*
>+	 * We can release the big kernel lock here since
>+	 * kswapd will see the page locked. -Andrea
>+	 */
>+	unlock_kernel();

This was a bit too early (perfectly ok for kswapd but not ok for the swap
cache SMP safety). We must first take over the swap cache and run
swap_count before being allowed to release the big kernel lock. So this
should be moved a bit lower...
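
The intended ordering can be written down as a small user-space model. This
is only a sketch under my own assumptions: struct fault_sketch and the
sketch_* helpers are hypothetical and merely track which steps have run;
they are not the functions in the snapshot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the fault path state; not kernel code. */
struct fault_sketch {
    bool kernel_lock_held;
    bool page_locked;
    bool swap_cache_owned;
    bool swap_count_checked;
};

void sketch_lock_page(struct fault_sketch *f)
{
    f->page_locked = true;
}

void sketch_take_swap_cache(struct fault_sketch *f)
{
    /* Swap cache state is only stable while the big lock is held. */
    assert(f->kernel_lock_held && f->page_locked);
    f->swap_cache_owned = true;
}

void sketch_check_swap_count(struct fault_sketch *f)
{
    assert(f->kernel_lock_held && f->swap_cache_owned);
    f->swap_count_checked = true;
}

/* Returns false when called too early, which is the bug described above. */
bool sketch_unlock_kernel(struct fault_sketch *f)
{
    if (!f->page_locked || !f->swap_cache_owned || !f->swap_count_checked)
        return false;
    f->kernel_lock_held = false;
    return true;
}
```

In this model, dropping the lock right after locking the page is refused;
it only succeeds once the swap cache and swap_count steps are done.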

Andrea

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions
  1999-06-21 18:46           ` Kanoj Sarcar
  1999-06-21 23:44             ` Kanoj Sarcar
@ 1999-06-28 22:36             ` Stephen C. Tweedie
  1999-06-28 23:24               ` Kanoj Sarcar
  1 sibling, 1 reply; 60+ messages in thread
From: Stephen C. Tweedie @ 1999-06-28 22:36 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Stephen C. Tweedie, linux-mm

Hi,

On Mon, 21 Jun 1999 11:46:27 -0700 (PDT), kanoj@google.engr.sgi.com
(Kanoj Sarcar) said:

>> Look no further than swap_in(), which knows that there is no pte (so
>> swapout concurrency is not a problem) and it holds the mmap lock (so
>> there are no concurrent swap_ins on the page).  It reads in the page and
>> unconditionally sets up the pte to point to it, assuming that nobody
>> else can conceivably set the pte while we do the swap ourselves.

> Hmm, am I being fooled by the comment in swap_in?

> /*
>  * The tests may look silly, but it essentially makes sure that
>  * no other process did a swap-in on us just as we were waiting.
>  *

afaik only swapoff can trigger that.  Concurrent swap-in on the same
entry can occur into the page cache, but not into the page tables
because those are protected by the semaphore.
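
The test that comment refers to is a recheck-after-sleep pattern, roughly
as below. This is a sketch with hypothetical names (pte_sketch_t,
install_if_unchanged), not the actual swap_in() code: the caller remembers
the swap entry it saw before sleeping in the read and only installs the new
pte if that entry is still there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pte representation: just an opaque word. */
typedef unsigned long pte_sketch_t;

/*
 * Install new_pte only if the slot still holds the entry we saw before
 * sleeping. Returns true if we installed it, false if somebody else
 * (swapoff, per the discussion above) changed the pte while we waited.
 */
bool install_if_unchanged(pte_sketch_t *ptep,
                          pte_sketch_t entry_before_sleep,
                          pte_sketch_t new_pte)
{
    assert(ptep != NULL);
    if (*ptep != entry_before_sleep)
        return false;   /* lost the race: drop our page, keep their pte */
    *ptep = new_pte;
    return true;
}
```

The compare and the store are two separate steps here, so the sketch is
only correct when the caller already holds whatever lock serialises pte
updates.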

--Stephen

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: filecache/swapcache questions
  1999-06-28 22:36             ` filecache/swapcache questions Stephen C. Tweedie
@ 1999-06-28 23:24               ` Kanoj Sarcar
  0 siblings, 0 replies; 60+ messages in thread
From: Kanoj Sarcar @ 1999-06-28 23:24 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-mm

> 
> Hi,
> 
> On Mon, 21 Jun 1999 11:46:27 -0700 (PDT), kanoj@google.engr.sgi.com
> (Kanoj Sarcar) said:
> 
> >> Look no further than swap_in(), which knows that there is no pte (so
> >> swapout concurrency is not a problem) and it holds the mmap lock (so
> >> there are no concurrent swap_ins on the page).  It reads in the page and
> >> unconditionally sets up the pte to point to it, assuming that nobody
> >> else can conceivably set the pte while we do the swap ourselves.
> 
> > Hmm, am I being fooled by the comment in swap_in?
> 
> > /*
> >  * The tests may look silly, but it essentially makes sure that
> >  * no other process did a swap-in on us just as we were waiting.
> >  *
> 
> afaik only swapoff can trigger that.  Concurrent swap-in on the same
> entry can occur into the page cache, but not into the page tables
> because those are protected by the semaphore.
> 
> --Stephen
> 

Right ... I was trying to counter your argument that swapoff needs
to hold the mmap_sem to protect ptes (except for the fork/exit/swapin 
races) by pointing out that pte updates are already protected by 
kernel_lock.

Kanoj

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~1999-06-30 18:05 UTC | newest]

Thread overview: 60+ messages
1999-06-21  5:29 filecache/swapcache questions Kanoj Sarcar
1999-06-21 11:25 ` Stephen C. Tweedie
1999-06-21 16:46   ` Kanoj Sarcar
1999-06-21 16:57     ` Stephen C. Tweedie
1999-06-21 17:36       ` Kanoj Sarcar
1999-06-21 17:49         ` Stephen C. Tweedie
1999-06-21 18:46           ` Kanoj Sarcar
1999-06-21 23:44             ` Kanoj Sarcar
1999-06-24 22:23               ` Andrea Arcangeli
1999-06-24 23:55                 ` Kanoj Sarcar
1999-06-25  0:26                   ` Andrea Arcangeli
1999-06-28  1:48                     ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Kanoj Sarcar
1999-06-28 10:35                       ` Andrea Arcangeli
1999-06-28 17:11                         ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar
1999-06-28 16:32                       ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie
1999-06-28 17:25                         ` Kanoj Sarcar
1999-06-28 20:40                           ` Stephen C. Tweedie
1999-06-28 21:11                             ` Kanoj Sarcar
1999-06-28 22:12                               ` Stephen C. Tweedie
1999-06-28 23:43                                 ` Kanoj Sarcar
1999-06-29 11:44                                   ` Stephen C. Tweedie
1999-06-29 22:01                                     ` Kanoj Sarcar
1999-06-30 17:28                                       ` Stephen C. Tweedie
1999-06-30 18:05                                         ` Kanoj Sarcar
1999-06-28 19:39                       ` Chuck Lever
1999-06-28 19:55                         ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar
1999-06-28 20:33                           ` Chuck Lever
1999-06-28 20:51                             ` Kanoj Sarcar
1999-06-28 21:32                               ` Chuck Lever
1999-06-28 21:38                                 ` Kanoj Sarcar
1999-06-28 21:50                                   ` Chuck Lever
1999-06-28 22:15                                     ` Kanoj Sarcar
1999-06-29 11:23                                       ` Stephen C. Tweedie
1999-06-29 17:36                                         ` Kanoj Sarcar
1999-06-28 22:22                                   ` Stephen C. Tweedie
1999-06-28 22:21                                 ` Stephen C. Tweedie
1999-06-28 22:57                                   ` Andrea Arcangeli
1999-06-29  2:13                                     ` Chuck Lever
1999-06-29 12:01                                       ` Stephen C. Tweedie
1999-06-29 12:32                                         ` Andrea Arcangeli
1999-06-30 15:59                                           ` Stephen C. Tweedie
1999-06-29  1:00                                   ` Chuck Lever
1999-06-28 22:08                               ` Stephen C. Tweedie
1999-06-28 22:59                                 ` Andrea Arcangeli
1999-06-29  0:53                                 ` Chuck Lever
1999-06-29 11:14                                   ` Stephen C. Tweedie
1999-06-28 22:09                           ` Stephen C. Tweedie
1999-06-28 20:45                         ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie
1999-06-28 21:14                           ` Chuck Lever
1999-06-28 21:25                             ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Kanoj Sarcar
1999-06-28 22:15                             ` filecache/swapcache questions [RFC] [RFT] [PATCH] kanoj-mm12-2.3.8 Fix swapoff races Stephen C. Tweedie
1999-06-28 22:48                             ` Andrea Arcangeli
1999-06-29  1:29                               ` Chuck Lever
1999-06-29 11:58                                 ` Stephen C. Tweedie
1999-06-29 12:09                                 ` Andrea Arcangeli
1999-06-29 15:27                                   ` Chuck Lever
1999-06-29 11:55                               ` Stephen C. Tweedie
1999-06-29 20:08                               ` Andrea Arcangeli
1999-06-28 22:36             ` filecache/swapcache questions Stephen C. Tweedie
1999-06-28 23:24               ` Kanoj Sarcar
