Re: patch for 2.1.102 swap code

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: patch for 2.1.102 swap code
       [not found] <356478F0.FE1C378F@star.net>
@ 1998-05-24 17:28 ` Stephen C. Tweedie
  1998-05-25 10:07   ` David S. Miller
                     ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Stephen C. Tweedie @ 1998-05-24 17:28 UTC (permalink / raw)
  To: Bill Hawes
  Cc: Stephen C. Tweedie, Linux Kernel List, Linus Torvalds, linux-mm,
	Alan Cox

Hi Bill,

On Thu, 21 May 1998 14:56:48 -0400, Bill Hawes <whawes@star.net> said:

> I've been reviewing the swap code

Thanks!

>  and spotted a couple of prolems, which the attached patch should take
> care of.

> In read_swap_cache there was a race condition that could lead to pages
> being left locked if the swap entry becomes unused while a process is
> trying to read it. The low level rw_read_swap() routine bails out if the
> entry isn't in use at the time, which would leave the page locked. The
> patch avoids this problem by calling swap_duplicate() before checking
> for the page.

I don't think this should happen: do you have solid evidence that it is
a problem?  The code in read_swap_cache_async() does 

	if (!add_to_swap_cache(new_page, entry)) {
		free_page(new_page_addr);
		return 0;
	}
	swap_duplicate(entry);		/* Account for the swap cache */
	set_bit(PG_locked, &new_page->flags);
	rw_swap_page(READ, entry, (char *) new_page_addr, wait);

which should guarantee a used entry while the IO is in progress.  Even
if the only use of the entry is the swap cache, it should still be
there, and because the page is locked during the IO, it should not be
possible for the swap cache reference to be removed before the check in
rw_swap_page().  There may well be a more subtle problem here, but I
don't think the solution is yet another swap_duplicate(), and I haven't
seen reports that might suggest we're losing an unlock in the swapping
code anywhere.

> In try_to_unuse_page there were some problems with swap counts still
> non-zero after replacing all of the process references to a page,
> apparently due to the swap map count being elevated while swapping is in
> progress. (It shows up if a swapoff command is run while the system is
> swapping heavily.) I've modified the code to make multiple passes in the
> event that pages are still in use, and to report EBUSY if the counts
> can't all be cleared.

Hmm.  That shouldn't be a problem if everything is working correctly.
However, your first change (the extra swap_duplicate) will leave the
swap count elevated while swapin is occurring, and that could certainly
lead to this symptom in swapoff().  Does the swapoff problem still occur
on an unmodified kernel?

> In rw_swap_page there was an atomic_dec() of the page count after a sync
> read or write, and I think it's possible that the page could have become
> unused while waiting for the I/O operation. I've changed the atomic_dec
> into a free_page() and added a printk to show whether it actually occurs
> in practice.

There is also a matching atomic_inc() up above.  All swapout is done in
try_to_swap_out(), which doesn't do a free_page() to match the unhook of
the pte until after the rw_swap_page completes.  Swapin should all be
done via the swap cache now, and that will also guarantee an extra
reference against the page for as long as rw_swap_page is running.
However, there are a few borderline cases such as trying to remove swap
cache from a locked page which I'll have a check for, as they might make
this dangerous.

> The patch also includes a few minor cleanups for unused code and
> prototypes in swap.h. I've tested it here and it seems to work OK, but
> would like some further testing. Also, it probably won't help with the
> "found a writable swap page" message reported recently; I'm continuing
> to look for that problem.

Me too, and I haven't found anything definitely incriminating so far.
The one thing I _have_ found is a day-one threads bug in anonymous
page-in.  do_no_page() handles write accesses to demand-zero memory
with:

anonymous_page:
	entry = pte_wrprotect(mk_pte(ZERO_PAGE, vma->vm_page_prot));
	if (write_access) {
		unsigned long page = __get_free_page(GFP_KERNEL);
		if (!page)
			goto sigbus;
		clear_page(page);
		entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
		vma->vm_mm->rss++;
		tsk->min_flt++;
		flush_page_to_ram(page);
	}
	put_page(page_table, entry);

The __get_free_page() may block, however, and in a threaded environment
this will cause the loss of user data plus a memory leak if two threads
hit this race.  However, I don't think it's related to the current
writable cached page problems.

Could you cast your eyes over the patch below?  It builds fine and
passes the tests I've thrown at it so far, but I'd like a second opinion
before forwarding it as a patch for 2.0.

--Stephen

----------------------------------------------------------------
Index: mm/memory.c
===================================================================
RCS file: /home/rcs/CVS/kswap3/linux/mm/memory.c,v
retrieving revision 1.1.2.2
diff -u -r1.1.2.2 memory.c
--- memory.c	1998/03/08 16:32:57	1.1.2.2
+++ memory.c	1998/05/24 17:19:07
@@ -839,9 +839,14 @@
 anonymous_page:
 	entry = pte_wrprotect(mk_pte(ZERO_PAGE, vma->vm_page_prot));
 	if (write_access) {
+		pte_t old_entry = *page_table; 
 		unsigned long page = __get_free_page(GFP_KERNEL);
 		if (!page)
 			goto sigbus;
+		if (pte_val(old_entry) != pte_val(*page_table)) {
+			free_page(page);
+			return;
+		}
 		clear_page(page);
 		entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
 		vma->vm_mm->rss++;

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: patch for 2.1.102 swap code
  1998-05-24 17:28 ` patch for 2.1.102 swap code Stephen C. Tweedie
@ 1998-05-25 10:07   ` David S. Miller
  1998-05-25 12:38   ` Eric W. Biederman
  1998-05-25 12:52   ` Bill Hawes
  2 siblings, 0 replies; 13+ messages in thread
From: David S. Miller @ 1998-05-25 10:07 UTC (permalink / raw)
  To: sct; +Cc: whawes, linux-kernel, torvalds, linux-mm, number6

   Date: 	Sun, 24 May 1998 18:28:48 +0100
   From: "Stephen C. Tweedie" <sct@dcs.ed.ac.uk>

   The __get_free_page() may block, however, and in a threaded
   environment this will cause the loss of user data plus a memory
   leak if two threads hit this race.  However, I don't think it's
   related to the current writable cached page problems.

The mmap semaphore is held, it cannot happen.

Later,
David S. Miller
davem@dm.cobaltmicro.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: patch for 2.1.102 swap code
  1998-05-24 17:28 ` patch for 2.1.102 swap code Stephen C. Tweedie
  1998-05-25 10:07   ` David S. Miller
@ 1998-05-25 12:38   ` Eric W. Biederman
  1998-05-26 21:52     ` Stephen C. Tweedie
  1998-05-25 12:52   ` Bill Hawes
  2 siblings, 1 reply; 13+ messages in thread
From: Eric W. Biederman @ 1998-05-25 12:38 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Bill Hawes, linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@dcs.ed.ac.uk> writes:

ST> On Thu, 21 May 1998 14:56:48 -0400, Bill Hawes <whawes@star.net> said:
>> In try_to_unuse_page there were some problems with swap counts still
>> non-zero after replacing all of the process references to a page,
>> apparently due to the swap map count being elevated while swapping is in
>> progress. (It shows up if a swapoff command is run while the system is
>> swapping heavily.) I've modified the code to make multiple passes in the
>> event that pages are still in use, and to report EBUSY if the counts
>> can't all be cleared.

ST> Hmm.  That shouldn't be a problem if everything is working correctly.
ST> However, your first change (the extra swap_duplicate) will leave the
ST> swap count elevated while swapin is occurring, and that could certainly
ST> lead to this symptom in swapoff().  Does the swapoff problem still occur
ST> on an unmodified kernel?

Note: there is a problem with swapoff that should at least be considered.
If you use have a SYSV shared memory, and don't map it into a process,
and that memory get's swapped out, swapoff will not be able to find it.

This is a very long standing bug and appears not to be a problem in practice.
But it is certainly a potential problem.

Eric

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: patch for 2.1.102 swap code
  1998-05-25 12:38   ` Eric W. Biederman
@ 1998-05-26 21:52     ` Stephen C. Tweedie
  1998-06-11 14:31       ` Eric W. Biederman
  0 siblings, 1 reply; 13+ messages in thread
From: Stephen C. Tweedie @ 1998-05-26 21:52 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Stephen C. Tweedie, Bill Hawes, linux-mm

Hi,

> Note: there is a problem with swapoff that should at least be considered.
> If you use have a SYSV shared memory, and don't map it into a process,
> and that memory get's swapped out, swapoff will not be able to find it.

> This is a very long standing bug and appears not to be a problem in practice.
> But it is certainly a potential problem.

Thanks; it's added to my list.

--Stephen

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: patch for 2.1.102 swap code
  1998-05-26 21:52     ` Stephen C. Tweedie
@ 1998-06-11 14:31       ` Eric W. Biederman
  1998-06-12 21:29         ` Stephen C. Tweedie
  0 siblings, 1 reply; 13+ messages in thread
From: Eric W. Biederman @ 1998-06-11 14:31 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@dcs.ed.ac.uk> writes:

ST> Hi,
>> Note: there is a problem with swapoff that should at least be considered.
>> If you use have a SYSV shared memory, and don't map it into a process,
>> and that memory get's swapped out, swapoff will not be able to find it.

>> This is a very long standing bug and appears not to be a problem in practice.
>> But it is certainly a potential problem.

ST> Thanks; it's added to my list.

Here is a preliminary patch that should fix the problem.


diff -uNrX /home/eric/projects/linux/linux-ignore-files linux-2.1.101.x0/include/linux/swap.h linux-2.1.101.x2/include/linux/swap.h
--- linux-2.1.101.x0/include/linux/swap.h	Wed Apr 22 11:08:16 1998
+++ linux-2.1.101.x2/include/linux/swap.h	Wed Jun  3 22:21:51 1998
@@ -75,7 +75,18 @@
 void si_swapinfo(struct sysinfo *);
 unsigned long get_swap_page(void);
 extern void FASTCALL(swap_free(unsigned long));
+  
+  /* So that external drivers can use swap, swapoff */
+struct swap_unuse {
+	int (*func)(unsigned int type, void *arg);
+	void *arg;
+	struct swap_unuse *prev;
+	struct swap_unuse *next;
+};
 
+void register_swap_unuse_function (struct swap_unuse *swap_unuse);
+void unregister_swap_unuse_function (struct swap_unuse *swap_unuse);
+ 
 /*
  * vm_ops not present page codes for shared memory.
  *
diff -uNrX /home/eric/projects/linux/linux-ignore-files linux-2.1.101.x0/ipc/shm.c linux-2.1.101.x2/ipc/shm.c
--- linux-2.1.101.x0/ipc/shm.c	Wed Jun  3 22:54:23 1998
+++ linux-2.1.101.x2/ipc/shm.c	Thu Jun  4 09:23:28 1998
@@ -30,6 +30,8 @@
 static void shm_open (struct vm_area_struct *shmd);
 static void shm_close (struct vm_area_struct *shmd);
 static pte_t shm_swap_in(struct vm_area_struct *, unsigned long, unsigned long);
+static int shm_try_to_unuse_seg(struct shmid_ds *shp, unsigned int type);
+static int shm_try_to_unuse(unsigned int type, void *arg);
 
 static int shm_tot = 0; /* total number of shared memory pages */
 static int shm_rss = 0; /* number of shared memory pages that are in memory */
@@ -45,6 +47,11 @@
 static ulong swap_successes = 0;
 static ulong used_segs = 0;
 
+/* swap off */
+static struct swap_unuse shm_swap_unuse = {
+	shm_try_to_unuse, 0, 0, 0
+};
+
 __initfunc(void shm_init (void))
 {
 	int id;
@@ -53,6 +60,7 @@
 		shm_segs[id] = (struct shmid_ds *) IPC_UNUSED;
 	shm_tot = shm_rss = shm_seq = max_shmid = used_segs = 0;
 	shm_lock = NULL;
+	register_swap_unuse_function(&shm_swap_unuse);
 	return;
 }
 
@@ -830,4 +838,56 @@
 	shm_swp++;
 	shm_rss--;
 	return 1;
+}
+
+/* Swap off support */
+static int shm_try_to_unuse_seg(struct shmid_ds *shp, unsigned int type)
+{
+	unsigned long page = 0;
+	int i, numpages;
+	numpages = shp->shm_npages;
+	for (i = 0; i < numpages; i++) {
+		pte_t pte;
+		if (!page) {
+			page = get_free_page(GFP_KERNEL);
+		}
+		pte = __pte(shp->shm_pages[i]);
+		if (pte_none(pte))
+			continue;
+		if (pte_present(pte))
+			continue;
+		if (!page) 
+			return -1;
+		rw_swap_page_nocache(READ, pte_val(pte), (char *)page);
+		pte = __pte(shp->shm_pages[i]);
+		if (!pte_present(pte)) {
+			swap_free(pte_val(pte));
+			shm_swp--;
+			shm_rss++;
+			pte = pte_mkdirty(mk_pte(page, PAGE_SHARED));
+			shp->shm_pages[i] = pte_val(pte);
+		}
+	}
+	if (page) {
+		free_page(page);
+		page = 0;
+	}
+	return 0;
+}
+
+static int shm_try_to_unuse(unsigned int type, void *arg)
+{
+	int id;
+	struct shmid_ds *ident;
+	for(id = 0; id < SHMMNI; id++) {
+		ident = shm_segs[id];
+		if ((ident != (struct shmid_ds *) IPC_UNUSED) && 
+		    (ident != (struct shmid_ds *) IPC_NOID)) {
+			int result;
+			result = shm_try_to_unuse_seg(ident, type);
+			if (result != 0) 
+				return -1;
+		}
+	}
+	return 0;
 }
diff -uNrX /home/eric/projects/linux/linux-ignore-files linux-2.1.101.x0/mm/Makefile linux-2.1.101.x2/mm/Makefile
--- linux-2.1.101.x0/mm/Makefile	Tue May 12 14:17:54 1998
+++ linux-2.1.101.x2/mm/Makefile	Wed Jun  3 22:21:51 1998
@@ -12,4 +12,6 @@
 	    vmalloc.o slab.o \
 	    swap.o vmscan.o page_io.o page_alloc.o swap_state.o swapfile.o
 
+OX_OBJS := swap_syms.o
+
 include $(TOPDIR)/Rules.make
diff -uNrX /home/eric/projects/linux/linux-ignore-files linux-2.1.101.x0/mm/swap_syms.c linux-2.1.101.x2/mm/swap_syms.c
--- linux-2.1.101.x0/mm/swap_syms.c	Wed Dec 31 18:00:00 1969
+++ linux-2.1.101.x2/mm/swap_syms.c	Wed Jun  3 22:21:51 1998
@@ -0,0 +1,15 @@
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/swap.h>
+
+/* Explicit swapping */
+EXPORT_SYMBOL(rw_swap_page);
+EXPORT_SYMBOL(rw_swap_page_nocache);
+EXPORT_SYMBOL(swap_free);
+EXPORT_SYMBOL(get_swap_page);
+EXPORT_SYMBOL(si_swapinfo);
+EXPORT_SYMBOL(register_swap_unuse_function);
+EXPORT_SYMBOL(unregister_swap_unuse_function);
+EXPORT_SYMBOL(swapper_inode);
diff -uNrX /home/eric/projects/linux/linux-ignore-files linux-2.1.101.x0/mm/swapfile.c linux-2.1.101.x2/mm/swapfile.c
--- linux-2.1.101.x0/mm/swapfile.c	Tue May 12 14:17:54 1998
+++ linux-2.1.101.x2/mm/swapfile.c	Wed Jun  3 22:21:51 1998
@@ -298,7 +298,7 @@
  * and then search for the process using it.  All the necessary
  * page table adjustments can then be made atomically.
  */
-static int try_to_unuse(unsigned int type)
+static int try_to_unuse_processes(unsigned int type, void *dummy)
 {
 	struct swap_info_struct * si = &swap_info[type];
 	struct task_struct *p;
@@ -345,6 +345,69 @@
 		}
 	}
 	return 0;
+}
+
+struct swap_unuse unuse_processes =
+{
+	try_to_unuse_processes, 
+	NULL,
+	&unuse_processes, 
+	&unuse_processes
+};
+
+/* Don't add or remove unuse functions while a swapoff is in progress.
+ * It reduces some theoretical races.
+ */
+static int swap_unuse_lock = 0;
+static struct wait_queue *swap_unuse_wait = NULL;
+
+static int try_to_unuse(unsigned int type)
+{
+	int error = 0;
+	struct swap_unuse *unuse_func, *next_func;
+	next_func = unuse_processes.next;
+	swap_unuse_lock = 1;
+	do {
+		unuse_func = next_func;
+		next_func = unuse_func->next;
+		error = unuse_func->func(type, unuse_func->arg);
+		if (error) {
+			return error;
+		}
+	} while (unuse_func != &unuse_processes);
+	swap_unuse_lock = 0;
+	wake_up(&swap_unuse_wait);
+	return 0;
+}
+
+void register_swap_unuse_function (struct swap_unuse *swap_unuse)
+{
+	if (!swap_unuse) {
+		return;
+	}
+	while (swap_unuse_lock) {
+		sleep_on(&swap_unuse_wait);
+	}
+	swap_unuse->prev = &unuse_processes;
+	swap_unuse->next = unuse_processes.next;
+	unuse_processes.next->prev = swap_unuse;
+	unuse_processes.next = swap_unuse;
+}
+
+void unregister_swap_unuse_function (struct swap_unuse *swap_unuse)
+{
+	struct swap_unuse *next;
+	if (!swap_unuse) {
+		return;
+	}
+	while (swap_unuse_lock) {
+		sleep_on(&swap_unuse_wait);
+	}
+	next = swap_unuse->next;
+	next->prev = swap_unuse->prev;
+	next->prev->next = next;
+
+	swap_unuse->next = swap_unuse->prev = NULL;
 }
 
 asmlinkage int sys_swapoff(const char * specialfile)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: patch for 2.1.102 swap code
  1998-06-11 14:31       ` Eric W. Biederman
@ 1998-06-12 21:29         ` Stephen C. Tweedie
  0 siblings, 0 replies; 13+ messages in thread
From: Stephen C. Tweedie @ 1998-06-12 21:29 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Stephen C. Tweedie, linux-mm

Hi,

On 11 Jun 1998 09:31:22 -0500, ebiederm+eric@npwt.net (Eric W. Biederman) said:

>>>>>> "ST" == Stephen C Tweedie <sct@dcs.ed.ac.uk> writes:
ST> Hi,
>>> Note: there is a problem with swapoff that should at least be considered.
>>> If you use have a SYSV shared memory, and don't map it into a process,
>>> and that memory get's swapped out, swapoff will not be able to find it.

ST> Thanks; it's added to my list.

> Here is a preliminary patch that should fix the problem.

Thanks; it's queued for attention.  I'm off to Usenix tomorrow, but
I'll be doing lots and lots of VM stuff when I get back (I'm going
full time courtesy of Red Hat --- yay!), so I'll check and test and
then submit to Linus if all looks OK.

--Stephen

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: patch for 2.1.102 swap code
  1998-05-24 17:28 ` patch for 2.1.102 swap code Stephen C. Tweedie
  1998-05-25 10:07   ` David S. Miller
  1998-05-25 12:38   ` Eric W. Biederman
@ 1998-05-25 12:52   ` Bill Hawes
  1998-05-25 13:42     ` David S. Miller
  1998-05-26 21:38     ` Stephen C. Tweedie
  2 siblings, 2 replies; 13+ messages in thread
From: Bill Hawes @ 1998-05-25 12:52 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Linux Kernel List, Linus Torvalds, linux-mm, Alan Cox

Stephen C. Tweedie wrote:

> I don't think this should happen: do you have solid evidence that it is
> a problem?  The code in read_swap_cache_async() does
> 
>         if (!add_to_swap_cache(new_page, entry)) {
>                 free_page(new_page_addr);
>                 return 0;
>         }
>         swap_duplicate(entry);          /* Account for the swap cache */
>         set_bit(PG_locked, &new_page->flags);
>         rw_swap_page(READ, entry, (char *) new_page_addr, wait);
> 
> which should guarantee a used entry while the IO is in progress.  Even
> if the only use of the entry is the swap cache, it should still be
> there, and because the page is locked during the IO, it should not be
> possible for the swap cache reference to be removed before the check in
> rw_swap_page().  There may well be a more subtle problem here, but I
> don't think the solution is yet another swap_duplicate(), and I haven't
> seen reports that might suggest we're losing an unlock in the swapping
> code anywhere.

Hi Stephen,

The problem with the swap entry being unused could occur before reaching
the code above. If the swap cache lookup fails, the process will have to
allocate a page and may block, allowing multiple processes to block on
get_free_page. Then the process that completes first could end up
releasing the page and swap cache, so that when the other processes wake
up from get_free_page the swap entry will no longer be valid. In the
above code sequence, the add_to_swap_cache will succeed, swap_duplicate
will complain and fail, and rw_swap_cache will complain and return
leaving the page locked. There would be warning messages of the problem,
but I'd rather avoid it in the first place.

I'm reasonably certain this sequence of events could occur, and other
places in the kernel (e.g. swap_in) have tests for the case where
multiple processes try to read in the same swap page. The changes in my
patch will prevent the swap entry from disappearing when we want to read
it in, and will keep read_swap_cache from returning spurious failures.

> > In try_to_unuse_page there were some problems with swap counts still
> > non-zero after replacing all of the process references to a page,
> > apparently due to the swap map count being elevated while swapping is in
> > progress. (It shows up if a swapoff command is run while the system is
> > swapping heavily.) I've modified the code to make multiple passes in the
> > event that pages are still in use, and to report EBUSY if the counts
> > can't all be cleared.
> 
> Hmm.  That shouldn't be a problem if everything is working correctly.
> However, your first change (the extra swap_duplicate) will leave the
> swap count elevated while swapin is occurring, and that could certainly
> lead to this symptom in swapoff().  Does the swapoff problem still occur
> on an unmodified kernel?

Yes, I've seen the problem before making the other changes, and there
have been some problem reports on the kernel list. 

> There is also a matching atomic_inc() up above.  All swapout is done in
> try_to_swap_out(), which doesn't do a free_page() to match the unhook of
> the pte until after the rw_swap_page completes.  Swapin should all be
> done via the swap cache now, and that will also guarantee an extra
> reference against the page for as long as rw_swap_page is running.
> However, there are a few borderline cases such as trying to remove swap
> cache from a locked page which I'll have a check for, as they might make
> this dangerous.

I'm less certain of the possibility of the page being unused in this
case, but in any event replacing the atomic_dec() with a free_page seems
prudent to me, as there have been a number of other kernel memory leaks
caused by an atomic_dec instead of a free_page. But at the very least we
should put the printk warning there so that if the problem does occur it
can be corrected in the future.

> Me too, and I haven't found anything definitely incriminating so far.
> The one thing I _have_ found is a day-one threads bug in anonymous
> page-in.  do_no_page() handles write accesses to demand-zero memory
> with:
> 
> anonymous_page:
>         entry = pte_wrprotect(mk_pte(ZERO_PAGE, vma->vm_page_prot));
>         if (write_access) {
>                 unsigned long page = __get_free_page(GFP_KERNEL);
>                 if (!page)
>                         goto sigbus;
>                 clear_page(page);
>                 entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
>                 vma->vm_mm->rss++;
>                 tsk->min_flt++;
>                 flush_page_to_ram(page);
>         }
>         put_page(page_table, entry);
> 
> The __get_free_page() may block, however, and in a threaded environment
> this will cause the loss of user data plus a memory leak if two threads
> hit this race.  However, I don't think it's related to the current
> writable cached page problems.
> 
> Could you cast your eyes over the patch below?  It builds fine and
> passes the tests I've thrown at it so far, but I'd like a second opinion
> before forwarding it as a patch for 2.0.

The patch looks reasonable to me, but as DaveM mentioned in a later
mail, the
do_wp_page case is supposed to be protected with a semaphore.

Regards,
Bill

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: patch for 2.1.102 swap code
  1998-05-25 12:52   ` Bill Hawes
@ 1998-05-25 13:42     ` David S. Miller
  1998-05-26 18:00       ` Stephen C. Tweedie
  1998-05-26 21:38     ` Stephen C. Tweedie
  1 sibling, 1 reply; 13+ messages in thread
From: David S. Miller @ 1998-05-25 13:42 UTC (permalink / raw)
  To: whawes; +Cc: sct, linux-kernel, torvalds, linux-mm, number6

   Date: 	Mon, 25 May 1998 08:52:46 -0400
   From: Bill Hawes <whawes@star.net>

   > Could you cast your eyes over the patch below?  It builds fine
   > and passes the tests I've thrown at it so far, but I'd like a
   > second opinion before forwarding it as a patch for 2.0.

   The patch looks reasonable to me, but as DaveM mentioned in a later
   mail, the do_wp_page case is supposed to be protected with a
   semaphore.

Alas, I thought about this some more.  And one piece of code needs to
be fixed for this invariant about the semaphore being held in the
fault processing code paths to be true everywhere... ptrace()...

Later,
David S. Miller
davem@dm.cobaltmicro.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: patch for 2.1.102 swap code
  1998-05-25 13:42     ` David S. Miller
@ 1998-05-26 18:00       ` Stephen C. Tweedie
  0 siblings, 0 replies; 13+ messages in thread
From: Stephen C. Tweedie @ 1998-05-26 18:00 UTC (permalink / raw)
  To: David S. Miller; +Cc: whawes, sct, linux-kernel, torvalds, linux-mm, number6

Hi,

On Mon, 25 May 1998 06:42:53 -0700, "David S. Miller"
<davem@dm.cobaltmicro.com> said:

> Alas, I thought about this some more.  And one piece of code needs to
> be fixed for this invariant about the semaphore being held in the
> fault processing code paths to be true everywhere... ptrace()...

Yep --- I was just about to reply to your last mail with this point when
I got your follow-up.  I've also had one report that the writable cached
page reports started when debugging an electric-fenced binary under
gdb.  Has anyody seen these vm messages who has definitely NOT been
running gdb?

There's also the point that the whole swapout code munges page tables
without ever taking the mm semaphore, but that case ought to be
protected by the combination of (a) having the kernel spinlock and (b)
never stalling between starting a vma walk and modifying the pte.  (The
swapout code is pretty paranoid about this.)  However, I'm not
absolutely 100% sure that we don't have any unfortunate races left by
this exception.  (For example, do we ever protect a vma by the mm
semaphore without also doing a lock_kernel()?)

--Stephen

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: patch for 2.1.102 swap code
  1998-05-25 12:52   ` Bill Hawes
  1998-05-25 13:42     ` David S. Miller
@ 1998-05-26 21:38     ` Stephen C. Tweedie
  1998-05-26 21:46       ` Rik van Riel
  1998-05-27 15:27       ` Bill Hawes
  1 sibling, 2 replies; 13+ messages in thread
From: Stephen C. Tweedie @ 1998-05-26 21:38 UTC (permalink / raw)
  To: Bill Hawes
  Cc: Stephen C. Tweedie, Linux Kernel List, Linus Torvalds, linux-mm,
	Alan Cox

Hi Bill,

On Mon, 25 May 1998 08:52:46 -0400, Bill Hawes <whawes@star.net> said:

> The problem with the swap entry being unused could occur before reaching
> the code above. If the swap cache lookup fails, the process will have to
> allocate a page and may block, allowing multiple processes to block on
> get_free_page. Then the process that completes first could end up
> releasing the page and swap cache, 
> ...

That's why read_swap_cache_async repeats the initial entry lookup after
calling __get_free_page().  Unfortunately, I hadn't realised that
swap_duplicate() had the error check against swap_map[entry]==0.  Moving
the swap_duplicate up to before the call to __get_free_page should avoid
that case.

>> > In try_to_unuse_page there were some problems with swap counts still
>> > non-zero after replacing all of the process references to a page,
>> > ...

>> Hmm.  That shouldn't be a problem if everything is working correctly.
>> Does the swapoff problem still occur on an unmodified kernel?

> Yes, I've seen the problem before making the other changes, and there
> have been some problem reports on the kernel list. 

Excellent --- that should mean it's easy to reproduce, and I've got a
test box set up to do tracing on all this code.  Is there anything in
particular you do to trigger the situation?  I've been going over the
obvious places in try_to_swap_out and friends, but haven't found
anything yet where we might block between updating a pte and modifying
the corresponding pte count.

>> There is also a matching atomic_inc() up above.  All swapout is done in
>> try_to_swap_out(), ...

> I'm less certain of the possibility of the page being unused in this
> case, but in any event replacing the atomic_dec() with a free_page seems
> prudent to me, as there have been a number of other kernel memory leaks
> caused by an atomic_dec instead of a free_page. But at the very least we
> should put the printk warning there so that if the problem does occur it
> can be corrected in the future.

Yes, the printk will help.  Swap should _definitely_ be done through the
swap cache, and anything which violates this is broken, so although
there should never be a chance of freeing the page, the warning is a
useful thing to have.

>> Me too, and I haven't found anything definitely incriminating so far.
>> The one thing I _have_ found is a day-one threads bug in anonymous
>> page-in.  do_no_page() handles write accesses to demand-zero memory
>> with:
>> ...

> The patch looks reasonable to me, but as DaveM mentioned in a later
> mail, the do_wp_page case is supposed to be protected with a
> semaphore.

Yep, I've responded to that separately.

--Stephen

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: patch for 2.1.102 swap code
  1998-05-26 21:38     ` Stephen C. Tweedie
@ 1998-05-26 21:46       ` Rik van Riel
  1998-06-02 22:21         ` Stephen C. Tweedie
  1998-05-27 15:27       ` Bill Hawes
  1 sibling, 1 reply; 13+ messages in thread
From: Rik van Riel @ 1998-05-26 21:46 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Bill Hawes, Linux Kernel List, linux-mm

On Tue, 26 May 1998, Stephen C. Tweedie wrote:

> That's why read_swap_cache_async repeats the initial entry lookup after
> calling __get_free_page().  Unfortunately, I hadn't realised that
> swap_duplicate() had the error check against swap_map[entry]==0.  Moving
> the swap_duplicate up to before the call to __get_free_page should avoid
> that case.

Hmm, could read_swap_cache_async() be used to implement swap
readahead?

Rik.
+-------------------------------------------+--------------------------+
| Linux: - LinuxHQ MM-patches page          | Scouting       webmaster |
|        - kswapd ask-him & complain-to guy | Vries    cubscout leader |
|     http://www.phys.uu.nl/~riel/          | <H.H.vanRiel@phys.uu.nl> |
+-------------------------------------------+--------------------------+

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: patch for 2.1.102 swap code
  1998-05-26 21:46       ` Rik van Riel
@ 1998-06-02 22:21         ` Stephen C. Tweedie
  0 siblings, 0 replies; 13+ messages in thread
From: Stephen C. Tweedie @ 1998-06-02 22:21 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Stephen C. Tweedie, Bill Hawes, Linux Kernel List, linux-mm

Hi Rik,

On Tue, 26 May 1998 23:46:35 +0200 (MET DST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> Hmm, could read_swap_cache_async() be used to implement swap
> readahead?

Absolutely, it was designed with that in mind.  It's a bit close to
2.1 to actually use it, but I'll be doing a lot of work to tighten
things up in 2.3 around the swap code, and readahead will be one
component of that.

--Stephen

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: patch for 2.1.102 swap code
  1998-05-26 21:38     ` Stephen C. Tweedie
  1998-05-26 21:46       ` Rik van Riel
@ 1998-05-27 15:27       ` Bill Hawes
  1 sibling, 0 replies; 13+ messages in thread
From: Bill Hawes @ 1998-05-27 15:27 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Linux Kernel List, Linus Torvalds, linux-mm, Alan Cox

Stephen C. Tweedie wrote:

> That's why read_swap_cache_async repeats the initial entry lookup after
> calling __get_free_page().  Unfortunately, I hadn't realised that
> swap_duplicate() had the error check against swap_map[entry]==0.  Moving
> the swap_duplicate up to before the call to __get_free_page should avoid
> that case.

Hi Stephen,

Moving the swap_duplicate() call above the get_free_page() helps, but
does not entirely avoid the race: it's possible for lookup_swap_cache()
to block on a locked page, and when the process wakes up the swap entry
may have disappeared. In order for read_swap_cache to fulfill its
contract ("this swap entry exists, go get it for me") it must increment
the swap map count before any blocking operation. Hence I moved the
swap_duplicate() call above the lookup.

I could check for the case of finding the unlocked swap cache page, and
only increment the map count if a wait was needed; this would avoid
having to increment and decrement the map count if the page is found, at
the expense of a little more complexity. I'll post a mopdified patch for
comment ...

> Excellent --- that should mean it's easy to reproduce, and I've got a
> test box set up to do tracing on all this code.  Is there anything in
> particular you do to trigger the situation?  I've been going over the
> obvious places in try_to_swap_out and friends, but haven't found
> anything yet where we might block between updating a pte and modifying
> the corresponding pte count.

I've observed the swapoff messages after running swapoff on a quiescent
system that had been swapping heavily previously, and also when running
swapoff with the system currently swapping. Try setting up a condition
of heavy swapping but with adequate memory available (e.g. two kernel
compiles in 32M), and then cycle swapoff -a and swapon -a.

Running swapoff is a good test for the VM system; unfortunately the
current kernels have an unrelated problem with kswapd trying too hard to
keep memory blocks, so that swapoff may put the system in an
unrecoverable kswapd loop. If you try swapoff when the system doesn't
have enough memory, the system will lock rather than let swapoff return
failure. But this is a problem of swap policy rather than mechanism, and
we can try to fix it after the swap mechanics are 100% solid.

Regards,
Bill

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~1998-06-12 21:29 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <356478F0.FE1C378F@star.net>
1998-05-24 17:28 ` patch for 2.1.102 swap code Stephen C. Tweedie
1998-05-25 10:07   ` David S. Miller
1998-05-25 12:38   ` Eric W. Biederman
1998-05-26 21:52     ` Stephen C. Tweedie
1998-06-11 14:31       ` Eric W. Biederman
1998-06-12 21:29         ` Stephen C. Tweedie
1998-05-25 12:52   ` Bill Hawes
1998-05-25 13:42     ` David S. Miller
1998-05-26 18:00       ` Stephen C. Tweedie
1998-05-26 21:38     ` Stephen C. Tweedie
1998-05-26 21:46       ` Rik van Riel
1998-06-02 22:21         ` Stephen C. Tweedie
1998-05-27 15:27       ` Bill Hawes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox