* Re: patch for 2.1.102 swap code [not found] <356478F0.FE1C378F@star.net> @ 1998-05-24 17:28 ` Stephen C. Tweedie 1998-05-25 10:07 ` David S. Miller ` (2 more replies) 0 siblings, 3 replies; 13+ messages in thread From: Stephen C. Tweedie @ 1998-05-24 17:28 UTC (permalink / raw) To: Bill Hawes Cc: Stephen C. Tweedie, Linux Kernel List, Linus Torvalds, linux-mm, Alan Cox Hi Bill, On Thu, 21 May 1998 14:56:48 -0400, Bill Hawes <whawes@star.net> said: > I've been reviewing the swap code Thanks! > and spotted a couple of prolems, which the attached patch should take > care of. > In read_swap_cache there was a race condition that could lead to pages > being left locked if the swap entry becomes unused while a process is > trying to read it. The low level rw_read_swap() routine bails out if the > entry isn't in use at the time, which would leave the page locked. The > patch avoids this problem by calling swap_duplicate() before checking > for the page. I don't think this should happen: do you have solid evidence that it is a problem? The code in read_swap_cache_async() does if (!add_to_swap_cache(new_page, entry)) { free_page(new_page_addr); return 0; } swap_duplicate(entry); /* Account for the swap cache */ set_bit(PG_locked, &new_page->flags); rw_swap_page(READ, entry, (char *) new_page_addr, wait); which should guarantee a used entry while the IO is in progress. Even if the only use of the entry is the swap cache, it should still be there, and because the page is locked during the IO, it should not be possible for the swap cache reference to be removed before the check in rw_swap_page(). There may well be a more subtle problem here, but I don't think the solution is yet another swap_duplicate(), and I haven't seen reports that might suggest we're losing an unlock in the swapping code anywhere. > In try_to_unuse_page there were some problems with swap counts still > non-zero after replacing all of the process references to a page, > apparently due to the swap map count being elevated while swapping is in > progress. (It shows up if a swapoff command is run while the system is > swapping heavily.) I've modified the code to make multiple passes in the > event that pages are still in use, and to report EBUSY if the counts > can't all be cleared. Hmm. That shouldn't be a problem if everything is working correctly. However, your first change (the extra swap_duplicate) will leave the swap count elevated while swapin is occurring, and that could certainly lead to this symptom in swapoff(). Does the swapoff problem still occur on an unmodified kernel? > In rw_swap_page there was an atomic_dec() of the page count after a sync > read or write, and I think it's possible that the page could have become > unused while waiting for the I/O operation. I've changed the atomic_dec > into a free_page() and added a printk to show whether it actually occurs > in practice. There is also a matching atomic_inc() up above. All swapout is done in try_to_swap_out(), which doesn't do a free_page() to match the unhook of the pte until after the rw_swap_page completes. Swapin should all be done via the swap cache now, and that will also guarantee an extra reference against the page for as long as rw_swap_page is running. However, there are a few borderline cases such as trying to remove swap cache from a locked page which I'll have a check for, as they might make this dangerous. > The patch also includes a few minor cleanups for unused code and > prototypes in swap.h. I've tested it here and it seems to work OK, but > would like some further testing. Also, it probably won't help with the > "found a writable swap page" message reported recently; I'm continuing > to look for that problem. Me too, and I haven't found anything definitely incriminating so far. The one thing I _have_ found is a day-one threads bug in anonymous page-in. do_no_page() handles write accesses to demand-zero memory with: anonymous_page: entry = pte_wrprotect(mk_pte(ZERO_PAGE, vma->vm_page_prot)); if (write_access) { unsigned long page = __get_free_page(GFP_KERNEL); if (!page) goto sigbus; clear_page(page); entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot))); vma->vm_mm->rss++; tsk->min_flt++; flush_page_to_ram(page); } put_page(page_table, entry); The __get_free_page() may block, however, and in a threaded environment this will cause the loss of user data plus a memory leak if two threads hit this race. However, I don't think it's related to the current writable cached page problems. Could you cast your eyes over the patch below? It builds fine and passes the tests I've thrown at it so far, but I'd like a second opinion before forwarding it as a patch for 2.0. --Stephen ---------------------------------------------------------------- Index: mm/memory.c =================================================================== RCS file: /home/rcs/CVS/kswap3/linux/mm/memory.c,v retrieving revision 1.1.2.2 diff -u -r1.1.2.2 memory.c --- memory.c 1998/03/08 16:32:57 1.1.2.2 +++ memory.c 1998/05/24 17:19:07 @@ -839,9 +839,14 @@ anonymous_page: entry = pte_wrprotect(mk_pte(ZERO_PAGE, vma->vm_page_prot)); if (write_access) { + pte_t old_entry = *page_table; unsigned long page = __get_free_page(GFP_KERNEL); if (!page) goto sigbus; + if (pte_val(old_entry) != pte_val(*page_table)) { + free_page(page); + return; + } clear_page(page); entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot))); vma->vm_mm->rss++; ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: patch for 2.1.102 swap code 1998-05-24 17:28 ` patch for 2.1.102 swap code Stephen C. Tweedie @ 1998-05-25 10:07 ` David S. Miller 1998-05-25 12:38 ` Eric W. Biederman 1998-05-25 12:52 ` Bill Hawes 2 siblings, 0 replies; 13+ messages in thread From: David S. Miller @ 1998-05-25 10:07 UTC (permalink / raw) To: sct; +Cc: whawes, linux-kernel, torvalds, linux-mm, number6 Date: Sun, 24 May 1998 18:28:48 +0100 From: "Stephen C. Tweedie" <sct@dcs.ed.ac.uk> The __get_free_page() may block, however, and in a threaded environment this will cause the loss of user data plus a memory leak if two threads hit this race. However, I don't think it's related to the current writable cached page problems. The mmap semaphore is held, it cannot happen. Later, David S. Miller davem@dm.cobaltmicro.com ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: patch for 2.1.102 swap code 1998-05-24 17:28 ` patch for 2.1.102 swap code Stephen C. Tweedie 1998-05-25 10:07 ` David S. Miller @ 1998-05-25 12:38 ` Eric W. Biederman 1998-05-26 21:52 ` Stephen C. Tweedie 1998-05-25 12:52 ` Bill Hawes 2 siblings, 1 reply; 13+ messages in thread From: Eric W. Biederman @ 1998-05-25 12:38 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Bill Hawes, linux-mm >>>>> "ST" == Stephen C Tweedie <sct@dcs.ed.ac.uk> writes: ST> On Thu, 21 May 1998 14:56:48 -0400, Bill Hawes <whawes@star.net> said: >> In try_to_unuse_page there were some problems with swap counts still >> non-zero after replacing all of the process references to a page, >> apparently due to the swap map count being elevated while swapping is in >> progress. (It shows up if a swapoff command is run while the system is >> swapping heavily.) I've modified the code to make multiple passes in the >> event that pages are still in use, and to report EBUSY if the counts >> can't all be cleared. ST> Hmm. That shouldn't be a problem if everything is working correctly. ST> However, your first change (the extra swap_duplicate) will leave the ST> swap count elevated while swapin is occurring, and that could certainly ST> lead to this symptom in swapoff(). Does the swapoff problem still occur ST> on an unmodified kernel? Note: there is a problem with swapoff that should at least be considered. If you use have a SYSV shared memory, and don't map it into a process, and that memory get's swapped out, swapoff will not be able to find it. This is a very long standing bug and appears not to be a problem in practice. But it is certainly a potential problem. Eric ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: patch for 2.1.102 swap code 1998-05-25 12:38 ` Eric W. Biederman @ 1998-05-26 21:52 ` Stephen C. Tweedie 1998-06-11 14:31 ` Eric W. Biederman 0 siblings, 1 reply; 13+ messages in thread From: Stephen C. Tweedie @ 1998-05-26 21:52 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Stephen C. Tweedie, Bill Hawes, linux-mm Hi, > Note: there is a problem with swapoff that should at least be considered. > If you use have a SYSV shared memory, and don't map it into a process, > and that memory get's swapped out, swapoff will not be able to find it. > This is a very long standing bug and appears not to be a problem in practice. > But it is certainly a potential problem. Thanks; it's added to my list. --Stephen ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: patch for 2.1.102 swap code 1998-05-26 21:52 ` Stephen C. Tweedie @ 1998-06-11 14:31 ` Eric W. Biederman 1998-06-12 21:29 ` Stephen C. Tweedie 0 siblings, 1 reply; 13+ messages in thread From: Eric W. Biederman @ 1998-06-11 14:31 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: linux-mm >>>>> "ST" == Stephen C Tweedie <sct@dcs.ed.ac.uk> writes: ST> Hi, >> Note: there is a problem with swapoff that should at least be considered. >> If you use have a SYSV shared memory, and don't map it into a process, >> and that memory get's swapped out, swapoff will not be able to find it. >> This is a very long standing bug and appears not to be a problem in practice. >> But it is certainly a potential problem. ST> Thanks; it's added to my list. Here is a preliminary patch that should fix the problem. diff -uNrX /home/eric/projects/linux/linux-ignore-files linux-2.1.101.x0/include/linux/swap.h linux-2.1.101.x2/include/linux/swap.h --- linux-2.1.101.x0/include/linux/swap.h Wed Apr 22 11:08:16 1998 +++ linux-2.1.101.x2/include/linux/swap.h Wed Jun 3 22:21:51 1998 @@ -75,7 +75,18 @@ void si_swapinfo(struct sysinfo *); unsigned long get_swap_page(void); extern void FASTCALL(swap_free(unsigned long)); + + /* So that external drivers can use swap, swapoff */ +struct swap_unuse { + int (*func)(unsigned int type, void *arg); + void *arg; + struct swap_unuse *prev; + struct swap_unuse *next; +}; +void register_swap_unuse_function (struct swap_unuse *swap_unuse); +void unregister_swap_unuse_function (struct swap_unuse *swap_unuse); + /* * vm_ops not present page codes for shared memory. * diff -uNrX /home/eric/projects/linux/linux-ignore-files linux-2.1.101.x0/ipc/shm.c linux-2.1.101.x2/ipc/shm.c --- linux-2.1.101.x0/ipc/shm.c Wed Jun 3 22:54:23 1998 +++ linux-2.1.101.x2/ipc/shm.c Thu Jun 4 09:23:28 1998 @@ -30,6 +30,8 @@ static void shm_open (struct vm_area_struct *shmd); static void shm_close (struct vm_area_struct *shmd); static pte_t shm_swap_in(struct vm_area_struct *, unsigned long, unsigned long); +static int shm_try_to_unuse_seg(struct shmid_ds *shp, unsigned int type); +static int shm_try_to_unuse(unsigned int type, void *arg); static int shm_tot = 0; /* total number of shared memory pages */ static int shm_rss = 0; /* number of shared memory pages that are in memory */ @@ -45,6 +47,11 @@ static ulong swap_successes = 0; static ulong used_segs = 0; +/* swap off */ +static struct swap_unuse shm_swap_unuse = { + shm_try_to_unuse, 0, 0, 0 +}; + __initfunc(void shm_init (void)) { int id; @@ -53,6 +60,7 @@ shm_segs[id] = (struct shmid_ds *) IPC_UNUSED; shm_tot = shm_rss = shm_seq = max_shmid = used_segs = 0; shm_lock = NULL; + register_swap_unuse_function(&shm_swap_unuse); return; } @@ -830,4 +838,56 @@ shm_swp++; shm_rss--; return 1; +} + +/* Swap off support */ +static int shm_try_to_unuse_seg(struct shmid_ds *shp, unsigned int type) +{ + unsigned long page = 0; + int i, numpages; + numpages = shp->shm_npages; + for (i = 0; i < numpages; i++) { + pte_t pte; + if (!page) { + page = get_free_page(GFP_KERNEL); + } + pte = __pte(shp->shm_pages[i]); + if (pte_none(pte)) + continue; + if (pte_present(pte)) + continue; + if (!page) + return -1; + rw_swap_page_nocache(READ, pte_val(pte), (char *)page); + pte = __pte(shp->shm_pages[i]); + if (!pte_present(pte)) { + swap_free(pte_val(pte)); + shm_swp--; + shm_rss++; + pte = pte_mkdirty(mk_pte(page, PAGE_SHARED)); + shp->shm_pages[i] = pte_val(pte); + } + } + if (page) { + free_page(page); + page = 0; + } + return 0; +} + +static int shm_try_to_unuse(unsigned int type, void *arg) +{ + int id; + struct shmid_ds *ident; + for(id = 0; id < SHMMNI; id++) { + ident = shm_segs[id]; + if ((ident != (struct shmid_ds *) IPC_UNUSED) && + (ident != (struct shmid_ds *) IPC_NOID)) { + int result; + result = shm_try_to_unuse_seg(ident, type); + if (result != 0) + return -1; + } + } + return 0; } diff -uNrX /home/eric/projects/linux/linux-ignore-files linux-2.1.101.x0/mm/Makefile linux-2.1.101.x2/mm/Makefile --- linux-2.1.101.x0/mm/Makefile Tue May 12 14:17:54 1998 +++ linux-2.1.101.x2/mm/Makefile Wed Jun 3 22:21:51 1998 @@ -12,4 +12,6 @@ vmalloc.o slab.o \ swap.o vmscan.o page_io.o page_alloc.o swap_state.o swapfile.o +OX_OBJS := swap_syms.o + include $(TOPDIR)/Rules.make diff -uNrX /home/eric/projects/linux/linux-ignore-files linux-2.1.101.x0/mm/swap_syms.c linux-2.1.101.x2/mm/swap_syms.c --- linux-2.1.101.x0/mm/swap_syms.c Wed Dec 31 18:00:00 1969 +++ linux-2.1.101.x2/mm/swap_syms.c Wed Jun 3 22:21:51 1998 @@ -0,0 +1,15 @@ +#include <linux/config.h> +#include <linux/module.h> +#include <linux/fs.h> +#include <linux/mm.h> +#include <linux/swap.h> + +/* Explicit swapping */ +EXPORT_SYMBOL(rw_swap_page); +EXPORT_SYMBOL(rw_swap_page_nocache); +EXPORT_SYMBOL(swap_free); +EXPORT_SYMBOL(get_swap_page); +EXPORT_SYMBOL(si_swapinfo); +EXPORT_SYMBOL(register_swap_unuse_function); +EXPORT_SYMBOL(unregister_swap_unuse_function); +EXPORT_SYMBOL(swapper_inode); diff -uNrX /home/eric/projects/linux/linux-ignore-files linux-2.1.101.x0/mm/swapfile.c linux-2.1.101.x2/mm/swapfile.c --- linux-2.1.101.x0/mm/swapfile.c Tue May 12 14:17:54 1998 +++ linux-2.1.101.x2/mm/swapfile.c Wed Jun 3 22:21:51 1998 @@ -298,7 +298,7 @@ * and then search for the process using it. All the necessary * page table adjustments can then be made atomically. */ -static int try_to_unuse(unsigned int type) +static int try_to_unuse_processes(unsigned int type, void *dummy) { struct swap_info_struct * si = &swap_info[type]; struct task_struct *p; @@ -345,6 +345,69 @@ } } return 0; +} + +struct swap_unuse unuse_processes = +{ + try_to_unuse_processes, + NULL, + &unuse_processes, + &unuse_processes +}; + +/* Don't add or remove unuse functions while a swapoff is in progress. + * It reduces some theoretical races. + */ +static int swap_unuse_lock = 0; +static struct wait_queue *swap_unuse_wait = NULL; + +static int try_to_unuse(unsigned int type) +{ + int error = 0; + struct swap_unuse *unuse_func, *next_func; + next_func = unuse_processes.next; + swap_unuse_lock = 1; + do { + unuse_func = next_func; + next_func = unuse_func->next; + error = unuse_func->func(type, unuse_func->arg); + if (error) { + return error; + } + } while (unuse_func != &unuse_processes); + swap_unuse_lock = 0; + wake_up(&swap_unuse_wait); + return 0; +} + +void register_swap_unuse_function (struct swap_unuse *swap_unuse) +{ + if (!swap_unuse) { + return; + } + while (swap_unuse_lock) { + sleep_on(&swap_unuse_wait); + } + swap_unuse->prev = &unuse_processes; + swap_unuse->next = unuse_processes.next; + unuse_processes.next->prev = swap_unuse; + unuse_processes.next = swap_unuse; +} + +void unregister_swap_unuse_function (struct swap_unuse *swap_unuse) +{ + struct swap_unuse *next; + if (!swap_unuse) { + return; + } + while (swap_unuse_lock) { + sleep_on(&swap_unuse_wait); + } + next = swap_unuse->next; + next->prev = swap_unuse->prev; + next->prev->next = next; + + swap_unuse->next = swap_unuse->prev = NULL; } asmlinkage int sys_swapoff(const char * specialfile) ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: patch for 2.1.102 swap code 1998-06-11 14:31 ` Eric W. Biederman @ 1998-06-12 21:29 ` Stephen C. Tweedie 0 siblings, 0 replies; 13+ messages in thread From: Stephen C. Tweedie @ 1998-06-12 21:29 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Stephen C. Tweedie, linux-mm Hi, On 11 Jun 1998 09:31:22 -0500, ebiederm+eric@npwt.net (Eric W. Biederman) said: >>>>>> "ST" == Stephen C Tweedie <sct@dcs.ed.ac.uk> writes: ST> Hi, >>> Note: there is a problem with swapoff that should at least be considered. >>> If you use have a SYSV shared memory, and don't map it into a process, >>> and that memory get's swapped out, swapoff will not be able to find it. ST> Thanks; it's added to my list. > Here is a preliminary patch that should fix the problem. Thanks; it's queued for attention. I'm off to Usenix tomorrow, but I'll be doing lots and lots of VM stuff when I get back (I'm going full time courtesy of Red Hat --- yay!), so I'll check and test and then submit to Linus if all looks OK. --Stephen ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: patch for 2.1.102 swap code 1998-05-24 17:28 ` patch for 2.1.102 swap code Stephen C. Tweedie 1998-05-25 10:07 ` David S. Miller 1998-05-25 12:38 ` Eric W. Biederman @ 1998-05-25 12:52 ` Bill Hawes 1998-05-25 13:42 ` David S. Miller 1998-05-26 21:38 ` Stephen C. Tweedie 2 siblings, 2 replies; 13+ messages in thread From: Bill Hawes @ 1998-05-25 12:52 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Linux Kernel List, Linus Torvalds, linux-mm, Alan Cox Stephen C. Tweedie wrote: > I don't think this should happen: do you have solid evidence that it is > a problem? The code in read_swap_cache_async() does > > if (!add_to_swap_cache(new_page, entry)) { > free_page(new_page_addr); > return 0; > } > swap_duplicate(entry); /* Account for the swap cache */ > set_bit(PG_locked, &new_page->flags); > rw_swap_page(READ, entry, (char *) new_page_addr, wait); > > which should guarantee a used entry while the IO is in progress. Even > if the only use of the entry is the swap cache, it should still be > there, and because the page is locked during the IO, it should not be > possible for the swap cache reference to be removed before the check in > rw_swap_page(). There may well be a more subtle problem here, but I > don't think the solution is yet another swap_duplicate(), and I haven't > seen reports that might suggest we're losing an unlock in the swapping > code anywhere. Hi Stephen, The problem with the swap entry being unused could occur before reaching the code above. If the swap cache lookup fails, the process will have to allocate a page and may block, allowing multiple processes to block on get_free_page. Then the process that completes first could end up releasing the page and swap cache, so that when the other processes wake up from get_free_page the swap entry will no longer be valid. In the above code sequence, the add_to_swap_cache will succeed, swap_duplicate will complain and fail, and rw_swap_cache will complain and return leaving the page locked. There would be warning messages of the problem, but I'd rather avoid it in the first place. I'm reasonably certain this sequence of events could occur, and other places in the kernel (e.g. swap_in) have tests for the case where multiple processes try to read in the same swap page. The changes in my patch will prevent the swap entry from disappearing when we want to read it in, and will keep read_swap_cache from returning spurious failures. > > In try_to_unuse_page there were some problems with swap counts still > > non-zero after replacing all of the process references to a page, > > apparently due to the swap map count being elevated while swapping is in > > progress. (It shows up if a swapoff command is run while the system is > > swapping heavily.) I've modified the code to make multiple passes in the > > event that pages are still in use, and to report EBUSY if the counts > > can't all be cleared. > > Hmm. That shouldn't be a problem if everything is working correctly. > However, your first change (the extra swap_duplicate) will leave the > swap count elevated while swapin is occurring, and that could certainly > lead to this symptom in swapoff(). Does the swapoff problem still occur > on an unmodified kernel? Yes, I've seen the problem before making the other changes, and there have been some problem reports on the kernel list. > There is also a matching atomic_inc() up above. All swapout is done in > try_to_swap_out(), which doesn't do a free_page() to match the unhook of > the pte until after the rw_swap_page completes. Swapin should all be > done via the swap cache now, and that will also guarantee an extra > reference against the page for as long as rw_swap_page is running. > However, there are a few borderline cases such as trying to remove swap > cache from a locked page which I'll have a check for, as they might make > this dangerous. I'm less certain of the possibility of the page being unused in this case, but in any event replacing the atomic_dec() with a free_page seems prudent to me, as there have been a number of other kernel memory leaks caused by an atomic_dec instead of a free_page. But at the very least we should put the printk warning there so that if the problem does occur it can be corrected in the future. > Me too, and I haven't found anything definitely incriminating so far. > The one thing I _have_ found is a day-one threads bug in anonymous > page-in. do_no_page() handles write accesses to demand-zero memory > with: > > anonymous_page: > entry = pte_wrprotect(mk_pte(ZERO_PAGE, vma->vm_page_prot)); > if (write_access) { > unsigned long page = __get_free_page(GFP_KERNEL); > if (!page) > goto sigbus; > clear_page(page); > entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot))); > vma->vm_mm->rss++; > tsk->min_flt++; > flush_page_to_ram(page); > } > put_page(page_table, entry); > > The __get_free_page() may block, however, and in a threaded environment > this will cause the loss of user data plus a memory leak if two threads > hit this race. However, I don't think it's related to the current > writable cached page problems. > > Could you cast your eyes over the patch below? It builds fine and > passes the tests I've thrown at it so far, but I'd like a second opinion > before forwarding it as a patch for 2.0. The patch looks reasonable to me, but as DaveM mentioned in a later mail, the do_wp_page case is supposed to be protected with a semaphore. Regards, Bill ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: patch for 2.1.102 swap code 1998-05-25 12:52 ` Bill Hawes @ 1998-05-25 13:42 ` David S. Miller 1998-05-26 18:00 ` Stephen C. Tweedie 1998-05-26 21:38 ` Stephen C. Tweedie 1 sibling, 1 reply; 13+ messages in thread From: David S. Miller @ 1998-05-25 13:42 UTC (permalink / raw) To: whawes; +Cc: sct, linux-kernel, torvalds, linux-mm, number6 Date: Mon, 25 May 1998 08:52:46 -0400 From: Bill Hawes <whawes@star.net> > Could you cast your eyes over the patch below? It builds fine > and passes the tests I've thrown at it so far, but I'd like a > second opinion before forwarding it as a patch for 2.0. The patch looks reasonable to me, but as DaveM mentioned in a later mail, the do_wp_page case is supposed to be protected with a semaphore. Alas, I thought about this some more. And one piece of code needs to be fixed for this invariant about the semaphore being held in the fault processing code paths to be true everywhere... ptrace()... Later, David S. Miller davem@dm.cobaltmicro.com ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: patch for 2.1.102 swap code 1998-05-25 13:42 ` David S. Miller @ 1998-05-26 18:00 ` Stephen C. Tweedie 0 siblings, 0 replies; 13+ messages in thread From: Stephen C. Tweedie @ 1998-05-26 18:00 UTC (permalink / raw) To: David S. Miller; +Cc: whawes, sct, linux-kernel, torvalds, linux-mm, number6 Hi, On Mon, 25 May 1998 06:42:53 -0700, "David S. Miller" <davem@dm.cobaltmicro.com> said: > Alas, I thought about this some more. And one piece of code needs to > be fixed for this invariant about the semaphore being held in the > fault processing code paths to be true everywhere... ptrace()... Yep --- I was just about to reply to your last mail with this point when I got your follow-up. I've also had one report that the writable cached page reports started when debugging an electric-fenced binary under gdb. Has anyody seen these vm messages who has definitely NOT been running gdb? There's also the point that the whole swapout code munges page tables without ever taking the mm semaphore, but that case ought to be protected by the combination of (a) having the kernel spinlock and (b) never stalling between starting a vma walk and modifying the pte. (The swapout code is pretty paranoid about this.) However, I'm not absolutely 100% sure that we don't have any unfortunate races left by this exception. (For example, do we ever protect a vma by the mm semaphore without also doing a lock_kernel()?) --Stephen ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: patch for 2.1.102 swap code 1998-05-25 12:52 ` Bill Hawes 1998-05-25 13:42 ` David S. Miller @ 1998-05-26 21:38 ` Stephen C. Tweedie 1998-05-26 21:46 ` Rik van Riel 1998-05-27 15:27 ` Bill Hawes 1 sibling, 2 replies; 13+ messages in thread From: Stephen C. Tweedie @ 1998-05-26 21:38 UTC (permalink / raw) To: Bill Hawes Cc: Stephen C. Tweedie, Linux Kernel List, Linus Torvalds, linux-mm, Alan Cox Hi Bill, On Mon, 25 May 1998 08:52:46 -0400, Bill Hawes <whawes@star.net> said: > The problem with the swap entry being unused could occur before reaching > the code above. If the swap cache lookup fails, the process will have to > allocate a page and may block, allowing multiple processes to block on > get_free_page. Then the process that completes first could end up > releasing the page and swap cache, > ... That's why read_swap_cache_async repeats the initial entry lookup after calling __get_free_page(). Unfortunately, I hadn't realised that swap_duplicate() had the error check against swap_map[entry]==0. Moving the swap_duplicate up to before the call to __get_free_page should avoid that case. >> > In try_to_unuse_page there were some problems with swap counts still >> > non-zero after replacing all of the process references to a page, >> > ... >> Hmm. That shouldn't be a problem if everything is working correctly. >> Does the swapoff problem still occur on an unmodified kernel? > Yes, I've seen the problem before making the other changes, and there > have been some problem reports on the kernel list. Excellent --- that should mean it's easy to reproduce, and I've got a test box set up to do tracing on all this code. Is there anything in particular you do to trigger the situation? I've been going over the obvious places in try_to_swap_out and friends, but haven't found anything yet where we might block between updating a pte and modifying the corresponding pte count. >> There is also a matching atomic_inc() up above. All swapout is done in >> try_to_swap_out(), ... > I'm less certain of the possibility of the page being unused in this > case, but in any event replacing the atomic_dec() with a free_page seems > prudent to me, as there have been a number of other kernel memory leaks > caused by an atomic_dec instead of a free_page. But at the very least we > should put the printk warning there so that if the problem does occur it > can be corrected in the future. Yes, the printk will help. Swap should _definitely_ be done through the swap cache, and anything which violates this is broken, so although there should never be a chance of freeing the page, the warning is a useful thing to have. >> Me too, and I haven't found anything definitely incriminating so far. >> The one thing I _have_ found is a day-one threads bug in anonymous >> page-in. do_no_page() handles write accesses to demand-zero memory >> with: >> ... > The patch looks reasonable to me, but as DaveM mentioned in a later > mail, the do_wp_page case is supposed to be protected with a > semaphore. Yep, I've responded to that separately. --Stephen ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: patch for 2.1.102 swap code 1998-05-26 21:38 ` Stephen C. Tweedie @ 1998-05-26 21:46 ` Rik van Riel 1998-06-02 22:21 ` Stephen C. Tweedie 1998-05-27 15:27 ` Bill Hawes 1 sibling, 1 reply; 13+ messages in thread From: Rik van Riel @ 1998-05-26 21:46 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Bill Hawes, Linux Kernel List, linux-mm On Tue, 26 May 1998, Stephen C. Tweedie wrote: > That's why read_swap_cache_async repeats the initial entry lookup after > calling __get_free_page(). Unfortunately, I hadn't realised that > swap_duplicate() had the error check against swap_map[entry]==0. Moving > the swap_duplicate up to before the call to __get_free_page should avoid > that case. Hmm, could read_swap_cache_async() be used to implement swap readahead? Rik. +-------------------------------------------+--------------------------+ | Linux: - LinuxHQ MM-patches page | Scouting webmaster | | - kswapd ask-him & complain-to guy | Vries cubscout leader | | http://www.phys.uu.nl/~riel/ | <H.H.vanRiel@phys.uu.nl> | +-------------------------------------------+--------------------------+ ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: patch for 2.1.102 swap code 1998-05-26 21:46 ` Rik van Riel @ 1998-06-02 22:21 ` Stephen C. Tweedie 0 siblings, 0 replies; 13+ messages in thread From: Stephen C. Tweedie @ 1998-06-02 22:21 UTC (permalink / raw) To: Rik van Riel; +Cc: Stephen C. Tweedie, Bill Hawes, Linux Kernel List, linux-mm Hi Rik, On Tue, 26 May 1998 23:46:35 +0200 (MET DST), Rik van Riel <H.H.vanRiel@phys.uu.nl> said: > Hmm, could read_swap_cache_async() be used to implement swap > readahead? Absolutely, it was designed with that in mind. It's a bit close to 2.1 to actually use it, but I'll be doing a lot of work to tighten things up in 2.3 around the swap code, and readahead will be one component of that. --Stephen ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: patch for 2.1.102 swap code 1998-05-26 21:38 ` Stephen C. Tweedie 1998-05-26 21:46 ` Rik van Riel @ 1998-05-27 15:27 ` Bill Hawes 1 sibling, 0 replies; 13+ messages in thread From: Bill Hawes @ 1998-05-27 15:27 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Linux Kernel List, Linus Torvalds, linux-mm, Alan Cox Stephen C. Tweedie wrote: > That's why read_swap_cache_async repeats the initial entry lookup after > calling __get_free_page(). Unfortunately, I hadn't realised that > swap_duplicate() had the error check against swap_map[entry]==0. Moving > the swap_duplicate up to before the call to __get_free_page should avoid > that case. Hi Stephen, Moving the swap_duplicate() call above the get_free_page() helps, but does not entirely avoid the race: it's possible for lookup_swap_cache() to block on a locked page, and when the process wakes up the swap entry may have disappeared. In order for read_swap_cache to fulfill its contract ("this swap entry exists, go get it for me") it must increment the swap map count before any blocking operation. Hence I moved the swap_duplicate() call above the lookup. I could check for the case of finding the unlocked swap cache page, and only increment the map count if a wait was needed; this would avoid having to increment and decrement the map count if the page is found, at the expense of a little more complexity. I'll post a mopdified patch for comment ... > Excellent --- that should mean it's easy to reproduce, and I've got a > test box set up to do tracing on all this code. Is there anything in > particular you do to trigger the situation? I've been going over the > obvious places in try_to_swap_out and friends, but haven't found > anything yet where we might block between updating a pte and modifying > the corresponding pte count. I've observed the swapoff messages after running swapoff on a quiescent system that had been swapping heavily previously, and also when running swapoff with the system currently swapping. Try setting up a condition of heavy swapping but with adequate memory available (e.g. two kernel compiles in 32M), and then cycle swapoff -a and swapon -a. Running swapoff is a good test for the VM system; unfortunately the current kernels have an unrelated problem with kswapd trying too hard to keep memory blocks, so that swapoff may put the system in an unrecoverable kswapd loop. If you try swapoff when the system doesn't have enough memory, the system will lock rather than let swapoff return failure. But this is a problem of swap policy rather than mechanism, and we can try to fix it after the swap mechanics are 100% solid. Regards, Bill ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~1998-06-12 21:29 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <356478F0.FE1C378F@star.net>
1998-05-24 17:28 ` patch for 2.1.102 swap code Stephen C. Tweedie
1998-05-25 10:07 ` David S. Miller
1998-05-25 12:38 ` Eric W. Biederman
1998-05-26 21:52 ` Stephen C. Tweedie
1998-06-11 14:31 ` Eric W. Biederman
1998-06-12 21:29 ` Stephen C. Tweedie
1998-05-25 12:52 ` Bill Hawes
1998-05-25 13:42 ` David S. Miller
1998-05-26 18:00 ` Stephen C. Tweedie
1998-05-26 21:38 ` Stephen C. Tweedie
1998-05-26 21:46 ` Rik van Riel
1998-06-02 22:21 ` Stephen C. Tweedie
1998-05-27 15:27 ` Bill Hawes
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox