* [patch] fix for OOM deadlock in swap_in (2.2.10) [Re: [test program] for OOM situations ] [not found] <Pine.LNX.4.10.9907010152410.17156-100000@Wotan.suse.de> @ 1999-07-03 13:52 ` Andrea Arcangeli 1999-07-04 9:48 ` Rik van Riel 0 siblings, 1 reply; 8+ messages in thread From: Andrea Arcangeli @ 1999-07-03 13:52 UTC (permalink / raw) To: Bernd Kaindl; +Cc: Linux Kernel, kernel, linux-mm, Linus Torvalds On Thu, 1 Jul 1999, Bernd Kaindl wrote: >The program is quite effective in eating every byte of userspace memory: I have an OOM patch for 2.2.10. Here is a description of the patch: o I discovered a deadlock in the current 2.2.10/2.3.x swapin code. If the swapin was generated by a copy-user, then handle_mm_fault was unconditionally returning 1 to do_page_fault, which then returned to the copy-user code without making progress and without killing the task. I think this is the reason for the deadlocks experienced by some people using 2.2.x. o I also saw a couple of problems for init. The first problem is mapping the bad page in place of the swapped-out page (init doesn't like zeroed pages :). Second, we could send a SIGKILL to init by mistake on OOM. The third problem is that we may force a SIGSEGV from the signal code if we get a fault in the put_users. The problem is that when we get a fault from a copy-user function we can't know whether we got it due to OOM (not enough RAM to allocate the userspace memory) or due to a real fault. This is not a big problem for normal tasks, but since we can't crash init due to OOM we must never send the SIGSEGV signal to init. The simplest way to keep init from getting weird signals under OOM conditions is to forbid force_sig_info from running if the pid of the task is 1. o I made sure that we'll send a SIGKILL and not a SIGBUS if we want to kill a task because we are OOM. o I included a longstanding lowest_bit/highest_bit fix (mentioned some time ago on the list but never included). o I allow only one refresh of the swap_cnt per swapout cycle. 
(bugfix) o I stop kswapd for 10 seconds if do_try_to_free_pages failed, so other tasks will be able to segfault. I run kswapd a bit more in a loop (probably a good idea for routers and when there is plenty of memory in cache). I wake up kswapd from oom(). If oom() doesn't happen (unlikely) kswapd will return to running after 10 sec (or you can kill kswapd by hand to make it run again sooner). o minor do_syslog fix (-EFAULT) o set RT priority of the sigkilled task (to be sure we are not going to run another task before running the signal handler while returning to userspace) Index: linux/arch/i386/mm/fault.c =================================================================== RCS file: /var/cvs/linux/arch/i386/mm/fault.c,v retrieving revision 1.1.1.1 diff -u -r1.1.1.1 fault.c --- linux/arch/i386/mm/fault.c 1999/01/18 01:28:56 1.1.1.1 +++ linux/arch/i386/mm/fault.c 1999/07/02 17:00:29 @@ -50,7 +50,8 @@ start &= PAGE_MASK; for (;;) { - handle_mm_fault(current,vma, start, 1); + if (!handle_mm_fault(current,vma, start, 1)) + goto bad_area; if (!size) break; size--; @@ -163,7 +164,7 @@ * the fault. */ if (!handle_mm_fault(tsk, vma, address, write)) - goto do_sigbus; + goto failed; /* * Did it hit the DOS screen memory VA from vm86 mode? @@ -255,7 +256,7 @@ * We ran out of memory, or some other thing happened to us that made * us unable to handle the page fault gracefully. */ -do_sigbus: +failed: up(&mm->mmap_sem); /* @@ -265,7 +266,6 @@ tsk->tss.cr2 = address; tsk->tss.error_code = error_code; tsk->tss.trap_no = 14; - force_sig(SIGBUS, tsk); /* Kernel mode? 
Handle exceptions or die */ if (!(error_code & 4)) Index: linux/include/linux/mm.h =================================================================== RCS file: /var/cvs/linux/include/linux/mm.h,v retrieving revision 1.1.1.10 diff -u -r1.1.1.10 mm.h --- linux/include/linux/mm.h 1999/06/14 15:29:23 1.1.1.10 +++ linux/include/linux/mm.h 1999/07/03 13:10:09 @@ -300,7 +300,7 @@ extern unsigned long paging_init(unsigned long start_mem, unsigned long end_mem); extern void mem_init(unsigned long start_mem, unsigned long end_mem); extern void show_mem(void); -extern void oom(struct task_struct * tsk); +extern void oom(void); extern void si_meminfo(struct sysinfo * val); /* mmap.c */ Index: linux/include/linux/swap.h =================================================================== RCS file: /var/cvs/linux/include/linux/swap.h,v retrieving revision 1.1.1.5 diff -u -r1.1.1.5 swap.h --- linux/include/linux/swap.h 1999/06/14 15:29:38 1.1.1.5 +++ linux/include/linux/swap.h 1999/07/02 01:50:19 @@ -69,6 +69,7 @@ extern struct inode swapper_inode; extern unsigned long page_cache_size; extern int buffermem; +extern struct wait_queue * kswapd_wait_oom; /* Incomplete types for prototype declarations: */ struct task_struct; @@ -91,8 +92,8 @@ extern void swap_after_unlock_page (unsigned long entry); /* linux/mm/page_alloc.c */ -extern void swap_in(struct task_struct *, struct vm_area_struct *, - pte_t *, unsigned long, int); +extern int swap_in(struct task_struct *, struct vm_area_struct *, + pte_t *, unsigned long, int); /* linux/mm/swap_state.c */ Index: linux/ipc/shm.c =================================================================== RCS file: /var/cvs/linux/ipc/shm.c,v retrieving revision 1.1.1.10 diff -u -r1.1.1.10 shm.c --- linux/ipc/shm.c 1999/06/14 15:30:04 1.1.1.10 +++ linux/ipc/shm.c 1999/07/03 01:53:39 @@ -636,7 +636,7 @@ if (!pte_present(pte)) { unsigned long page = get_free_page(GFP_USER); if (!page) { - oom(current); + oom(); return 0; } pte = 
__pte(shp->shm_pages[idx]); Index: linux/kernel/printk.c =================================================================== RCS file: /var/cvs/linux/kernel/printk.c,v retrieving revision 1.1.1.6 diff -u -r1.1.1.6 printk.c --- linux/kernel/printk.c 1999/06/14 15:30:05 1.1.1.6 +++ linux/kernel/printk.c 1999/07/01 19:31:06 @@ -147,7 +147,8 @@ log_size--; log_start &= LOG_BUF_LEN-1; sti(); - __put_user(c,buf); + if ((error = __put_user(c,buf)) == -EFAULT) + goto out; buf++; i++; cli(); @@ -183,7 +184,8 @@ __restore_flags(flags); for (i = 0; i < count; i++) { c = *((char *) log_buf+(j++ & (LOG_BUF_LEN-1))); - __put_user(c, buf++); + if ((error = __put_user(c, buf++)) == -EFAULT) + goto out; } if (do_clear) logged_chars = 0; Index: linux/kernel/signal.c =================================================================== RCS file: /var/cvs/linux/kernel/signal.c,v retrieving revision 1.1.1.9 diff -u -r1.1.1.9 signal.c --- linux/kernel/signal.c 1999/06/14 15:30:06 1.1.1.9 +++ linux/kernel/signal.c 1999/07/03 02:56:30 @@ -409,6 +409,9 @@ { unsigned long int flags; + if (t->pid == 1) + return 0; + spin_lock_irqsave(&t->sigmask_lock, flags); if (t->sig == NULL) { spin_unlock_irqrestore(&t->sigmask_lock, flags); Index: linux/mm/filemap.c =================================================================== RCS file: /var/cvs/linux/mm/filemap.c,v retrieving revision 1.1.1.17 diff -u -r1.1.1.17 filemap.c --- linux/mm/filemap.c 1999/06/14 15:30:07 1.1.1.17 +++ linux/mm/filemap.c 1999/07/03 01:52:48 @@ -939,7 +939,7 @@ new_page = 0; offset = (address & PAGE_MASK) - area->vm_start + area->vm_offset; if (offset >= inode->i_size && (area->vm_flags & VM_SHARED) && area->vm_mm == current->mm) - goto no_page; + goto do_sigbus; /* * Do we have something in the page cache already? 
@@ -958,7 +958,7 @@ if (no_share && !new_page) { new_page = page_cache_alloc(); if (!new_page) - goto failure; + goto release_and_oom; } if (PageLocked(page)) @@ -1006,7 +1006,7 @@ if (!new_page) new_page = page_cache_alloc(); if (!new_page) - goto no_page; + goto oom; /* * During getting the above page we might have slept, @@ -1058,7 +1058,16 @@ page_cache_release(page); if (new_page) page_cache_free(new_page); -no_page: + return 0; + +do_sigbus: + force_sig(SIGBUS, current); + return 0; + +release_and_oom: + page_cache_release(page); +oom: + oom(); return 0; } Index: linux/mm/memory.c =================================================================== RCS file: /var/cvs/linux/mm/memory.c,v retrieving revision 1.1.1.8 diff -u -r1.1.1.8 memory.c --- linux/mm/memory.c 1999/06/14 15:30:07 1.1.1.8 +++ linux/mm/memory.c 1999/07/03 02:56:27 @@ -65,10 +65,23 @@ * oom() prints a message (so that the user knows why the process died), * and gives the process an untrappable SIGKILL. */ -void oom(struct task_struct * task) +void oom(void) { - printk("\nOut of memory for %s.\n", task->comm); - force_sig(SIGKILL, task); + if (current->pid == 1) + return; + + /* + * Make sure the killed task will be scheduled ASAP. Otherwise + * we risk to reschedule another task before running the signal + * hander. + */ + current->policy = SCHED_FIFO; + current->rt_priority = 1000000; + + printk("\nOut of memory for %s.\n", current->comm); + force_sig(SIGKILL, current); + + wake_up_interruptible(&kswapd_wait_oom); } /* @@ -545,19 +558,6 @@ } /* - * sanity-check function.. - */ -static void put_page(pte_t * page_table, pte_t pte) -{ - if (!pte_none(*page_table)) { - free_page_and_swap_cache(pte_page(pte)); - return; - } -/* no need for flush_tlb */ - set_pte(page_table, pte); -} - -/* * This routine is used to map in a page into an address space: needed by * execve() for the initial stack and environment pages. 
*/ @@ -575,13 +575,13 @@ pmd = pmd_alloc(pgd, address); if (!pmd) { free_page(page); - oom(tsk); + oom(); return 0; } pte = pte_alloc(pmd, address); if (!pte) { free_page(page); - oom(tsk); + oom(); return 0; } if (!pte_none(*pte)) { @@ -614,21 +614,15 @@ * and potentially makes it more efficient. */ static int do_wp_page(struct task_struct * tsk, struct vm_area_struct * vma, - unsigned long address, pte_t *page_table) + unsigned long address, pte_t *page_table, pte_t pte) { - pte_t pte; unsigned long old_page, new_page; struct page * page_map; - pte = *page_table; new_page = __get_free_page(GFP_USER); /* Did swap_out() unmapped the protected page while we slept? */ if (pte_val(*page_table) != pte_val(pte)) goto end_wp_page; - if (!pte_present(pte)) - goto end_wp_page; - if (pte_write(pte)) - goto end_wp_page; old_page = pte_page(pte); if (MAP_NR(old_page) >= max_mapnr) goto bad_wp_page; @@ -684,13 +678,16 @@ return 1; bad_wp_page: + unlock_kernel(); printk("do_wp_page: bogus page at address %08lx (%08lx)\n",address,old_page); send_sig(SIGKILL, tsk, 1); -no_new_page: - unlock_kernel(); if (new_page) free_page(new_page); return 0; +no_new_page: + unlock_kernel(); + oom(); + return 0; } /* @@ -787,8 +784,9 @@ struct vm_area_struct * vma, unsigned long address, pte_t * page_table, pte_t entry, int write_access) { + int ret = 1; if (!vma->vm_ops || !vma->vm_ops->swapin) { - swap_in(tsk, vma, page_table, pte_val(entry), write_access); + ret = swap_in(tsk, vma, page_table, pte_val(entry), write_access); flush_page_to_ram(pte_page(*page_table)); } else { pte_t page = vma->vm_ops->swapin(vma, address - vma->vm_start + vma->vm_offset, pte_val(entry)); @@ -805,7 +803,7 @@ } } unlock_kernel(); - return 1; + return ret; } /* @@ -817,15 +815,18 @@ if (write_access) { unsigned long page = __get_free_page(GFP_USER); if (!page) - return 0; + goto oom; clear_page(page); entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot))); vma->vm_mm->rss++; tsk->min_flt++; 
flush_page_to_ram(page); } - put_page(page_table, entry); + set_pte(page_table, entry); return 1; + oom: + oom(); + return 0; } /* @@ -882,7 +883,7 @@ } else if (atomic_read(&mem_map[MAP_NR(page)].count) > 1 && !(vma->vm_flags & VM_SHARED)) entry = pte_wrprotect(entry); - put_page(page_table, entry); + set_pte(page_table, entry); /* no need to invalidate: a not-present page shouldn't be cached */ return 1; } @@ -916,7 +917,7 @@ flush_tlb_page(vma, address); if (write_access) { if (!pte_write(entry)) - return do_wp_page(tsk, vma, address, pte); + return do_wp_page(tsk, vma, address, pte, entry); entry = pte_mkdirty(entry); set_pte(pte, entry); @@ -934,18 +935,23 @@ { pgd_t *pgd; pmd_t *pmd; + pte_t * pte; pgd = pgd_offset(vma->vm_mm, address); pmd = pmd_alloc(pgd, address); - if (pmd) { - pte_t * pte = pte_alloc(pmd, address); - if (pte) { - if (handle_pte_fault(tsk, vma, address, write_access, pte)) { - update_mmu_cache(vma, address, *pte); - return 1; - } - } - } + if (!pmd) + goto oom; + pte = pte_alloc(pmd, address); + if (!pte) + goto oom; + if (!handle_pte_fault(tsk, vma, address, write_access, pte)) + goto fail; + update_mmu_cache(vma, address, *pte); + return 1; + + oom: + oom(); + fail: return 0; } Index: linux/mm/page_alloc.c =================================================================== RCS file: /var/cvs/linux/mm/page_alloc.c,v retrieving revision 1.1.1.10 diff -u -r1.1.1.10 page_alloc.c --- linux/mm/page_alloc.c 1999/06/14 15:30:08 1.1.1.10 +++ linux/mm/page_alloc.c 1999/07/03 13:12:22 @@ -373,8 +373,13 @@ /* Ok, do the async read-ahead now */ new_page = read_swap_cache_async(SWP_ENTRY(SWP_TYPE(entry), offset), 0); - if (new_page != NULL) - __free_page(new_page); + /* + * If we are OOM or if this isn't a contiguous swap cluster + * stop the readahead. + */ + if (!new_page) + return; + __free_page(new_page); offset++; } while (--i); return; @@ -387,7 +392,7 @@ * Also, don't bother to add to the swap cache if this page-in * was due to a write access. 
*/ -void swap_in(struct task_struct * tsk, struct vm_area_struct * vma, +int swap_in(struct task_struct * tsk, struct vm_area_struct * vma, pte_t * page_table, unsigned long entry, int write_access) { unsigned long page; @@ -400,13 +405,11 @@ if (pte_val(*page_table) != entry) { if (page_map) free_page_and_swap_cache(page_address(page_map)); - return; + return 1; } if (!page_map) { - set_pte(page_table, BAD_PAGE); - swap_free(entry); - oom(tsk); - return; + oom(); + return 0; } page = page_address(page_map); @@ -416,7 +419,7 @@ if (!write_access || is_page_shared(page_map)) { set_pte(page_table, mk_pte(page, vma->vm_page_prot)); - return; + return 1; } /* @@ -426,5 +429,5 @@ */ delete_from_swap_cache(page_map); set_pte(page_table, pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)))); - return; + return 1; } Index: linux/mm/swapfile.c =================================================================== RCS file: /var/cvs/linux/mm/swapfile.c,v retrieving revision 1.1.1.8 diff -u -r1.1.1.8 swapfile.c --- linux/mm/swapfile.c 1999/06/14 15:30:09 1.1.1.8 +++ linux/mm/swapfile.c 1999/07/01 12:09:00 @@ -131,15 +131,17 @@ offset = SWP_OFFSET(entry); if (offset >= p->max) goto bad_offset; - if (offset < p->lowest_bit) - p->lowest_bit = offset; - if (offset > p->highest_bit) - p->highest_bit = offset; if (!p->swap_map[offset]) goto bad_free; if (p->swap_map[offset] < SWAP_MAP_MAX) { if (!--p->swap_map[offset]) + { + if (offset < p->lowest_bit) + p->lowest_bit = offset; + if (offset > p->highest_bit) + p->highest_bit = offset; nr_swap_pages++; + } } #ifdef DEBUG_SWAP printk("DebugVM: swap_free(entry %08lx, count now %d)\n", Index: linux/mm/vmscan.c =================================================================== RCS file: /var/cvs/linux/mm/vmscan.c,v retrieving revision 1.1.1.8 diff -u -r1.1.1.8 vmscan.c --- linux/mm/vmscan.c 1999/06/14 15:30:09 1.1.1.8 +++ linux/mm/vmscan.c 1999/07/02 01:36:25 @@ -308,7 +308,8 @@ static int swap_out(unsigned int priority, int 
gfp_mask) { struct task_struct * p, * pbest; - int counter, assign, max_cnt; + int assign = 0, counter; + unsigned long max_cnt; /* * We make one or two passes through the task list, indexed by @@ -327,11 +328,8 @@ counter = nr_tasks / (priority+1); if (counter < 1) counter = 1; - if (counter > nr_tasks) - counter = nr_tasks; for (; counter >= 0; counter--) { - assign = 0; max_cnt = 0; pbest = NULL; select: @@ -343,7 +341,7 @@ if (p->mm->rss <= 0) continue; /* Refresh swap_cnt? */ - if (assign) + if (assign == 1) p->mm->swap_cnt = p->mm->rss; if (p->mm->swap_cnt > max_cnt) { max_cnt = p->mm->swap_cnt; @@ -351,6 +349,8 @@ } } read_unlock(&tasklist_lock); + if (assign == 1) + assign = 2; if (!pbest) { if (!assign) { assign = 1; @@ -435,7 +435,8 @@ printk ("Starting kswapd v%.*s\n", i, s); } -static struct task_struct *kswapd_process; +static struct wait_queue * kswapd_wait = NULL; +struct wait_queue * kswapd_wait_oom = NULL; /* * The background pageout daemon, started as a kernel thread @@ -455,7 +456,6 @@ { struct task_struct *tsk = current; - kswapd_process = tsk; tsk->session = 1; tsk->pgrp = 1; strcpy(tsk->comm, "kswapd"); @@ -484,16 +484,17 @@ * the processes needing more memory will wake us * up on a more timely basis. 
*/ - do { - if (nr_free_pages >= freepages.high) - break; - - if (!do_try_to_free_pages(GFP_KSWAPD)) - break; - } while (!tsk->need_resched); - run_task_queue(&tq_disk); - tsk->state = TASK_INTERRUPTIBLE; - schedule_timeout(HZ); + interruptible_sleep_on_timeout(&kswapd_wait, HZ); + while (nr_free_pages < freepages.high) + { + if (do_try_to_free_pages(GFP_KSWAPD)) + { + if (tsk->need_resched) + schedule(); + continue; + } + interruptible_sleep_on_timeout(&kswapd_wait_oom,10*HZ); + } } } @@ -516,7 +517,7 @@ { int retval = 1; - wake_up_process(kswapd_process); + wake_up_interruptible(&kswapd_wait); if (gfp_mask & __GFP_WAIT) retval = do_try_to_free_pages(gfp_mask); return retval; With this patch applied to 2.2.10 I can't deadlock the machine anymore and I can't screw up init anymore. I would like to get some feedback about the patch. Thanks :). Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [patch] fix for OOM deadlock in swap_in (2.2.10) [Re: [test program] for OOM situations ] 1999-07-03 13:52 ` [patch] fix for OOM deadlock in swap_in (2.2.10) [Re: [test program] for OOM situations ] Andrea Arcangeli @ 1999-07-04 9:48 ` Rik van Riel 1999-07-04 14:35 ` Andrea Arcangeli 1999-07-04 17:11 ` Linus Torvalds 0 siblings, 2 replies; 8+ messages in thread From: Rik van Riel @ 1999-07-04 9:48 UTC (permalink / raw) To: Andrea Arcangeli Cc: Bernd Kaindl, Linux Kernel, kernel, linux-mm, Linus Torvalds On Sat, 3 Jul 1999, Andrea Arcangeli wrote: > +void oom(void) > { ... > + force_sig(SIGKILL, current); > I would like to get some feedback about the patch. Thanks :). I'm curious why you haven't yet included my process selection algorithm. I know it can select a blocked or otherwise unkillable process the way the code is right now, but a workaround for that can be made in about 5 minutes. The "show me the code" attitude doesn't seem to fit me now, unfortunately. I'm still busy moving house and I have to be a good python programmer by tomorrow. I've never seen much Python, but I have the book in front of me :) cheers, Rik -- Open Source: you deserve to be in control of your data. +-------------------------------------------------------------------+ | Le Reseau netwerksystemen BV: http://www.reseau.nl/ | | Linux Memory Management site: http://www.linux.eu.org/Linux-MM/ | | Nederlandse Linux documentatie: http://www.nl.linux.org/ | +-------------------------------------------------------------------+ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [patch] fix for OOM deadlock in swap_in (2.2.10) [Re: [test program] for OOM situations ] 1999-07-04 9:48 ` Rik van Riel @ 1999-07-04 14:35 ` Andrea Arcangeli 1999-07-04 17:11 ` Linus Torvalds 1 sibling, 0 replies; 8+ messages in thread From: Andrea Arcangeli @ 1999-07-04 14:35 UTC (permalink / raw) To: Rik van Riel; +Cc: Bernd Kaindl, Linux Kernel, kernel, linux-mm, Linus Torvalds On Sun, 4 Jul 1999, Rik van Riel wrote: >I'm curious why you haven't yet included my process >selection algorithm. I know it can select a blocked >or otherwise unkillable process the way the code >is right now, but a workaround for that can be made >in about 5 minutes. Yes, we can try killing the "best" task for some time and cache its task struct; if the task doesn't go away shortly (say in 5 sec) then we can revert to the safe choice (which we always have) of killing the current task, which is sure to return to userspace soon. I didn't merge your selection algorithm into the patch I posted only because I tried to produce a patch with only bugfixes and simple things included, to give it a chance to go into 2.2.11. The only object of the patch is to keep Linux from deadlocking and to keep init from being screwed up. You can still prevent the admin from logging in on the console with an evil script, but the kernel will continue running fine (no deadlock). That's what 2.2.x should do for now. >The "show me the code" attitude doesn't seem to fit >me now, unfortunately. I'm still busy moving house >and I have to be a good python programmer by tomorrow. >Having never seen much python but with the book in >front of me :) Don't worry, a friend of mine told me that Python is very easy to learn :). I can't speak for myself since I've never used Python (I'll try it soon though). Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [patch] fix for OOM deadlock in swap_in (2.2.10) [Re: [test program] for OOM situations ] 1999-07-04 9:48 ` Rik van Riel 1999-07-04 14:35 ` Andrea Arcangeli @ 1999-07-04 17:11 ` Linus Torvalds 1999-07-04 17:38 ` Andrea Arcangeli 1 sibling, 1 reply; 8+ messages in thread From: Linus Torvalds @ 1999-07-04 17:11 UTC (permalink / raw) To: Rik van Riel Cc: Andrea Arcangeli, Bernd Kaindl, Linux Kernel, kernel, linux-mm, Alan Cox On Sun, 4 Jul 1999, Rik van Riel wrote: > On Sat, 3 Jul 1999, Andrea Arcangeli wrote: > > > +void oom(void) > > { > ... > > + force_sig(SIGKILL, current); > > > I would like to get some feedback about the patch. Thanks :). > > I'm curious why you haven't yet included my process > selection algorithm. I know it can select a blocked > or otherwise unkillable process the way the code > is right now, but a workaround for that can be made > in about 5 minutes. Andrea's patch has a much more serious problem: it changes accepted UNIX semantics. Try this before and after the patch: #include <unistd.h> #include <fcntl.h> #include <sys/mman.h> #define PAGE_SIZE 4096 int main(int argc, char **argv) { int fd; char * map; fd = open("/tmp/duh", O_RDWR | O_CREAT, 0666); if (fd < 0) exit(1); ftruncate(fd, PAGE_SIZE); map = mmap(NULL, PAGE_SIZE*2, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); *(volatile char *)(map+PAGE_SIZE); return 0; } and see the difference.. I have tried to fix this _correctly_ in 2.3.10-pre2. That fix could be back-ported to 2.2.x, but Andrea's patch really is not acceptable. And Andrea, I told you this once already in private email. I told you why. Why don't you listen? "Fixing" a bug badly is worse than leaving it as a known bug. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [patch] fix for OOM deadlock in swap_in (2.2.10) [Re: [test program] for OOM situations ] 1999-07-04 17:11 ` Linus Torvalds @ 1999-07-04 17:38 ` Andrea Arcangeli 1999-07-04 17:49 ` Linus Torvalds 0 siblings, 1 reply; 8+ messages in thread From: Andrea Arcangeli @ 1999-07-04 17:38 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Bernd Kaindl, Linux Kernel, kernel, linux-mm, Alan Cox On Sun, 4 Jul 1999, Linus Torvalds wrote: >Andreas patch has a much more serious problem: it changes accepted UNIX >semantics. Try this before and after the patch: > >#include <unistd.h> >#include <fcntl.h> >#include <sys/mman.h> > >#define PAGE_SIZE 4096 > >int main(int argc, char **argv) >{ > int fd; > char * map; > > fd = open("/tmp/duh", O_RDWR | O_CREAT, 0666); > if (fd < 0) > exit(1); > ftruncate(fd, PAGE_SIZE); > map = mmap(NULL, PAGE_SIZE*2, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); > *(volatile char *)(map+PAGE_SIZE); > return 0; >} On stock 2.2.10: andrea@black:~$ strace ./a.out execve("./a.out", ["./a.out"], [/* 24 vars */]) = 0 brk(0) = 0x804a69c open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat(3, {st_mode=0, st_size=0, ...}) = 0 mmap(0, 6184, PROT_READ, MAP_PRIVATE, 3, 0) = 0x4000c000 close(3) = 0 open("/lib/libc.so.6", O_RDONLY) = 3 mmap(0, 4096, PROT_READ, MAP_PRIVATE, 3, 0) = 0x4000e000 munmap(0x4000e000, 4096) = 0 mmap(0, 672848, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x4000e000 mprotect(0x400a0000, 74832, PROT_NONE) = 0 mmap(0x400a0000, 28672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x91000) = 0x400a0000 mmap(0x400a7000, 46160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x400a7000 close(3) = 0 personality(PER_LINUX) = 0 getpid() = 130 open("/tmp/duh", O_RDWR|O_CREAT, 0666) = 3 ftruncate(3, 4096) = 0 mmap(0, 8192, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x400b3000 --- SIGBUS (Bus error) --- +++ killed by SIGBUS +++ On 2.2.10 + my oom patch: 
andrea@black:~$ strace ./a.out execve("./a.out", ["./a.out"], [/* 24 vars */]) = 0 brk(0) = 0x804a69c open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat(3, {st_mode=0, st_size=0, ...}) = 0 mmap(0, 6184, PROT_READ, MAP_PRIVATE, 3, 0) = 0x4000c000 close(3) = 0 open("/lib/libc.so.6", O_RDONLY) = 3 mmap(0, 4096, PROT_READ, MAP_PRIVATE, 3, 0) = 0x4000e000 munmap(0x4000e000, 4096) = 0 mmap(0, 672848, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x4000e000 mprotect(0x400a0000, 74832, PROT_NONE) = 0 mmap(0x400a0000, 28672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x91000) = 0x400a0000 mmap(0x400a7000, 46160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x400a7000 close(3) = 0 personality(PER_LINUX) = 0 getpid() = 131 open("/tmp/duh", O_RDWR|O_CREAT, 0666) = 3 ftruncate(3, 4096) = 0 mmap(0, 8192, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x400b3000 --- SIGBUS (Bus error) --- +++ killed by SIGBUS +++ >and see the difference.. I can't see any difference: andrea@black:~$ diff 2.2.10 2.2.10- 17c17 < getpid() = 134 --- > getpid() = 128 >And Andrea, I told you this once already in private email. Yes, the first patch I sent to you privately some weeks ago was plainly buggy (I wasn't aware of the shared-mmap-SIGBUS UNIX semantics). And I really thank you very much for having spent some time teaching me why it was buggy. >Why don't you listen? "Fixing" a bug badly is worse than leaving it as a >known bug. I think I do listen; I remember your emails well. I wrote the new patch with your emails in mind. Now I send a SIGBUS and _nothing_ more when somebody accesses beyond the end of the file in a shared mmap. 
The first patch I sent you some time ago was buggy since I replaced the SIGBUS with a SIGKILL in do_page_fault, but now I force the signals only at the lower level (as shm and other places were already doing), and the retval of handle_mm_fault now _only_ tells do_page_fault whether it has to run the fixup or not. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [patch] fix for OOM deadlock in swap_in (2.2.10) [Re: [test program] for OOM situations ] 1999-07-04 17:38 ` Andrea Arcangeli @ 1999-07-04 17:49 ` Linus Torvalds 1999-07-04 20:15 ` Andrea Arcangeli 0 siblings, 1 reply; 8+ messages in thread From: Linus Torvalds @ 1999-07-04 17:49 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Bernd Kaindl, Linux Kernel, kernel, linux-mm, Alan Cox On Sun, 4 Jul 1999, Andrea Arcangeli wrote: > > The first patch I sent you some time ago was buggy since I replaced the > sigbus with a sigkill in do_page_fault, but now I force the signals only > at the lower level (as shm and other places was just doing) and the retval > of handle_mm_fault now _only_ tells do_page_fault if it has to fixup or > not. Ok. I still have your old patch, I'll just flush it so I don't confuse it with anything else. However, I still much prefer the 2.3.x approach (ie just returning more than just 0/1 - a negative number means out-of-memory). In particular, your current approach gets the ptrace() case wrong for the SIGBUS case, and it's pretty much impossible to fix cleanly as far as I can tell. Note that 2.3.10-pre2 also gets ptrace() wrong, but at least it's not impossible to fix - it should just bother to check the return value it gets from handle_mm_fault(). Right now it doesn't. Note that ptrace() is a horrible special case, being the only thing that accesses another process VM space (apart from vmscan which is also horrible, in other ways). HOWEVER, it's rather bad to have a SIGBUS problem and then when you try to debug it the debugger also gets a SIGBUS, which is what your approach results in. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [patch] fix for OOM deadlock in swap_in (2.2.10) [Re: [test program] for OOM situations ] 1999-07-04 17:49 ` Linus Torvalds @ 1999-07-04 20:15 ` Andrea Arcangeli 1999-07-04 20:32 ` Linus Torvalds 0 siblings, 1 reply; 8+ messages in thread From: Andrea Arcangeli @ 1999-07-04 20:15 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Bernd Kaindl, Linux Kernel, kernel, linux-mm, Alan Cox On Sun, 4 Jul 1999, Linus Torvalds wrote: >Ok. I still have your old patch, I'll just flush it so I don't confuse it >with anything else. Ok. Excuse me for not specifying it was a new patch. >However, I still much prefer the 2.3.x approach (ie just returning more >than just 0/1 - a negative number means out-of-memory). In particular, >your current approach gets the ptrace() case wrong for the SIGBUS case, >and it's pretty much impossible to fix cleanly as far as I can tell. Whoops, I see your point; I have now changed the patch as you suggested to catch the ptrace case properly. Now the retval of handle_mm_fault means this: 1 -> page fault resolved successfully 0 -> we accessed a shared mmap beyond the end of the file. So do_page_fault will send a SIGBUS in this case, while ptrace won't send the SIGBUS and will return 0. -1 -> OOM. But we don't need to send a SIGKILL since the lower layer just sent out the SIGKILL for us. Here is a new patch against 2.2.10 (called oom-2.2.10-D). The differences from the previous one are: o in ptrace I return 0 if handle_mm_fault wasn't successful o in mark_page_present I stop if handle_mm_fault went OOM, but I don't send the SIGBUS there. o in __verify_write on i386, where we don't have the write protect bit, I send the SIGBUS if handle_mm_fault returned 0 This new patch completely replaces the previous one. Please take a look also at what I am doing in swapin_readahead. 
I could also get the information from read_swap_cache_async about whether the swap entry was empty or we didn't have enough memory to do the swapin (and so I could continue the readahead when it's not an OOM condition, without harming the OOM handling), but I think that if there's a hole in the cluster it's not very important to swap in the other entries. This is because even if we access the cluster in write mode (so that we'll then have a chance to take over the swap cache), we only had one chance to swap in the whole cluster in readahead mode. I tested the patch with the sigbus testcase you sent in the previous email + with the memory hog I use to check the oom conditions. Index: linux//arch/i386/kernel/ptrace.c =================================================================== RCS file: /var/cvs/linux/arch/i386/kernel/ptrace.c,v retrieving revision 1.1.1.5 diff -u -r1.1.1.5 ptrace.c --- linux//arch/i386/kernel/ptrace.c 1999/06/14 15:17:43 1.1.1.5 +++ linux//arch/i386/kernel/ptrace.c 1999/07/04 19:12:44 @@ -84,8 +84,9 @@ repeat: pgdir = pgd_offset(vma->vm_mm, addr); if (pgd_none(*pgdir)) { - handle_mm_fault(tsk, vma, addr, 0); - goto repeat; + if (handle_mm_fault(tsk, vma, addr, 0) == 1) + goto repeat; + return 0; } if (pgd_bad(*pgdir)) { printk("ptrace: bad page directory %08lx\n", pgd_val(*pgdir)); @@ -94,8 +95,9 @@ } pgmiddle = pmd_offset(pgdir, addr); if (pmd_none(*pgmiddle)) { - handle_mm_fault(tsk, vma, addr, 0); - goto repeat; + if (handle_mm_fault(tsk, vma, addr, 0) == 1) + goto repeat; + return 0; } if (pmd_bad(*pgmiddle)) { printk("ptrace: bad page middle %08lx\n", pmd_val(*pgmiddle)); @@ -104,8 +106,9 @@ } pgtable = pte_offset(pgmiddle, addr); if (!pte_present(*pgtable)) { - handle_mm_fault(tsk, vma, addr, 0); - goto repeat; + if (handle_mm_fault(tsk, vma, addr, 0) == 1) + goto repeat; + return 0; } page = pte_page(*pgtable); /* this is a hack for non-kernel-mapped video buffers and similar */ @@ -135,8 +138,9 @@ repeat: pgdir = pgd_offset(vma->vm_mm, addr); if 
(!pgd_present(*pgdir)) { - handle_mm_fault(tsk, vma, addr, 1); - goto repeat; + if (handle_mm_fault(tsk, vma, addr, 1) == 1) + goto repeat; + return; } if (pgd_bad(*pgdir)) { printk("ptrace: bad page directory %08lx\n", pgd_val(*pgdir)); @@ -145,8 +149,9 @@ } pgmiddle = pmd_offset(pgdir, addr); if (pmd_none(*pgmiddle)) { - handle_mm_fault(tsk, vma, addr, 1); - goto repeat; + if (handle_mm_fault(tsk, vma, addr, 1) == 1) + goto repeat; + return; } if (pmd_bad(*pgmiddle)) { printk("ptrace: bad page middle %08lx\n", pmd_val(*pgmiddle)); @@ -155,13 +160,15 @@ } pgtable = pte_offset(pgmiddle, addr); if (!pte_present(*pgtable)) { - handle_mm_fault(tsk, vma, addr, 1); - goto repeat; + if (handle_mm_fault(tsk, vma, addr, 1) == 1) + goto repeat; + return; } page = pte_page(*pgtable); if (!pte_write(*pgtable)) { - handle_mm_fault(tsk, vma, addr, 1); - goto repeat; + if (handle_mm_fault(tsk, vma, addr, 1) == 1) + goto repeat; + return; } /* this is a hack for non-kernel-mapped video buffers and similar */ if (MAP_NR(page) < max_mapnr) Index: linux//arch/i386/mm/fault.c =================================================================== RCS file: /var/cvs/linux/arch/i386/mm/fault.c,v retrieving revision 1.1.1.1 diff -u -r1.1.1.1 fault.c --- linux//arch/i386/mm/fault.c 1999/01/18 01:28:56 1.1.1.1 +++ linux//arch/i386/mm/fault.c 1999/07/04 19:44:18 @@ -31,6 +31,7 @@ { struct vm_area_struct * vma; unsigned long start = (unsigned long) addr; + int fault_retval; if (!size) return 1; @@ -50,7 +51,9 @@ start &= PAGE_MASK; for (;;) { - handle_mm_fault(current,vma, start, 1); + fault_retval = handle_mm_fault(current,vma, start, 1); + if (fault_retval != 1) + goto failed; if (!size) break; size--; @@ -70,7 +73,11 @@ goto bad_area; if (expand_stack(vma, start) == 0) goto good_area; + return 0; +failed: + if (!fault_retval) + force_sig(SIGBUS, current); bad_area: return 0; } @@ -96,7 +103,7 @@ unsigned long address; unsigned long page; unsigned long fixup; - int write; + int write, 
fault_retval; /* get the address */ __asm__("movl %%cr2,%0":"=r" (address)); @@ -162,8 +169,9 @@ * make sure we exit gracefully rather than endlessly redo * the fault. */ - if (!handle_mm_fault(tsk, vma, address, write)) - goto do_sigbus; + fault_retval = handle_mm_fault(tsk, vma, address, write); + if (fault_retval != 1) + goto failed; /* * Did it hit the DOS screen memory VA from vm86 mode? @@ -255,7 +263,9 @@ * We ran out of memory, or some other thing happened to us that made * us unable to handle the page fault gracefully. */ -do_sigbus: +failed: + if (!fault_retval) + force_sig(SIGBUS, tsk); up(&mm->mmap_sem); /* @@ -265,7 +275,6 @@ tsk->tss.cr2 = address; tsk->tss.error_code = error_code; tsk->tss.trap_no = 14; - force_sig(SIGBUS, tsk); /* Kernel mode? Handle exceptions or die */ if (!(error_code & 4)) Index: linux//include/linux/mm.h =================================================================== RCS file: /var/cvs/linux/include/linux/mm.h,v retrieving revision 1.1.1.10 diff -u -r1.1.1.10 mm.h --- linux//include/linux/mm.h 1999/06/14 15:29:23 1.1.1.10 +++ linux//include/linux/mm.h 1999/07/04 17:28:35 @@ -300,7 +300,7 @@ extern unsigned long paging_init(unsigned long start_mem, unsigned long end_mem); extern void mem_init(unsigned long start_mem, unsigned long end_mem); extern void show_mem(void); -extern void oom(struct task_struct * tsk); +extern void oom(void); extern void si_meminfo(struct sysinfo * val); /* mmap.c */ Index: linux//include/linux/swap.h =================================================================== RCS file: /var/cvs/linux/include/linux/swap.h,v retrieving revision 1.1.1.5 diff -u -r1.1.1.5 swap.h --- linux//include/linux/swap.h 1999/06/14 15:29:38 1.1.1.5 +++ linux//include/linux/swap.h 1999/07/04 17:28:35 @@ -69,6 +69,7 @@ extern struct inode swapper_inode; extern unsigned long page_cache_size; extern int buffermem; +extern struct wait_queue * kswapd_wait_oom; /* Incomplete types for prototype declarations: */ struct 
task_struct; @@ -91,8 +92,8 @@ extern void swap_after_unlock_page (unsigned long entry); /* linux/mm/page_alloc.c */ -extern void swap_in(struct task_struct *, struct vm_area_struct *, - pte_t *, unsigned long, int); +extern int swap_in(struct task_struct *, struct vm_area_struct *, + pte_t *, unsigned long, int); /* linux/mm/swap_state.c */ Index: linux//ipc/shm.c =================================================================== RCS file: /var/cvs/linux/ipc/shm.c,v retrieving revision 1.1.1.10 diff -u -r1.1.1.10 shm.c --- linux//ipc/shm.c 1999/06/14 15:30:04 1.1.1.10 +++ linux//ipc/shm.c 1999/07/04 17:28:35 @@ -636,7 +636,7 @@ if (!pte_present(pte)) { unsigned long page = get_free_page(GFP_USER); if (!page) { - oom(current); + oom(); return 0; } pte = __pte(shp->shm_pages[idx]); Index: linux//kernel/printk.c =================================================================== RCS file: /var/cvs/linux/kernel/printk.c,v retrieving revision 1.1.1.6 diff -u -r1.1.1.6 printk.c --- linux//kernel/printk.c 1999/06/14 15:30:05 1.1.1.6 +++ linux//kernel/printk.c 1999/07/04 17:28:35 @@ -147,7 +147,8 @@ log_size--; log_start &= LOG_BUF_LEN-1; sti(); - __put_user(c,buf); + if ((error = __put_user(c,buf)) == -EFAULT) + goto out; buf++; i++; cli(); @@ -183,7 +184,8 @@ __restore_flags(flags); for (i = 0; i < count; i++) { c = *((char *) log_buf+(j++ & (LOG_BUF_LEN-1))); - __put_user(c, buf++); + if ((error = __put_user(c, buf++)) == -EFAULT) + goto out; } if (do_clear) logged_chars = 0; Index: linux//kernel/signal.c =================================================================== RCS file: /var/cvs/linux/kernel/signal.c,v retrieving revision 1.1.1.9 diff -u -r1.1.1.9 signal.c --- linux//kernel/signal.c 1999/06/14 15:30:06 1.1.1.9 +++ linux//kernel/signal.c 1999/07/04 17:28:35 @@ -409,6 +409,9 @@ { unsigned long int flags; + if (t->pid == 1) + return 0; + spin_lock_irqsave(&t->sigmask_lock, flags); if (t->sig == NULL) { spin_unlock_irqrestore(&t->sigmask_lock, flags); Index: 
linux//mm/filemap.c =================================================================== RCS file: /var/cvs/linux/mm/filemap.c,v retrieving revision 1.1.1.17 diff -u -r1.1.1.17 filemap.c --- linux//mm/filemap.c 1999/06/14 15:30:07 1.1.1.17 +++ linux//mm/filemap.c 1999/07/04 18:57:48 @@ -958,7 +958,7 @@ if (no_share && !new_page) { new_page = page_cache_alloc(); if (!new_page) - goto failure; + goto release_and_oom; } if (PageLocked(page)) @@ -1006,7 +1006,7 @@ if (!new_page) new_page = page_cache_alloc(); if (!new_page) - goto no_page; + goto oom; /* * During getting the above page we might have slept, @@ -1060,6 +1060,12 @@ page_cache_free(new_page); no_page: return 0; + +release_and_oom: + page_cache_release(page); +oom: + oom(); + return -1; } /* Index: linux//mm/memory.c =================================================================== RCS file: /var/cvs/linux/mm/memory.c,v retrieving revision 1.1.1.8 diff -u -r1.1.1.8 memory.c --- linux//mm/memory.c 1999/06/14 15:30:07 1.1.1.8 +++ linux//mm/memory.c 1999/07/04 20:00:10 @@ -65,10 +65,23 @@ * oom() prints a message (so that the user knows why the process died), * and gives the process an untrappable SIGKILL. */ -void oom(struct task_struct * task) +void oom(void) { - printk("\nOut of memory for %s.\n", task->comm); - force_sig(SIGKILL, task); + if (current->pid == 1) + return; + + /* + * Make sure the killed task will be scheduled ASAP. Otherwise + * we risk to reschedule another task before running the signal + * hander. + */ + current->policy = SCHED_FIFO; + current->rt_priority = 1000000; + + printk("\nOut of memory for %s.\n", current->comm); + force_sig(SIGKILL, current); + + wake_up_interruptible(&kswapd_wait_oom); } /* @@ -545,19 +558,6 @@ } /* - * sanity-check function.. 
- */ -static void put_page(pte_t * page_table, pte_t pte) -{ - if (!pte_none(*page_table)) { - free_page_and_swap_cache(pte_page(pte)); - return; - } -/* no need for flush_tlb */ - set_pte(page_table, pte); -} - -/* * This routine is used to map in a page into an address space: needed by * execve() for the initial stack and environment pages. */ @@ -575,13 +575,13 @@ pmd = pmd_alloc(pgd, address); if (!pmd) { free_page(page); - oom(tsk); + oom(); return 0; } pte = pte_alloc(pmd, address); if (!pte) { free_page(page); - oom(tsk); + oom(); return 0; } if (!pte_none(*pte)) { @@ -614,21 +614,15 @@ * and potentially makes it more efficient. */ static int do_wp_page(struct task_struct * tsk, struct vm_area_struct * vma, - unsigned long address, pte_t *page_table) + unsigned long address, pte_t *page_table, pte_t pte) { - pte_t pte; unsigned long old_page, new_page; struct page * page_map; - pte = *page_table; new_page = __get_free_page(GFP_USER); /* Did swap_out() unmapped the protected page while we slept? 
*/ if (pte_val(*page_table) != pte_val(pte)) goto end_wp_page; - if (!pte_present(pte)) - goto end_wp_page; - if (pte_write(pte)) - goto end_wp_page; old_page = pte_page(pte); if (MAP_NR(old_page) >= max_mapnr) goto bad_wp_page; @@ -684,13 +678,16 @@ return 1; bad_wp_page: + unlock_kernel(); printk("do_wp_page: bogus page at address %08lx (%08lx)\n",address,old_page); send_sig(SIGKILL, tsk, 1); -no_new_page: - unlock_kernel(); if (new_page) free_page(new_page); return 0; +no_new_page: + unlock_kernel(); + oom(); + return -1; } /* @@ -787,8 +784,9 @@ struct vm_area_struct * vma, unsigned long address, pte_t * page_table, pte_t entry, int write_access) { + int ret = 1; if (!vma->vm_ops || !vma->vm_ops->swapin) { - swap_in(tsk, vma, page_table, pte_val(entry), write_access); + ret = swap_in(tsk, vma, page_table, pte_val(entry), write_access); flush_page_to_ram(pte_page(*page_table)); } else { pte_t page = vma->vm_ops->swapin(vma, address - vma->vm_start + vma->vm_offset, pte_val(entry)); @@ -805,7 +803,7 @@ } } unlock_kernel(); - return 1; + return ret; } /* @@ -817,15 +815,18 @@ if (write_access) { unsigned long page = __get_free_page(GFP_USER); if (!page) - return 0; + goto oom; clear_page(page); entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot))); vma->vm_mm->rss++; tsk->min_flt++; flush_page_to_ram(page); } - put_page(page_table, entry); + set_pte(page_table, entry); return 1; + oom: + oom(); + return -1; } /* @@ -882,7 +883,7 @@ } else if (atomic_read(&mem_map[MAP_NR(page)].count) > 1 && !(vma->vm_flags & VM_SHARED)) entry = pte_wrprotect(entry); - put_page(page_table, entry); + set_pte(page_table, entry); /* no need to invalidate: a not-present page shouldn't be cached */ return 1; } @@ -916,7 +917,7 @@ flush_tlb_page(vma, address); if (write_access) { if (!pte_write(entry)) - return do_wp_page(tsk, vma, address, pte); + return do_wp_page(tsk, vma, address, pte, entry); entry = pte_mkdirty(entry); set_pte(pte, entry); @@ -934,19 +935,24 @@ { pgd_t 
*pgd; pmd_t *pmd; + pte_t * pte; + int ret; pgd = pgd_offset(vma->vm_mm, address); pmd = pmd_alloc(pgd, address); - if (pmd) { - pte_t * pte = pte_alloc(pmd, address); - if (pte) { - if (handle_pte_fault(tsk, vma, address, write_access, pte)) { - update_mmu_cache(vma, address, *pte); - return 1; - } - } - } - return 0; + if (!pmd) + goto oom; + pte = pte_alloc(pmd, address); + if (!pte) + goto oom; + ret = handle_pte_fault(tsk, vma, address, write_access, pte); + if (ret == 1) + update_mmu_cache(vma, address, *pte); + return ret; + + oom: + oom(); + return -1; } /* @@ -960,7 +966,8 @@ vma = find_vma(current->mm, addr); write = (vma->vm_flags & VM_WRITE) != 0; while (addr < end) { - handle_mm_fault(current, vma, addr, write); + if (handle_mm_fault(current, vma, addr, write) == -1) + break; addr += PAGE_SIZE; } } Index: linux//mm/page_alloc.c =================================================================== RCS file: /var/cvs/linux/mm/page_alloc.c,v retrieving revision 1.1.1.10 diff -u -r1.1.1.10 page_alloc.c --- linux//mm/page_alloc.c 1999/06/14 15:30:08 1.1.1.10 +++ linux//mm/page_alloc.c 1999/07/04 18:52:14 @@ -373,8 +373,13 @@ /* Ok, do the async read-ahead now */ new_page = read_swap_cache_async(SWP_ENTRY(SWP_TYPE(entry), offset), 0); - if (new_page != NULL) - __free_page(new_page); + /* + * If we are OOM or if this isn't a contiguous swap cluster + * stop the readahead. + */ + if (!new_page) + return; + __free_page(new_page); offset++; } while (--i); return; @@ -387,7 +392,7 @@ * Also, don't bother to add to the swap cache if this page-in * was due to a write access. 
*/ -void swap_in(struct task_struct * tsk, struct vm_area_struct * vma, +int swap_in(struct task_struct * tsk, struct vm_area_struct * vma, pte_t * page_table, unsigned long entry, int write_access) { unsigned long page; @@ -400,13 +405,11 @@ if (pte_val(*page_table) != entry) { if (page_map) free_page_and_swap_cache(page_address(page_map)); - return; + return 1; } if (!page_map) { - set_pte(page_table, BAD_PAGE); - swap_free(entry); - oom(tsk); - return; + oom(); + return -1; } page = page_address(page_map); @@ -416,7 +419,7 @@ if (!write_access || is_page_shared(page_map)) { set_pte(page_table, mk_pte(page, vma->vm_page_prot)); - return; + return 1; } /* @@ -426,5 +429,5 @@ */ delete_from_swap_cache(page_map); set_pte(page_table, pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)))); - return; + return 1; } Index: linux//mm/swapfile.c =================================================================== RCS file: /var/cvs/linux/mm/swapfile.c,v retrieving revision 1.1.1.8 diff -u -r1.1.1.8 swapfile.c --- linux//mm/swapfile.c 1999/06/14 15:30:09 1.1.1.8 +++ linux//mm/swapfile.c 1999/07/04 17:28:35 @@ -131,15 +131,17 @@ offset = SWP_OFFSET(entry); if (offset >= p->max) goto bad_offset; - if (offset < p->lowest_bit) - p->lowest_bit = offset; - if (offset > p->highest_bit) - p->highest_bit = offset; if (!p->swap_map[offset]) goto bad_free; if (p->swap_map[offset] < SWAP_MAP_MAX) { if (!--p->swap_map[offset]) + { + if (offset < p->lowest_bit) + p->lowest_bit = offset; + if (offset > p->highest_bit) + p->highest_bit = offset; nr_swap_pages++; + } } #ifdef DEBUG_SWAP printk("DebugVM: swap_free(entry %08lx, count now %d)\n", Index: linux//mm/vmscan.c =================================================================== RCS file: /var/cvs/linux/mm/vmscan.c,v retrieving revision 1.1.1.8 diff -u -r1.1.1.8 vmscan.c --- linux//mm/vmscan.c 1999/06/14 15:30:09 1.1.1.8 +++ linux//mm/vmscan.c 1999/07/04 17:28:35 @@ -308,7 +308,8 @@ static int swap_out(unsigned int priority, int 
gfp_mask) { struct task_struct * p, * pbest; - int counter, assign, max_cnt; + int assign = 0, counter; + unsigned long max_cnt; /* * We make one or two passes through the task list, indexed by @@ -327,11 +328,8 @@ counter = nr_tasks / (priority+1); if (counter < 1) counter = 1; - if (counter > nr_tasks) - counter = nr_tasks; for (; counter >= 0; counter--) { - assign = 0; max_cnt = 0; pbest = NULL; select: @@ -343,7 +341,7 @@ if (p->mm->rss <= 0) continue; /* Refresh swap_cnt? */ - if (assign) + if (assign == 1) p->mm->swap_cnt = p->mm->rss; if (p->mm->swap_cnt > max_cnt) { max_cnt = p->mm->swap_cnt; @@ -351,6 +349,8 @@ } } read_unlock(&tasklist_lock); + if (assign == 1) + assign = 2; if (!pbest) { if (!assign) { assign = 1; @@ -435,7 +435,8 @@ printk ("Starting kswapd v%.*s\n", i, s); } -static struct task_struct *kswapd_process; +static struct wait_queue * kswapd_wait = NULL; +struct wait_queue * kswapd_wait_oom = NULL; /* * The background pageout daemon, started as a kernel thread @@ -455,7 +456,6 @@ { struct task_struct *tsk = current; - kswapd_process = tsk; tsk->session = 1; tsk->pgrp = 1; strcpy(tsk->comm, "kswapd"); @@ -484,16 +484,17 @@ * the processes needing more memory will wake us * up on a more timely basis. 
*/ - do { - if (nr_free_pages >= freepages.high) - break; - - if (!do_try_to_free_pages(GFP_KSWAPD)) - break; - } while (!tsk->need_resched); - run_task_queue(&tq_disk); - tsk->state = TASK_INTERRUPTIBLE; - schedule_timeout(HZ); + interruptible_sleep_on_timeout(&kswapd_wait, HZ); + while (nr_free_pages < freepages.high) + { + if (do_try_to_free_pages(GFP_KSWAPD)) + { + if (tsk->need_resched) + schedule(); + continue; + } + interruptible_sleep_on_timeout(&kswapd_wait_oom,10*HZ); + } } } @@ -516,7 +517,7 @@ { int retval = 1; - wake_up_process(kswapd_process); + wake_up_interruptible(&kswapd_wait); if (gfp_mask & __GFP_WAIT) retval = do_try_to_free_pages(gfp_mask); return retval;

Note: the most important part of the patch is still the swap_in
deadlock fix. Previously swap_in was always returning 1 even if it
went OOM, so if the swapin happened in a copy-user call we could
deadlock completely (easily reproducible by generating some syslog
entries while going OOM).

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
* Re: [patch] fix for OOM deadlock in swap_in (2.2.10) [Re: [test program] for OOM situations ]
  1999-07-04 20:15     ` Andrea Arcangeli
@ 1999-07-04 20:32       ` Linus Torvalds
  0 siblings, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 1999-07-04 20:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Bernd Kaindl, Linux Kernel, kernel, linux-mm, Alan Cox

On Sun, 4 Jul 1999, Andrea Arcangeli wrote:
>
> Woops, I see your point, now I changed the patch as you suggested to catch
> the ptrace case properly. Now the retval of handle_mm_fault means this:
>
> 	 1 -> page fault resolved succesfully
> 	 0 -> we accessed a shared mmap beyond the end of the file.
> 	      So do_page_fault will send a sigbus in this case while
> 	      ptrace won't send the sigbus and will return 0.
> 	-1 -> OOM. But we don't need to send a sigkill since the lower
> 	      layer just sent out the sigkill for us.

Note that if you handle OOM in the caller instead of in the lower
layers, you get _much_ better behaviour. In particular, the page fault
handler knows whether the fault happened in kernel mode or in user
mode, and can do different things depending on which it was.

See 2.3.10-pre2 - when the OOM fault happens from user mode (the most
common case by far), you can just do a "do_exit(SIGKILL)" and kill it
off much more cleanly without having to go through any extra
pseudo-real-time stuff.

		Linus
Thread overview: 8+ messages
[not found] <Pine.LNX.4.10.9907010152410.17156-100000@Wotan.suse.de>
1999-07-03 13:52 ` [patch] fix for OOM deadlock in swap_in (2.2.10) [Re: [test program] for OOM situations ] Andrea Arcangeli
1999-07-04 9:48 ` Rik van Riel
1999-07-04 14:35 ` Andrea Arcangeli
1999-07-04 17:11 ` Linus Torvalds
1999-07-04 17:38 ` Andrea Arcangeli
1999-07-04 17:49 ` Linus Torvalds
1999-07-04 20:15 ` Andrea Arcangeli
1999-07-04 20:32 ` Linus Torvalds