linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Hugh Dickins <hugh.dickins@tiscali.co.uk>
To: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Rik van Riel <riel@redhat.com>, Chris Wright <chrisw@redhat.com>,
	Nick Piggin <nickpiggin@yahoo.com.au>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 7/12] ksm: fix endless loop on oom
Date: Mon, 3 Aug 2009 13:16:15 +0100 (BST)	[thread overview]
Message-ID: <Pine.LNX.4.64.0908031315200.16754@sister.anvils> (raw)
In-Reply-To: <Pine.LNX.4.64.0908031304430.16449@sister.anvils>

break_ksm has been looping endlessly ignoring VM_FAULT_OOM: that should
only be a problem for ksmd when a memory control group imposes limits
(normally the OOM killer will kill others with an mm until it succeeds);
but in general (especially for MADV_UNMERGEABLE and KSM_RUN_UNMERGE) we
do need to route the error (or kill) back to the caller (or sighandling).

Test signal_pending in unmerge_ksm_pages, which could be a lengthy
procedure if it has to spill into swap: returning -ERESTARTSYS so that
trivial signals will restart but fatals will terminate (is that right?
we do different things in different places in mm, none exactly this).

unmerge_and_remove_all_rmap_items was forgetting to lock when going
down the mm_list: fix that.  Whether it's successful or not, reset
ksm_scan cursor to head; but only if it's successful, reset seqnr
(shown in full_scans) - page counts will have gone down to zero.

This patch leaves a significant OOM deadlock, but it's a good step
on the way, and that deadlock is fixed in a subsequent patch.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---

 mm/ksm.c |  108 +++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 85 insertions(+), 23 deletions(-)

--- ksm6/mm/ksm.c	2009-08-02 13:50:15.000000000 +0100
+++ ksm7/mm/ksm.c	2009-08-02 13:50:25.000000000 +0100
@@ -294,10 +294,10 @@ static inline int in_stable_tree(struct
  * Could a ksm page appear anywhere else?  Actually yes, in a VM_PFNMAP
  * mmap of /dev/mem or /dev/kmem, where we would not want to touch it.
  */
-static void break_ksm(struct vm_area_struct *vma, unsigned long addr)
+static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *page;
-	int ret;
+	int ret = 0;
 
 	do {
 		cond_resched();
@@ -310,9 +310,36 @@ static void break_ksm(struct vm_area_str
 		else
 			ret = VM_FAULT_WRITE;
 		put_page(page);
-	} while (!(ret & (VM_FAULT_WRITE | VM_FAULT_SIGBUS)));
-
-	/* Which leaves us looping there if VM_FAULT_OOM: hmmm... */
+	} while (!(ret & (VM_FAULT_WRITE | VM_FAULT_SIGBUS | VM_FAULT_OOM)));
+	/*
+	 * We must loop because handle_mm_fault() may back out if there's
+	 * any difficulty e.g. if pte accessed bit gets updated concurrently.
+	 *
+	 * VM_FAULT_WRITE is what we have been hoping for: it indicates that
+	 * COW has been broken, even if the vma does not permit VM_WRITE;
+	 * but note that a concurrent fault might break PageKsm for us.
+	 *
+	 * VM_FAULT_SIGBUS could occur if we race with truncation of the
+	 * backing file, which also invalidates anonymous pages: that's
+	 * okay, that truncation will have unmapped the PageKsm for us.
+	 *
+	 * VM_FAULT_OOM: at the time of writing (late July 2009), setting
+	 * aside mem_cgroup limits, VM_FAULT_OOM would only be set if the
+	 * current task has TIF_MEMDIE set, and will be OOM killed on return
+	 * to user; and ksmd, having no mm, would never be chosen for that.
+	 *
+	 * But if the mm is in a limited mem_cgroup, then the fault may fail
+	 * with VM_FAULT_OOM even if the current task is not TIF_MEMDIE; and
+	 * even ksmd can fail in this way - though it's usually breaking ksm
+	 * just to undo a merge it made a moment before, so unlikely to oom.
+	 *
+	 * That's a pity: we might therefore have more kernel pages allocated
+	 * than we're counting as nodes in the stable tree; but ksm_do_scan
+	 * will retry to break_cow on each pass, so should recover the page
+	 * in due course.  The important thing is to not let VM_MERGEABLE
+	 * be cleared while any such pages might remain in the area.
+	 */
+	return (ret & VM_FAULT_OOM) ? -ENOMEM : 0;
 }
 
 static void break_cow(struct mm_struct *mm, unsigned long addr)
@@ -462,39 +489,61 @@ static void remove_trailing_rmap_items(s
  * to the next pass of ksmd - consider, for example, how ksmd might be
  * in cmp_and_merge_page on one of the rmap_items we would be removing.
  */
-static void unmerge_ksm_pages(struct vm_area_struct *vma,
-			      unsigned long start, unsigned long end)
+static int unmerge_ksm_pages(struct vm_area_struct *vma,
+			     unsigned long start, unsigned long end)
 {
 	unsigned long addr;
+	int err = 0;
 
-	for (addr = start; addr < end; addr += PAGE_SIZE)
-		break_ksm(vma, addr);
+	for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
+		if (signal_pending(current))
+			err = -ERESTARTSYS;
+		else
+			err = break_ksm(vma, addr);
+	}
+	return err;
 }
 
-static void unmerge_and_remove_all_rmap_items(void)
+static int unmerge_and_remove_all_rmap_items(void)
 {
 	struct mm_slot *mm_slot;
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
+	int err = 0;
+
+	spin_lock(&ksm_mmlist_lock);
+	mm_slot = list_entry(ksm_mm_head.mm_list.next,
+						struct mm_slot, mm_list);
+	spin_unlock(&ksm_mmlist_lock);
 
-	list_for_each_entry(mm_slot, &ksm_mm_head.mm_list, mm_list) {
+	while (mm_slot != &ksm_mm_head) {
 		mm = mm_slot->mm;
 		down_read(&mm->mmap_sem);
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
 			if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
 				continue;
-			unmerge_ksm_pages(vma, vma->vm_start, vma->vm_end);
+			err = unmerge_ksm_pages(vma,
+						vma->vm_start, vma->vm_end);
+			if (err) {
+				up_read(&mm->mmap_sem);
+				goto out;
+			}
 		}
 		remove_trailing_rmap_items(mm_slot, mm_slot->rmap_list.next);
 		up_read(&mm->mmap_sem);
+
+		spin_lock(&ksm_mmlist_lock);
+		mm_slot = list_entry(mm_slot->mm_list.next,
+						struct mm_slot, mm_list);
+		spin_unlock(&ksm_mmlist_lock);
 	}
 
+	ksm_scan.seqnr = 0;
+out:
 	spin_lock(&ksm_mmlist_lock);
-	if (ksm_scan.mm_slot != &ksm_mm_head) {
-		ksm_scan.mm_slot = &ksm_mm_head;
-		ksm_scan.seqnr++;
-	}
+	ksm_scan.mm_slot = &ksm_mm_head;
 	spin_unlock(&ksm_mmlist_lock);
+	return err;
 }
 
 static void remove_mm_from_lists(struct mm_struct *mm)
@@ -1058,6 +1107,8 @@ static void cmp_and_merge_page(struct pa
 	/*
 	 * A ksm page might have got here by fork, but its other
 	 * references have already been removed from the stable tree.
+	 * Or it might be left over from a break_ksm which failed
+	 * when the mem_cgroup had reached its limit: try again now.
 	 */
 	if (PageKsm(page))
 		break_cow(rmap_item->mm, rmap_item->address);
@@ -1293,6 +1344,7 @@ int ksm_madvise(struct vm_area_struct *v
 		unsigned long end, int advice, unsigned long *vm_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	int err;
 
 	switch (advice) {
 	case MADV_MERGEABLE:
@@ -1305,9 +1357,11 @@ int ksm_madvise(struct vm_area_struct *v
 				 VM_MIXEDMAP  | VM_SAO))
 			return 0;		/* just ignore the advice */
 
-		if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
-			if (__ksm_enter(mm) < 0)
-				return -EAGAIN;
+		if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
+			err = __ksm_enter(mm);
+			if (err)
+				return err;
+		}
 
 		*vm_flags |= VM_MERGEABLE;
 		break;
@@ -1316,8 +1370,11 @@ int ksm_madvise(struct vm_area_struct *v
 		if (!(*vm_flags & VM_MERGEABLE))
 			return 0;		/* just ignore the advice */
 
-		if (vma->anon_vma)
-			unmerge_ksm_pages(vma, start, end);
+		if (vma->anon_vma) {
+			err = unmerge_ksm_pages(vma, start, end);
+			if (err)
+				return err;
+		}
 
 		*vm_flags &= ~VM_MERGEABLE;
 		break;
@@ -1448,8 +1505,13 @@ static ssize_t run_store(struct kobject
 	mutex_lock(&ksm_thread_mutex);
 	if (ksm_run != flags) {
 		ksm_run = flags;
-		if (flags & KSM_RUN_UNMERGE)
-			unmerge_and_remove_all_rmap_items();
+		if (flags & KSM_RUN_UNMERGE) {
+			err = unmerge_and_remove_all_rmap_items();
+			if (err) {
+				ksm_run = KSM_RUN_STOP;
+				count = err;
+			}
+		}
 	}
 	mutex_unlock(&ksm_thread_mutex);
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2009-08-03 11:56 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-08-03 12:08 [PATCH 0/12] ksm: stats, oom, doc, misc Hugh Dickins
2009-08-03 12:10 ` [PATCH 1/12] ksm: rename kernel_pages_allocated Hugh Dickins
2009-08-03 14:21   ` Izik Eidus
2009-08-03 16:48     ` Andrea Arcangeli
2009-08-03 12:11 ` [PATCH 2/12] ksm: move pages_sharing updates Hugh Dickins
2009-08-03 14:34   ` Izik Eidus
2009-08-03 16:53   ` Andrea Arcangeli
2009-08-03 17:34     ` Hugh Dickins
2009-08-03 12:11 ` [PATCH 3/12] ksm: pages_unshared and pages_volatile Hugh Dickins
2009-08-03 14:54   ` Izik Eidus
2009-08-04 21:49   ` Andrew Morton
2009-08-05 11:39     ` Hugh Dickins
2009-08-05 15:11       ` Andrea Arcangeli
2009-08-03 12:12 ` [PATCH 4/12] ksm: break cow once unshared Hugh Dickins
2009-08-03 16:00   ` Izik Eidus
2009-08-03 12:14 ` [PATCH 5/12] ksm: keep quiet while list empty Hugh Dickins
2009-08-03 16:55   ` Izik Eidus
2009-08-04 21:59   ` Andrew Morton
2009-08-05 11:54     ` Hugh Dickins
2009-08-03 12:15 ` [PATCH 6/12] ksm: five little cleanups Hugh Dickins
2009-08-04 12:41   ` Izik Eidus
2009-08-03 12:16 ` Hugh Dickins [this message]
2009-08-04 12:55   ` [PATCH 7/12] ksm: fix endless loop on oom Izik Eidus
2009-08-03 12:17 ` [PATCH 8/12] ksm: distribute remove_mm_from_lists Hugh Dickins
2009-08-04 13:03   ` Izik Eidus
2009-08-03 12:18 ` [PATCH 9/12] ksm: fix oom deadlock Hugh Dickins
2009-08-04 19:32   ` Izik Eidus
2009-08-25 14:58   ` Andrea Arcangeli
2009-08-25 15:22     ` [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock Andrea Arcangeli
2009-08-25 17:49       ` Hugh Dickins
2009-08-25 18:10         ` Andrea Arcangeli
2009-08-25 18:58           ` Hugh Dickins
2009-08-25 19:45             ` Andrea Arcangeli
2009-08-26 16:18               ` Justin M. Forbes
2009-08-26 19:17               ` Hugh Dickins
2009-08-26 19:44                 ` Andrea Arcangeli
2009-08-26 19:57                   ` Hugh Dickins
2009-08-26 20:28                     ` Andrea Arcangeli
2009-08-26 20:54                     ` Izik Eidus
2009-08-26 21:14                       ` Andrea Arcangeli
2009-08-26 21:49                         ` Izik Eidus
2009-08-27 19:11                           ` Hugh Dickins
2009-08-27 19:35                             ` Izik Eidus
2009-08-26 22:00                         ` David Rientjes
2009-08-26 20:29                   ` Hugh Dickins
2009-08-25 17:35     ` [PATCH 9/12] ksm: fix oom deadlock Hugh Dickins
2009-08-25 17:47       ` Andrea Arcangeli
2009-08-03 12:19 ` [PATCH 10/12] ksm: sysfs and defaults Hugh Dickins
2009-08-04 19:34   ` Izik Eidus
2009-08-03 12:21 ` [PATCH 11/12] ksm: add some documentation Hugh Dickins
2009-08-04 19:35   ` Izik Eidus
2009-08-03 12:22 ` [PATCH 12/12] ksm: remove VM_MERGEABLE_FLAGS Hugh Dickins
2009-08-04 19:35   ` Izik Eidus

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.64.0908031315200.16754@sister.anvils \
    --to=hugh.dickins@tiscali.co.uk \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=chrisw@redhat.com \
    --cc=ieidus@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nickpiggin@yahoo.com.au \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox