From: Andrea Arcangeli <andrea@suse.de>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Andrew Morton <akpm@osdl.org>,
Linux Kernel list <linux-kernel@vger.kernel.org>,
Ingo Molnar <mingo@elte.hu>, Ben LaHaise <bcrl@redhat.com>,
linux-mm@kvack.org
Subject: Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
Date: Tue, 25 May 2004 05:43:26 +0200 [thread overview]
Message-ID: <20040525034326.GT29378@dualathlon.random> (raw)
In-Reply-To: <Pine.LNX.4.58.0405232149380.25502@ppc970.osdl.org>
On Sun, May 23, 2004 at 10:10:16PM -0700, Linus Torvalds wrote:
> what you found is actually a (really really) unlikely race condition that
> can have serious consequences.
agreed, it could in theory lose the dirty bit.
> course, back then SMP wasn't an issue, and this seems to have survived all
> the SMP fixes.
looks like to trigger this two threads needs to page fault on the same
not-present MAP_SHARED page at the same time and the second thread must be a
read-fault. The first thread will establish the pte writeable and
dirty, then the first thread releases the page_table_lock, then the VM
takes the page_table_lock (from kswapd or similar) then the VM transfers
the dirty bit from pte to page and it even starts the and completes the
I/O before the the read-fault has a chance to execute. Then after the
I/O is completed the second thread finally takes the page_table_lock and
reads the pte. Then the first thread writes to the page from userspace
marking the pte dirty , and finally the second thread completes the
page_fault running pte_establish and losing the dirty bit on the page.
Maybe it can trigger even without I/O in between, not sure, certainly it
can happen in the above scenario, but the I/O window is quite huge for
not being able to take a spinlock before the I/O is started.
The below patch should fix it, the only problem is that it can screwup
some arch that might use page-faults to keep track of the accessed bit,
I think for istance ia64 might do that, not sure if it's using it in
linux though. anyways the below should be the optimal fix for x86 and
x86-64 at least and the other archs will not corrupt memory anymore too
(at worst they will live lock in userspace, and the livelock should be
solvable gracefully with sigkill since it's an userspace one).
warning, patch is absolutely untested.
Index: mm/memory.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/memory.c,v
retrieving revision 1.154
diff -u -p -r1.154 memory.c
--- mm/memory.c 20 Apr 2004 14:22:03 -0000 1.154
+++ mm/memory.c 25 May 2004 03:04:48 -0000
@@ -1665,10 +1665,30 @@ static inline int handle_pte_fault(struc
return do_wp_page(mm, vma, address, pte, pmd, entry);
entry = pte_mkdirty(entry);
+
+ /*
+ * We can overwrite the pte not atomically only if this is a
+ * write fault or we can lose the dirty bit. The dirty and the
+ * accessed bit are the only two things that can change form
+ * under us even if we hold the page_table_lock, they're
+ * set by hardware while other threads runs in userspace.
+ *
+ * This could cause an infinite loop with a read-fault if some
+ * arch tries to keep track of the accessed bit with a kernel
+ * page fault (but keeping track of the accessed bit with
+ * kernel exceptions sounds very inefficient anyways).
+ * If some arch infinite loops, they will have to implement
+ * some sort of atomic ptep_stablish where they atomically
+ * compare and swap, then we'll change the common code here
+ * to learn how to atomic swap the pte.
+ *
+ * x86 and x86-64 are sure optimal without atomic ops here and
+ * with this simple code.
+ */
+ entry = pte_mkyoung(entry);
+ ptep_establish(vma, address, pte, entry);
+ update_mmu_cache(vma, address, entry);
}
- entry = pte_mkyoung(entry);
- ptep_establish(vma, address, pte, entry);
- update_mmu_cache(vma, address, entry);
pte_unmap(pte);
spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
next prev parent reply other threads:[~2004-05-25 3:43 UTC|newest]
Thread overview: 71+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <1085369393.15315.28.camel@gaston>
[not found] ` <Pine.LNX.4.58.0405232046210.25502@ppc970.osdl.org>
[not found] ` <1085371988.15281.38.camel@gaston>
[not found] ` <Pine.LNX.4.58.0405232134480.25502@ppc970.osdl.org>
[not found] ` <1085373839.14969.42.camel@gaston>
2004-05-24 5:10 ` Linus Torvalds
2004-05-24 5:34 ` Benjamin Herrenschmidt
2004-05-24 5:38 ` Benjamin Herrenschmidt
2004-05-24 5:52 ` Benjamin Herrenschmidt
2004-05-24 7:39 ` Ingo Molnar
2004-05-24 5:39 ` Benjamin Herrenschmidt
2004-05-25 3:43 ` Andrea Arcangeli [this message]
2004-05-25 4:00 ` Linus Torvalds
2004-05-25 4:17 ` Benjamin Herrenschmidt
2004-05-25 4:37 ` Andrea Arcangeli
2004-05-25 4:40 ` Benjamin Herrenschmidt
2004-05-25 4:20 ` Andrea Arcangeli
2004-05-25 4:39 ` Linus Torvalds
2004-05-25 4:44 ` Linus Torvalds
2004-05-25 4:59 ` Andrea Arcangeli
2004-05-25 5:09 ` Andrea Arcangeli
2004-05-25 4:50 ` Andrea Arcangeli
2004-05-25 4:59 ` Linus Torvalds
2004-05-25 4:43 ` David Mosberger
2004-05-25 4:53 ` Andrea Arcangeli
2004-05-27 21:56 ` David Mosberger
2004-05-27 22:00 ` Benjamin Herrenschmidt
2004-05-27 22:12 ` David Mosberger
2004-05-25 11:44 ` Matthew Wilcox
2004-05-25 14:48 ` Linus Torvalds
2004-05-25 15:35 ` Keith M Wesolowski
2004-05-25 16:19 ` Linus Torvalds
2004-05-25 17:25 ` David S. Miller
2004-05-25 17:49 ` Linus Torvalds
2004-05-25 17:54 ` David S. Miller
2004-05-25 18:05 ` Linus Torvalds
2004-05-25 20:30 ` Linus Torvalds
2004-05-25 20:35 ` David S. Miller
2004-05-25 20:49 ` Linus Torvalds
2004-05-25 20:57 ` David S. Miller
2004-05-26 6:20 ` Keith M Wesolowski
2004-05-25 21:40 ` Benjamin Herrenschmidt
2004-05-25 21:54 ` Linus Torvalds
2004-05-25 22:00 ` Linus Torvalds
2004-05-25 22:07 ` Benjamin Herrenschmidt
2004-05-25 22:14 ` Linus Torvalds
2004-05-26 0:21 ` Benjamin Herrenschmidt
2004-05-26 0:50 ` Linus Torvalds
2004-05-26 3:25 ` Benjamin Herrenschmidt
2004-05-26 4:08 ` Linus Torvalds
2004-05-26 4:12 ` Benjamin Herrenschmidt
2004-05-26 4:18 ` Benjamin Herrenschmidt
2004-05-26 4:50 ` Linus Torvalds
2004-05-26 4:49 ` Benjamin Herrenschmidt
2004-05-26 4:28 ` Linus Torvalds
2004-05-26 4:46 ` Benjamin Herrenschmidt
2004-05-26 4:54 ` Linus Torvalds
2004-05-26 4:55 ` Benjamin Herrenschmidt
2004-05-26 5:41 ` Benjamin Herrenschmidt
2004-05-26 5:59 ` [PATCH] (signoff) " Benjamin Herrenschmidt
2004-05-26 6:55 ` Benjamin Herrenschmidt
2004-05-25 22:05 ` [PATCH] " Benjamin Herrenschmidt
2004-05-25 22:09 ` Linus Torvalds
2004-05-25 22:19 ` Benjamin Herrenschmidt
2004-05-25 22:24 ` Linus Torvalds
2004-05-25 21:27 ` Andrea Arcangeli
2004-05-25 21:43 ` Linus Torvalds
2004-05-25 21:55 ` Andrea Arcangeli
2004-05-25 22:01 ` Linus Torvalds
2004-05-25 22:18 ` Ivan Kokshaysky
2004-05-25 22:42 ` Andrea Arcangeli
2004-05-26 2:26 ` Linus Torvalds
2004-05-26 7:06 ` Andrea Arcangeli
2004-05-25 21:44 ` Andrea Arcangeli
2004-06-01 12:04 Martin Schwidefsky
2004-06-01 12:10 Martin Schwidefsky
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20040525034326.GT29378@dualathlon.random \
--to=andrea@suse.de \
--cc=akpm@osdl.org \
--cc=bcrl@redhat.com \
--cc=benh@kernel.crashing.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mingo@elte.hu \
--cc=torvalds@osdl.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox