[patch] mm: fix PageUptodate memory ordering bug

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [patch] mm: fix PageUptodate memory ordering bug
@ 2007-12-18  1:26 Nick Piggin
  2007-12-22  8:57 ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2007-12-18  1:26 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins, Linux Memory Management List
  Cc: Benjamin Herrenschmidt

Hi,

Here is another version of this fix. It is now much simpler in that no
filesystems have to be audited and changed to SetPageUptodate under page
lock.  It also only executes the smp_rmb() conditionally if the page is
actually uptodate as suggested by Hugh. I guess this will be a win on most
architectures with out of order loads, but I don't know... maybe the
branch will be worse. On x86 it is a noop (except the compiler barrier).

---
After running SetPageUptodate, preceeding stores to the page contents to
actually bring it uptodate may not be ordered with the store to set the page
uptodate.

Therefore, another CPU which checks PageUptodate is true, then reads the
page contents can get stale data.

Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
PageUptodate.

Many places that test PageUptodate, do so with the page locked, and this
would be enough to ensure memory ordering in those places if SetPageUptodate
were only called while the page is locked. Unfortunately that is not always
the case for some filesystems, but it could be an idea for the future.

One thing I like about it is that it brings the handling of anonymous page
uptodateness in line with that of file backed page management, by marking anon
pages as uptodate when they _are_ uptodate, rather than when our implementation
requires that they be marked as such. Doing allows us to get rid of the
smp_wmb's in the page copying functions, which were especially added for
anonymous pages for an analogous memory ordering problem, and are now handled
with the same code as the PageUptodate memory ordering problem.

Introduce a SetNewPageUptodate for these anonymous pages: it contains non
atomic bitops so as not to introduce too much overhead into these paths.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/include/linux/highmem.h
===================================================================
--- linux-2.6.orig/include/linux/highmem.h
+++ linux-2.6/include/linux/highmem.h
@@ -68,8 +68,6 @@ static inline void clear_user_highpage(s
 	void *addr = kmap_atomic(page, KM_USER0);
 	clear_user_page(addr, vaddr, page);
 	kunmap_atomic(addr, KM_USER0);
-	/* Make sure this page is cleared on other CPU's too before using it */
-	smp_wmb();
 }
 
 #ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
@@ -160,8 +158,6 @@ static inline void copy_user_highpage(st
 	copy_user_page(vto, vfrom, vaddr, to);
 	kunmap_atomic(vfrom, KM_USER0);
 	kunmap_atomic(vto, KM_USER1);
-	/* Make sure this page is cleared on other CPU's too before using it */
-	smp_wmb();
 }
 
 #endif
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -131,16 +131,52 @@
 #define ClearPageReferenced(page)	clear_bit(PG_referenced, &(page)->flags)
 #define TestClearPageReferenced(page) test_and_clear_bit(PG_referenced, &(page)->flags)
 
-#define PageUptodate(page)	test_bit(PG_uptodate, &(page)->flags)
+static inline int PageUptodate(struct page *page)
+{
+	int ret = test_bit(PG_uptodate, &(page)->flags);
+
+	/*
+	 * Must ensure that the data we read out of the page is loaded
+	 * _after_ we've loaded page->flags to check for PageUptodate.
+	 * We can skip the barrier if the page is not uptodate, because
+	 * we wouldn't be reading anything from it.
+	 *
+	 * See SetPageUptodate() for the other side of the story.
+	 */
+	if (ret)
+		smp_rmb();
+
+	return ret;
+}
+
+static inline void SetNewPageUptodate(struct page *page)
+{
+	smp_wmb();
+	__set_bit(PG_uptodate, &(page)->flags);
 #ifdef CONFIG_S390
+	page_clear_dirty(page);
+#endif
+}
+
 static inline void SetPageUptodate(struct page *page)
 {
+#ifdef CONFIG_S390
 	if (!test_and_set_bit(PG_uptodate, &page->flags))
 		page_clear_dirty(page);
-}
 #else
-#define SetPageUptodate(page)	set_bit(PG_uptodate, &(page)->flags)
+	/*
+	 * Memory barrier must be issued before setting the PG_uptodate bit,
+	 * so that all previous stores issued in order to bring the page
+	 * uptodate are actually visible before PageUptodate becomes true.
+	 *
+	 * s390 doesn't need an explicit smp_wmb here because the test and
+	 * set bit already provides full barriers.
+	 */
+	smp_wmb();
+	set_bit(PG_uptodate, &(page)->flags);
 #endif
+}
+
 #define ClearPageUptodate(page)	clear_bit(PG_uptodate, &(page)->flags)
 
 #define PageDirty(page)		test_bit(PG_dirty, &(page)->flags)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -753,6 +753,7 @@ static int hugetlb_cow(struct mm_struct 
 
 	spin_unlock(&mm->page_table_lock);
 	copy_huge_page(new_page, old_page, address, vma);
+	SetNewPageUptodate(new_page);
 	spin_lock(&mm->page_table_lock);
 
 	ptep = huge_pte_offset(mm, address & HPAGE_MASK);
@@ -798,6 +799,7 @@ retry:
 			goto out;
 		}
 		clear_huge_page(page, address);
+		SetNewPageUptodate(page);
 
 		if (vma->vm_flags & VM_SHARED) {
 			int err;
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1516,10 +1516,8 @@ static inline void cow_user_page(struct 
 			memset(kaddr, 0, PAGE_SIZE);
 		kunmap_atomic(kaddr, KM_USER0);
 		flush_dcache_page(dst);
-		return;
-
-	}
-	copy_user_highpage(dst, src, va, vma);
+	} else
+		copy_user_highpage(dst, src, va, vma);
 }
 
 /*
@@ -1628,6 +1626,7 @@ gotten:
 	if (!new_page)
 		goto oom;
 	cow_user_page(new_page, old_page, address, vma);
+	SetNewPageUptodate(new_page);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2160,6 +2159,7 @@ static int do_anonymous_page(struct mm_s
 	page = alloc_zeroed_user_highpage_movable(vma, address);
 	if (!page)
 		goto oom;
+	SetNewPageUptodate(page);
 
 	entry = mk_pte(page, vma->vm_page_prot);
 	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2260,6 +2260,7 @@ static int __do_fault(struct mm_struct *
 				goto out;
 			}
 			copy_user_highpage(page, vmf.page, address, vma);
+			SetNewPageUptodate(page);
 		} else {
 			/*
 			 * If the page will be shareable, see if the backing
Index: linux-2.6/mm/page_io.c
===================================================================
--- linux-2.6.orig/mm/page_io.c
+++ linux-2.6/mm/page_io.c
@@ -126,7 +126,7 @@ int swap_readpage(struct file *file, str
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	ClearPageUptodate(page);
+	BUG_ON(PageUptodate(page));
 	bio = get_swap_bio(GFP_KERNEL, page_private(page), page,
 				end_swap_bio_read);
 	if (bio == NULL) {
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -152,6 +152,7 @@ int add_to_swap(struct page * page, gfp_
 	int err;
 
 	BUG_ON(!PageLocked(page));
+	BUG_ON(!PageUptodate(page));
 
 	for (;;) {
 		entry = get_swap_page();
@@ -174,7 +175,6 @@ int add_to_swap(struct page * page, gfp_
 
 		switch (err) {
 		case 0:				/* Success */
-			SetPageUptodate(page);
 			SetPageDirty(page);
 			INC_CACHE_INFO(add_total);
 			return 1;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-18  1:26 [patch] mm: fix PageUptodate memory ordering bug Nick Piggin
@ 2007-12-22  8:57 ` Andrew Morton
  2007-12-22 12:14   ` Hugh Dickins
  2007-12-23  5:57   ` Nick Piggin
  0 siblings, 2 replies; 29+ messages in thread
From: Andrew Morton @ 2007-12-22  8:57 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Linux Memory Management List, Benjamin Herrenschmidt

On Tue, 18 Dec 2007 02:26:32 +0100 Nick Piggin <npiggin@suse.de> wrote:

> After running SetPageUptodate, preceeding stores to the page contents to
> actually bring it uptodate may not be ordered with the store to set the page
> uptodate.
> 
> Therefore, another CPU which checks PageUptodate is true, then reads the
> page contents can get stale data.
> 
> Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
> PageUptodate.
> 
> Many places that test PageUptodate, do so with the page locked, and this
> would be enough to ensure memory ordering in those places if SetPageUptodate
> were only called while the page is locked. Unfortunately that is not always
> the case for some filesystems, but it could be an idea for the future.
> 
> One thing I like about it is that it brings the handling of anonymous page
> uptodateness in line with that of file backed page management, by marking anon
> pages as uptodate when they _are_ uptodate, rather than when our implementation
> requires that they be marked as such. Doing allows us to get rid of the
> smp_wmb's in the page copying functions, which were especially added for
> anonymous pages for an analogous memory ordering problem, and are now handled
> with the same code as the PageUptodate memory ordering problem.
> 
> Introduce a SetNewPageUptodate for these anonymous pages: it contains non
> atomic bitops so as not to introduce too much overhead into these paths.
> 

hrm.

> +static inline void SetNewPageUptodate(struct page *page)
> +{
> +	smp_wmb();
> +	__set_bit(PG_uptodate, &(page)->flags);

argh.  Put the pin back in that thing before you hurt someone.

Sigh.  I guess it's fairly clear but it could do with a big fat warning
over it before you go and kill someone.

Because if this little hand grenade gets used in the wrong place, it will
cause a horrid, horrid data-corrupting bug which might take us literally
years to hunt down and fix.

>  #ifdef CONFIG_S390

> +	page_clear_dirty(page);
> +#endif
> +}


For an overall 0.5% increase in the i386 size of several core mm files.  If
you don't blow us up on the spot, you'll slowly bleed us to death.

Can it be improved?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-22  8:57 ` Andrew Morton
@ 2007-12-22 12:14   ` Hugh Dickins
  2007-12-23  6:54     ` Nick Piggin
  2007-12-23  5:57   ` Nick Piggin
  1 sibling, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2007-12-22 12:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linux Memory Management List, Benjamin Herrenschmidt

On Sat, 22 Dec 2007, Andrew Morton wrote:
> On Tue, 18 Dec 2007 02:26:32 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > After running SetPageUptodate, preceeding stores to the page contents to
> > actually bring it uptodate may not be ordered with the store to set the page
> > uptodate.
> > 
> > Therefore, another CPU which checks PageUptodate is true, then reads the
> > page contents can get stale data.
> > 
> > Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
> > PageUptodate.
> > 
> > Many places that test PageUptodate, do so with the page locked, and this
> > would be enough to ensure memory ordering in those places if SetPageUptodate
> > were only called while the page is locked. Unfortunately that is not always
> > the case for some filesystems, but it could be an idea for the future.
> > 
> > One thing I like about it is that it brings the handling of anonymous page
> > uptodateness in line with that of file backed page management, by marking anon
> > pages as uptodate when they _are_ uptodate, rather than when our implementation
> > requires that they be marked as such.

Nick, you're welcome to make that a separate, less controversial patch,
to send in ahead.  Though I think the last time this came around, I hit
one of your BUGs in testing shmem.c swapout or swapin or swapoff:
something missing there that I've lost the record of - please do
try testing that, maybe it's already fixed this time around.

> > Doing allows us to get rid of the
> > smp_wmb's in the page copying functions, which were especially added for
> > anonymous pages for an analogous memory ordering problem, and are now handled
> > with the same code as the PageUptodate memory ordering problem.
> > 
> > Introduce a SetNewPageUptodate for these anonymous pages: it contains non
> > atomic bitops so as not to introduce too much overhead into these paths.
> > 
> 
> hrm.
> 
> > +static inline void SetNewPageUptodate(struct page *page)
> > +{
> > +	smp_wmb();
> > +	__set_bit(PG_uptodate, &(page)->flags);
> 
> argh.  Put the pin back in that thing before you hurt someone.
> 
> Sigh.  I guess it's fairly clear but it could do with a big fat warning
> over it before you go and kill someone.
> 
> Because if this little hand grenade gets used in the wrong place, it will
> cause a horrid, horrid data-corrupting bug which might take us literally
> years to hunt down and fix.
> 
> >  #ifdef CONFIG_S390
> > +	page_clear_dirty(page);
> > +#endif
> > +}

That's an odd little extract, since page_clear_dirty only does anything
on s390.

> For an overall 0.5% increase in the i386 size of several core mm files.  If
> you don't blow us up on the spot, you'll slowly bleed us to death.
> 
> Can it be improved?

I do wish it could be.

I never find the time to give it the thought it needs; and any criticism
I make is probably unjust, probably patiently answered by Nick on a
previous round.

I'm never convinced that SetPageUptodate is the right place for
this: what's wrong with doing it in those page copying functions?
Or flush_dcache_page?  Don't we need different kinds of barrier
according to how the data got into the page (by DMA or not)?
Doesn't that enter territory discussed down the years between
James Bottomley and Russell King?  Worth CC'ing them the original?

Let me fall silent for a few days...

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-22  8:57 ` Andrew Morton
  2007-12-22 12:14   ` Hugh Dickins
@ 2007-12-23  5:57   ` Nick Piggin
  2007-12-23  6:32     ` Andrew Morton
  1 sibling, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2007-12-23  5:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Linux Memory Management List, Benjamin Herrenschmidt

On Sat, Dec 22, 2007 at 12:57:37AM -0800, Andrew Morton wrote:
> On Tue, 18 Dec 2007 02:26:32 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > After running SetPageUptodate, preceeding stores to the page contents to
> > actually bring it uptodate may not be ordered with the store to set the page
> > uptodate.
> > 
> > Therefore, another CPU which checks PageUptodate is true, then reads the
> > page contents can get stale data.
> > 
> > Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
> > PageUptodate.
> > 
> > Many places that test PageUptodate, do so with the page locked, and this
> > would be enough to ensure memory ordering in those places if SetPageUptodate
> > were only called while the page is locked. Unfortunately that is not always
> > the case for some filesystems, but it could be an idea for the future.
> > 
> > One thing I like about it is that it brings the handling of anonymous page
> > uptodateness in line with that of file backed page management, by marking anon
> > pages as uptodate when they _are_ uptodate, rather than when our implementation
> > requires that they be marked as such. Doing allows us to get rid of the
> > smp_wmb's in the page copying functions, which were especially added for
> > anonymous pages for an analogous memory ordering problem, and are now handled
> > with the same code as the PageUptodate memory ordering problem.
> > 
> > Introduce a SetNewPageUptodate for these anonymous pages: it contains non
> > atomic bitops so as not to introduce too much overhead into these paths.
> > 
> 
> hrm.
> 
> > +static inline void SetNewPageUptodate(struct page *page)
> > +{
> > +	smp_wmb();
> > +	__set_bit(PG_uptodate, &(page)->flags);
> 
> argh.  Put the pin back in that thing before you hurt someone.
> 
> Sigh.  I guess it's fairly clear but it could do with a big fat warning
> over it before you go and kill someone.

Hmm, perhaps it should use the more conventional __SetPageUptodate. I had
named it SetNewPageUptodate in an earlier version of the ptach which was
slightly different.

 
> Because if this little hand grenade gets used in the wrong place, it will
> cause a horrid, horrid data-corrupting bug which might take us literally
> years to hunt down and fix.

Yeah, like the other non-atomic bitops. I guess we just have to make sure
they're used correctly... they really make a big difference to performance.

I guess the memory ordering bug would take years to hunt down and fix too,
OTOH I guess it would probably be the powerpc guys, rather than you or I,
that have to debug it ;)


> >  #ifdef CONFIG_S390
> 
> > +	page_clear_dirty(page);
> > +#endif
> > +}
> 
> 
> For an overall 0.5% increase in the i386 size of several core mm files.  If
> you don't blow us up on the spot, you'll slowly bleed us to death.
> 
> Can it be improved?

At first glance you'd think that, loads being in order on i386, it should be
a noop, but we actually still require a barrier to be technically correct
(even on i386). Which increases the size of some otherwise unchanged files.

Adding a few SetNewPageUptodates adds the rest, I guess. The alternative would
be to have more open coded smp_wmb()s around. I like this way much better.

Given the amount of crap that's "pending", I'd be surprised if I was the one
who bleeds us to death with bugfixes ;) But if you'd rather see some speedups,
I could certainly rustle something up.

Anyway, thanks for picking it up. That's already tripled the amount of feedback
it hsa got ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-23  5:57   ` Nick Piggin
@ 2007-12-23  6:32     ` Andrew Morton
  2007-12-23  7:15       ` Nick Piggin
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2007-12-23  6:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Linux Memory Management List, Benjamin Herrenschmidt

On Sun, 23 Dec 2007 06:57:30 +0100 Nick Piggin <npiggin@suse.de> wrote:

> > > +static inline void SetNewPageUptodate(struct page *page)
> > > +{
> > > +	smp_wmb();
> > > +	__set_bit(PG_uptodate, &(page)->flags);
> > 
> > argh.  Put the pin back in that thing before you hurt someone.
> > 
> > Sigh.  I guess it's fairly clear but it could do with a big fat warning
> > over it before you go and kill someone.
> 
> Hmm, perhaps it should use the more conventional __SetPageUptodate. I had
> named it SetNewPageUptodate in an earlier version of the ptach which was
> slightly different.

It's a death trap.  __GFP_UPTODATE might be safer, dunno.

> ...
>
> > For an overall 0.5% increase in the i386 size of several core mm files.  If
> > you don't blow us up on the spot, you'll slowly bleed us to death.
> > 
> > Can it be improved?
> 
> At first glance you'd think that, loads being in order on i386, it should be
> a noop, but we actually still require a barrier to be technically correct
> (even on i386). Which increases the size of some otherwise unchanged files.
> 
> Adding a few SetNewPageUptodates adds the rest, I guess. The alternative would
> be to have more open coded smp_wmb()s around. I like this way much better.

That's just speculation.  Please find out why such a small patch caused
such a large code size increase and see if it can be fixed.

> Given the amount of crap that's "pending", I'd be surprised if I was the one
> who bleeds us to death with bugfixes ;)

Please consider spending some time reviewing other people's crap.

> But if you'd rather see some speedups,
> I could certainly rustle something up.
> 
> Anyway, thanks for picking it up. That's already tripled the amount of feedback
> it hsa got ;)

Rather than expecting others to review yours (not singling you out - you
just had good timing).

I didn't actually include the patch in -mm due to Hugh's comments.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-22 12:14   ` Hugh Dickins
@ 2007-12-23  6:54     ` Nick Piggin
  0 siblings, 0 replies; 29+ messages in thread
From: Nick Piggin @ 2007-12-23  6:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linux Memory Management List,
	Benjamin Herrenschmidt, linux-arch, rmk, James.Bottomley

On Sat, Dec 22, 2007 at 12:14:42PM +0000, Hugh Dickins wrote:
> On Sat, 22 Dec 2007, Andrew Morton wrote:
> > On Tue, 18 Dec 2007 02:26:32 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > After running SetPageUptodate, preceeding stores to the page contents to
> > > actually bring it uptodate may not be ordered with the store to set the page
> > > uptodate.
> > > 
> > > Therefore, another CPU which checks PageUptodate is true, then reads the
> > > page contents can get stale data.
> > > 
> > > Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
> > > PageUptodate.
> > > 
> > > Many places that test PageUptodate, do so with the page locked, and this
> > > would be enough to ensure memory ordering in those places if SetPageUptodate
> > > were only called while the page is locked. Unfortunately that is not always
> > > the case for some filesystems, but it could be an idea for the future.
> > > 
> > > One thing I like about it is that it brings the handling of anonymous page
> > > uptodateness in line with that of file backed page management, by marking anon
> > > pages as uptodate when they _are_ uptodate, rather than when our implementation
> > > requires that they be marked as such.
> 
> Nick, you're welcome to make that a separate, less controversial patch,
> to send in ahead.  Though I think the last time this came around, I hit
> one of your BUGs in testing shmem.c swapout or swapin or swapoff:
> something missing there that I've lost the record of - please do
> try testing that, maybe it's already fixed this time around.

I've given it some hours in your patented swapping kbuild-on-ext2-on-loop-on-tmpfs
stress testing (including swapoff). Haven't seen a problem as yet (except the tmpfs
swapin deadlock, which I've been patching out).

But if you see anything, please let me know...

> > >  #ifdef CONFIG_S390
> > > +	page_clear_dirty(page);
> > > +#endif
> > > +}
> 
> That's an odd little extract, since page_clear_dirty only does anything
> on s390.

Ah yeah, we could just get rid of the ifdef. Although I don't mind it too much,
as it kind of helps the reader match the other ifdef there...

> > For an overall 0.5% increase in the i386 size of several core mm files.  If
> > you don't blow us up on the spot, you'll slowly bleed us to death.
> > 
> > Can it be improved?
> 
> I do wish it could be.
> 
> I never find the time to give it the thought it needs; and any criticism
> I make is probably unjust, probably patiently answered by Nick on a
> previous round.
> 
> I'm never convinced that SetPageUptodate is the right place for
> this: what's wrong with doing it in those page copying functions?
> Or flush_dcache_page?

There are various places we _could_ do it, but I think PG_uptodate macros
are logically the best, without being too intrusive.

Let me explain. Normally I think the convention would be to open-code the
barriers in the callees (ie. between memset(); SetPageUptodate();, and
if (PageUptodate()) { read from page }).

However I think that would require going through quite a bit of code (including
filesystems) to audit. So I think having them in these macros is pretty
reasonable, and amounts to less thinking required by others.

Why don't I like doing it in page copying functions? Just because there are more
and more varied uses. I can't think of any reasons to rather do it in the page
copying functions, and some reasons against.

flush_dcache_page? Well this bug really is a problem ordering stores to the
page with store to page flags against loads from the same; nothing to do with
cache aliasing. So putting the smp_wmb in flush_dcache_page leaves you without
a natural complement to put the smp_rmb. Although it could be done, I think it
makes it more tangled than having the ordering done in the macros.  We also
only need to order the *initial* stores which bring the page uptodate, rather
than for each store, in the case of flush_dcache_page.

>  Don't we need different kinds of barrier
> according to how the data got into the page (by DMA or not)?

I had thought of that (my previous patch had an XXX: help...) for this
very issue. Without actually knowing what the underlying architecture does,
I "concluded" that it should be done somewhere down at the block layer. I
think it would be silly for the block layer to signal completion if the
results are still incoherent with the CPU cache... but if the experts have
a different opinion, then this needs to be solved with another call anyway
(not in the page uptodate macros and it's not exactly a memory ordering issue).
eg. direct IO reads would have the same DMA cache synchronisation before it
completes to userspace, and this is completely independent of PG_uptodate...

> Doesn't that enter territory discussed down the years between
> James Bottomley and Russell King?  Worth CC'ing them the original?

... but since you bring this up again, I think that would be worthwhile. In
the interest of maintaining this thread I'll just link the original:

http://marc.info/?l=linux-mm&m=119794127303483&w=2

The question is this:

Must read from net/disk/etc into page P.
Device DMAs into P, signals completion

CPU0: handles completion, store to ram to mark P uptodate

CPU0/1: load from ram sees P uptodate, load from P must only see uptodate data

Are we guaranteed to get uptodate data from above the block layer, or do we
need to do anything special?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-23  6:32     ` Andrew Morton
@ 2007-12-23  7:15       ` Nick Piggin
  2007-12-23  7:29         ` Andrew Morton
  2007-12-23 17:22         ` Linus Torvalds
  0 siblings, 2 replies; 29+ messages in thread
From: Nick Piggin @ 2007-12-23  7:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Linux Memory Management List,
	Benjamin Herrenschmidt, Linus Torvalds

On Sat, Dec 22, 2007 at 10:32:34PM -0800, Andrew Morton wrote:
> On Sun, 23 Dec 2007 06:57:30 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > > > +static inline void SetNewPageUptodate(struct page *page)
> > > > +{
> > > > +	smp_wmb();
> > > > +	__set_bit(PG_uptodate, &(page)->flags);
> > > 
> > > argh.  Put the pin back in that thing before you hurt someone.
> > > 
> > > Sigh.  I guess it's fairly clear but it could do with a big fat warning
> > > over it before you go and kill someone.
> > 
> > Hmm, perhaps it should use the more conventional __SetPageUptodate. I had
> > named it SetNewPageUptodate in an earlier version of the ptach which was
> > slightly different.
> 
> It's a death trap.  __GFP_UPTODATE might be safer, dunno.

It's like natural selection for VM developers that don't know their page lifetime
rules ;)

GFP_UPTODATE can't be done, because we're not only talking about zeroed pages
here.


> > > For an overall 0.5% increase in the i386 size of several core mm files.  If
> > > you don't blow us up on the spot, you'll slowly bleed us to death.
> > > 
> > > Can it be improved?
> > 
> > At first glance you'd think that, loads being in order on i386, it should be
> > a noop, but we actually still require a barrier to be technically correct
> > (even on i386). Which increases the size of some otherwise unchanged files.
> > 
> > Adding a few SetNewPageUptodates adds the rest, I guess. The alternative would
> > be to have more open coded smp_wmb()s around. I like this way much better.
> 
> That's just speculation.  Please find out why such a small patch caused
> such a large code size increase and see if it can be fixed.

It's not actually increasing size by that much here... hmm, do you have
CONFIG_X86_PPRO_FENCE defined, by any chance?

It looks like this gets defined by default for i386, and also probably for
distro configs. Linus? This is a fairly heavy hammer for such an unlikely bug on
such a small number of systems (that admittedly doesn't even fix the bug in all
cases anyway). It's not only heavy for my proposed patch, but it also halves the
speed of spinlocks. Can we have some special config option for this instead? 


> > Given the amount of crap that's "pending", I'd be surprised if I was the one
> > who bleeds us to death with bugfixes ;)
> 
> Please consider spending some time reviewing other people's crap.

Fixing existing bugs is important too. Granted this doesn't seem to be a pressing
issue, but it's a bug. But I do review too.

 
> > But if you'd rather see some speedups,
> > I could certainly rustle something up.
> > 
> > Anyway, thanks for picking it up. That's already tripled the amount of feedback
> > it hsa got ;)
> 
> Rather than expecting others to review yours (not singling you out - you
> just had good timing).

By crap, I wasn't referring to bugfixes. But if you have any bugfixes in my area
in -mm that need review, I can take a look, sure. Actually I always try to read
mm/ bugfixes that go by, though I often don't have anything extra to say.


> I didn't actually include the patch in -mm due to Hugh's comments.

How about my counter-comments? :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-23  7:15       ` Nick Piggin
@ 2007-12-23  7:29         ` Andrew Morton
  2007-12-23  9:14           ` Nick Piggin
  2007-12-23 17:22         ` Linus Torvalds
  1 sibling, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2007-12-23  7:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Linux Memory Management List,
	Benjamin Herrenschmidt, Linus Torvalds

On Sun, 23 Dec 2007 08:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:

> > That's just speculation.  Please find out why such a small patch caused
> > such a large code size increase and see if it can be fixed.
> 
> It's not actually increasing size by that much here... hmm, do you have
> CONFIG_X86_PPRO_FENCE defined, by any chance?

I expect it was just allmodconfig, so: yes

It's a quite repeatable experiment though.

> It looks like this gets defined by default for i386, and also probably for
> distro configs. Linus? This is a fairly heavy hammer for such an unlikely bug on
> such a small number of systems (that admittedly doesn't even fix the bug in all
> cases anyway). It's not only heavy for my proposed patch, but it also halves the
> speed of spinlocks. Can we have some special config option for this instead? 

Sounds worthwhile, if we can't do it via altinstructions.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-23  7:29         ` Andrew Morton
@ 2007-12-23  9:14           ` Nick Piggin
  2007-12-23  9:28             ` Andrew Morton
  2007-12-30 16:33             ` Ingo Molnar
  0 siblings, 2 replies; 29+ messages in thread
From: Nick Piggin @ 2007-12-23  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Linux Memory Management List,
	Benjamin Herrenschmidt, Linus Torvalds, Andi Kleen, Ingo Molnar,
	Alan Cox

On Sat, Dec 22, 2007 at 11:29:32PM -0800, Andrew Morton wrote:
> On Sun, 23 Dec 2007 08:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > > That's just speculation.  Please find out why such a small patch caused
> > > such a large code size increase and see if it can be fixed.
> > 
> > It's not actually increasing size by that much here... hmm, do you have
> > CONFIG_X86_PPRO_FENCE defined, by any chance?
> 
> I expect it was just allmodconfig, so: yes
> 
> It's a quite repeatable experiment though.

Yep. I only saw half the text size blot that you did, but it was still quite a bit.

i386 allmodconf: size mm/built-in.o
           text    data     bss     dec     hex         text ratio
vanilla: 163082   20372   40120  223574   36956          100.00%
bugfix : 163509   20372   40120  224001   36b01            0.26%
noppro : 162191   20372   40120  222683   365db         -  0.55%
both   : 162267   20372   40120  222759   36627         -  0.50% (+0.05% vs noppro)

So with the ppro memory ordering bug out of the way, the PG_uptodate fix
only adds 76 bytes of text.


> > It looks like this gets defined by default for i386, and also probably for
> > distro configs. Linus? This is a fairly heavy hammer for such an unlikely bug on
> > such a small number of systems (that admittedly doesn't even fix the bug in all
> > cases anyway). It's not only heavy for my proposed patch, but it also halves the
> > speed of spinlocks. Can we have some special config option for this instead? 
> 
> Sounds worthwhile, if we can't do it via altinstructions.

Altinstructions means we still have code bloat, and sometimes extra branches
etc (an extra 900 bytes of icache in mm/ alone, even before my fix). I'll let
Linus or one of the x86 guys weigh in, though. It's a really sad cost for
distro kernels to carry.

---

Index: linux-2.6/arch/x86/Kconfig.cpu
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig.cpu
+++ linux-2.6/arch/x86/Kconfig.cpu
@@ -322,9 +322,18 @@ config X86_XADD
 	default y
 
 config X86_PPRO_FENCE
-	bool
+	bool "PentiumPro memory ordering errata workaround"
 	depends on M686 || M586MMX || M586TSC || M586 || M486 || M386 || MGEODEGX1
-	default y
+	default n
+	help
+	  Old PentiumPro multiprocessor systems had errata that could cause memory
+	  operations to violate the x86 ordering standard in rare cases. Enabling this
+	  option will attempt to work around some (but not all) occurances of
+	  this problem, at the cost of much heavier spinlock and memory barrier
+	  operations.
+
+	  If unsure, say n here. Even distro kernels should think twice before enabling
+	  this: there are few systems, and an unlikely bug.
 
 config X86_F00F_BUG
 	bool

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-23  9:14           ` Nick Piggin
@ 2007-12-23  9:28             ` Andrew Morton
  2007-12-23 16:02               ` Andi Kleen
  2007-12-30 16:33             ` Ingo Molnar
  1 sibling, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2007-12-23  9:28 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Linux Memory Management List,
	Benjamin Herrenschmidt, Linus Torvalds, Andi Kleen, Ingo Molnar,
	Alan Cox

On Sun, 23 Dec 2007 10:14:05 +0100 Nick Piggin <npiggin@suse.de> wrote:

>  config X86_PPRO_FENCE
> -	bool
> +	bool "PentiumPro memory ordering errata workaround"
>  	depends on M686 || M586MMX || M586TSC || M586 || M486 || M386 || MGEODEGX1
> -	default y
> +	default n
> +	help
> +	  Old PentiumPro multiprocessor systems had errata that could cause memory
> +	  operations to violate the x86 ordering standard in rare cases. Enabling this
> +	  option will attempt to work around some (but not all) occurances of
> +	  this problem, at the cost of much heavier spinlock and memory barrier
> +	  operations.
> +
> +	  If unsure, say n here. Even distro kernels should think twice before enabling
> +	  this: there are few systems, and an unlikely bug.
>  

I think if we're going to do this then we should add a runtime check for the
offending CPU then do panic("your kernel config ain't right").

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-23  9:28             ` Andrew Morton
@ 2007-12-23 16:02               ` Andi Kleen
  0 siblings, 0 replies; 29+ messages in thread
From: Andi Kleen @ 2007-12-23 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Hugh Dickins, Linux Memory Management List,
	Benjamin Herrenschmidt, Linus Torvalds, Andi Kleen, Ingo Molnar,
	Alan Cox

Attacking that has been on my todo list forever.

I think the right way would be to define a separate Config symbol outside
the normal CPU list for this case

CONFIG_X86_BROKEN_PPRO 

or similar with Kconfig describing that it will have a large .text overhead.
Then distributions can chose to not set it.
 
> I think if we're going to do this then we should add a runtime check for the
> offending CPU then do panic("your kernel config ain't right").

Panic is not needed, it is sufficient to force the system to one 
CPU (= set cpu_possible_map to { 1 } ) and a warning to suggest
enabling CONFOG_X86_BROKEN_PPRO

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-23  7:15       ` Nick Piggin
  2007-12-23  7:29         ` Andrew Morton
@ 2007-12-23 17:22         ` Linus Torvalds
  2007-12-23 21:35           ` Nick Piggin
                             ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: Linus Torvalds @ 2007-12-23 17:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Hugh Dickins, Linux Memory Management List,
	Benjamin Herrenschmidt, Alan Cox

On Sun, 23 Dec 2007, Nick Piggin wrote:
> 
> It's not actually increasing size by that much here... hmm, do you have
> CONFIG_X86_PPRO_FENCE defined, by any chance?
> 
> It looks like this gets defined by default for i386, and also probably for
> distro configs. Linus? This is a fairly heavy hammer for such an unlikely bug on
> such a small number of systems (that admittedly doesn't even fix the bug in all
> cases anyway). It's not only heavy for my proposed patch, but it also halves the
> speed of spinlocks. Can we have some special config option for this instead? 

A special config option isn't great, since vendors would probably then 
enable it for those old P6's.

But maybe an "alternative()" thing that depends on a CPU capability?

Of course, it definitely *is* true that the number of CPU's that have that 
bug _and_ are actually used in SMP environments is probably vanishingly 
small. So maybe even vendors don't really care any more, and we could make 
the PPRO_FENCE thing a thing of the past.

There's actually a few different PPro errata. There's #51, which is an IO 
ordering thing, and can happen on UP too. There's #66, which breaks CPU 
ordering, and is SMP-only (and which is probably at least *mostly* fixed 
by PPRO_FENCE), and there is #92 which can cause cache incoherency and 
where PPRO_FENCE *may* indirectly help.

We could decide to just ignore all of them, or perhaps ignore all but #51. 
I think Alan still has an old four-way PPro hidden away somewhere, but 
he's probably one of the few people who could even *test* this thing.

Or we could make PPRO_FENCE just be something that you have to enable by 
hand, rather than enabling it automatically for a set of CPU's. That way 
Alan can add it to his config, and everybody else can clear it if they 
decide that they don't have one of the old 200MHz PPro's any more..

				Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-23 17:22         ` Linus Torvalds
@ 2007-12-23 21:35           ` Nick Piggin
  2007-12-23 22:41           ` Nick Piggin
  2008-01-01 23:41           ` Alan Cox
  2 siblings, 0 replies; 29+ messages in thread
From: Nick Piggin @ 2007-12-23 21:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Hugh Dickins, Linux Memory Management List,
	Benjamin Herrenschmidt, Alan Cox

On Sun, Dec 23, 2007 at 09:22:17AM -0800, Linus Torvalds wrote:
> 
> 
> On Sun, 23 Dec 2007, Nick Piggin wrote:
> > 
> > It's not actually increasing size by that much here... hmm, do you have
> > CONFIG_X86_PPRO_FENCE defined, by any chance?
> > 
> > It looks like this gets defined by default for i386, and also probably for
> > distro configs. Linus? This is a fairly heavy hammer for such an unlikely bug on
> > such a small number of systems (that admittedly doesn't even fix the bug in all
> > cases anyway). It's not only heavy for my proposed patch, but it also halves the
> > speed of spinlocks. Can we have some special config option for this instead? 
> 
> A special config option isn't great, since vendors would probably then 
> enable it for those old P6's.
> 
> But maybe an "alternative()" thing that depends on a CPU capability?
> 
> Of course, it definitely *is* true that the number of CPU's that have that 
> bug _and_ are actually used in SMP environments is probably vanishingly 
> small. So maybe even vendors don't really care any more, and we could make 
> the PPRO_FENCE thing a thing of the past.
> 
> There's actually a few different PPro errata. There's #51, which is an IO 
> ordering thing, and can happen on UP too. There's #66, which breaks CPU 
> ordering, and is SMP-only (and which is probably at least *mostly* fixed 
> by PPRO_FENCE), and there is #92 which can cause cache incoherency and 
> where PPRO_FENCE *may* indirectly help.
> 
> We could decide to just ignore all of them, or perhaps ignore all but #51. 
> I think Alan still has an old four-way PPro hidden away somewhere, but 
> he's probably one of the few people who could even *test* this thing.
> 
> Or we could make PPRO_FENCE just be something that you have to enable by 
> hand, rather than enabling it automatically for a set of CPU's. That way 
> Alan can add it to his config, and everybody else can clear it if they 
> decide that they don't have one of the old 200MHz PPro's any more..

Yes I would prefer it to be something you can turn off completely. I think
having alternative() would be slightly overkill, and it still costs a fair
bit of icache. We could take Andi's good idea of printing a warning and
enabling just a single CPU, which would probably be reasonable defence
against #66 and #91.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-23 17:22         ` Linus Torvalds
  2007-12-23 21:35           ` Nick Piggin
@ 2007-12-23 22:41           ` Nick Piggin
  2008-01-01 23:41           ` Alan Cox
  2 siblings, 0 replies; 29+ messages in thread
From: Nick Piggin @ 2007-12-23 22:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Hugh Dickins, Linux Memory Management List,
	Benjamin Herrenschmidt, Alan Cox

On Sun, Dec 23, 2007 at 09:22:17AM -0800, Linus Torvalds wrote:
> 
> There's actually a few different PPro errata. There's #51, which is an IO 
> ordering thing, and can happen on UP too. There's #66, which breaks CPU 
> ordering, and is SMP-only (and which is probably at least *mostly* fixed 
> by PPRO_FENCE), and there is #92 which can cause cache incoherency and 
> where PPRO_FENCE *may* indirectly help.

BTW. #66 seems to be an issue where a CPU may see the wrong results from a
RAW, if another CPU has written to the cacheline in the meantime. #92 is a
data loss bug (although it said they couldn't reproduce it on real hardware).

It says they're actually not a problem if semaphore operations are used to
protect the data. However a) it is becoming increasingly common that we don't
do that (eg. with lockless operations), and b) I don't know how the case of
false sharing in a cacheline can be safe.

Anyway, I think it is very rare, even on those two systems (one being in
Alan's basement) that run Linux... The number of cycles everybody else loses
in spin_unlock combined far outweighs the number of cycles these additional
CPUs are going to sit idle :)


> We could decide to just ignore all of them, or perhaps ignore all but #51. 
> I think Alan still has an old four-way PPro hidden away somewhere, but 
> he's probably one of the few people who could even *test* this thing.

This patch uses both your and Andi's ideas... Untested though.

X86_PPRO_FENCE means we might encounter these systems, so workaround #51, and
disable multiple cpus... unless X86_PPRO_FENCE_SMP, which includes the workarounds
for #66 and #92.

---

Index: linux-2.6/arch/x86/Kconfig.cpu
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig.cpu
+++ linux-2.6/arch/x86/Kconfig.cpu
@@ -321,7 +321,7 @@ config X86_XADD
 	depends on X86_32 && !M386
 	default y
 
-config X86_PPRO_FENCE
+config X86_BROKEN_PPRO
 	bool
 	depends on M686 || M586MMX || M586TSC || M586 || M486 || M386 || MGEODEGX1
 	default y
Index: linux-2.6/include/asm-x86/io_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/io_32.h
+++ linux-2.6/include/asm-x86/io_32.h
@@ -235,7 +235,7 @@ memcpy_toio(volatile void __iomem *dst, 
  *	2. Accidentally out of order processors (PPro errata #51)
  */
  
-#if defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE)
+#if defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_BROKEN_PPRO)
 
 static inline void flush_write_buffers(void)
 {
Index: linux-2.6/include/asm-x86/spinlock_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/spinlock_32.h
+++ linux-2.6/include/asm-x86/spinlock_32.h
@@ -101,7 +101,7 @@ static inline int __raw_spin_trylock(raw
  * (PPro errata 66, 92)
  */
 
-#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_BROKEN_PPRO_SMP)
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 {
Index: linux-2.6/arch/x86/kernel/cpu/intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/intel.c
+++ linux-2.6/arch/x86/kernel/cpu/intel.c
@@ -108,6 +108,23 @@ static void __cpuinit trap_init_f00f_bug
 }
 #endif
 
+/*
+ * Errata #66, #92
+ */
+static void __cpuinit ppro_smp_fence_bug(void)
+{
+#if defined(CONFIG_SMP) && defined(CONFIG_X86_BROKEN_PPRO) && !defined(CONFIG_X86_BROKEN_PPRO_SMP)
+	extern unsigned int maxcpus;
+
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+	    boot_cpu_data.x86 == 6 &&
+	    boot_cpu_data.x86_model == 1) {
+		printk(KERN_INFO "Pentium Pro with Errata#66, #92 detected. Limiting maxcpus to 1. Enable CONFIG_X86_BROKEN_PPRO_SMP to run with multiple CPUs\n");
+		maxcpus = 1;
+	}
+#endif
+}
+
 static void __cpuinit init_intel(struct cpuinfo_x86 *c)
 {
 	unsigned int l2 = 0;
@@ -132,6 +149,8 @@ static void __cpuinit init_intel(struct 
 	}
 #endif
 
+	ppro_smp_fence_bug();
+
 	select_idle_routine(c);
 	l2 = init_intel_cacheinfo(c);
 	if (c->cpuid_level > 9 ) {
Index: linux-2.6/include/asm-x86/system_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/system_32.h
+++ linux-2.6/include/asm-x86/system_32.h
@@ -279,7 +279,7 @@ static inline unsigned long get_limit(un
 
 #ifdef CONFIG_SMP
 #define smp_mb()	mb()
-#ifdef CONFIG_X86_PPRO_FENCE
+#ifdef CONFIG_X86_BROKEN_PPRO_SMP
 # define smp_rmb()	rmb()
 #else
 # define smp_rmb()	barrier()
Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig
+++ linux-2.6/arch/x86/Kconfig
@@ -378,6 +378,22 @@ config ES7000_CLUSTERED_APIC
 
 source "arch/x86/Kconfig.cpu"
 
+config X86_BROKEN_PPRO_SMP
+	bool "PentiumPro memory ordering errata workaround"
+	depends on X86_BROKEN_PPRO && SMP
+	default n
+	help
+	  Old PentiumPro multiprocessor systems had errata that could cause memory
+	  operations to violate the x86 ordering standard in rare cases. Enabling this
+	  option will attempt to work around some (but not all) occurances of these
+	  problems, at the cost of much heavier spinlock and memory barrier operations.
+
+	  If you say N here, these systems will be detected and limited to a single CPU
+	  at boot time.
+
+	  If unsure, say N here. Even distro kernels should think twice before enabling
+	  this: there are few systems, and an unlikely bug.
+
 config HPET_TIMER
 	bool
 	prompt "HPET Timer Support" if X86_32

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-23  9:14           ` Nick Piggin
  2007-12-23  9:28             ` Andrew Morton
@ 2007-12-30 16:33             ` Ingo Molnar
  2008-01-01 23:26               ` Nick Piggin
  1 sibling, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2007-12-30 16:33 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Hugh Dickins, Linux Memory Management List,
	Benjamin Herrenschmidt, Linus Torvalds, Andi Kleen, Alan Cox

* Nick Piggin <npiggin@suse.de> wrote:

> > Sounds worthwhile, if we can't do it via altinstructions.
> 
> Altinstructions means we still have code bloat, and sometimes extra 
> branches etc (an extra 900 bytes of icache in mm/ alone, even before 
> my fix). I'll let Linus or one of the x86 guys weigh in, though. It's 
> a really sad cost for distro kernels to carry.

hm, we should at minimum display a warning if the workaround is not 
enabled and such a kernel is booted on a true PPro that is affected by 
this.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-30 16:33             ` Ingo Molnar
@ 2008-01-01 23:26               ` Nick Piggin
  2008-01-02 21:01                 ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2008-01-01 23:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Hugh Dickins, Linux Memory Management List,
	Benjamin Herrenschmidt, Linus Torvalds, Andi Kleen, Alan Cox

On Sun, Dec 30, 2007 at 05:33:15PM +0100, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > > Sounds worthwhile, if we can't do it via altinstructions.
> > 
> > Altinstructions means we still have code bloat, and sometimes extra 
> > branches etc (an extra 900 bytes of icache in mm/ alone, even before 
> > my fix). I'll let Linus or one of the x86 guys weigh in, though. It's 
> > a really sad cost for distro kernels to carry.
> 
> hm, we should at minimum display a warning if the workaround is not 
> enabled and such a kernel is booted on a true PPro that is affected by 
> this.

The patch does have the warning:
  printk(KERN_INFO "Pentium Pro with Errata#66, #92 detected. Limiting maxcpus to 1.
         Enable CONFIG_X86_BROKEN_PPRO_SMP to run with multiple CPUs\n");

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2007-12-23 17:22         ` Linus Torvalds
  2007-12-23 21:35           ` Nick Piggin
  2007-12-23 22:41           ` Nick Piggin
@ 2008-01-01 23:41           ` Alan Cox
  2008-01-02 11:02             ` [patch] i386: avoid expensive ppro ordering workaround for default 686 kernels Nick Piggin
  2 siblings, 1 reply; 29+ messages in thread
From: Alan Cox @ 2008-01-01 23:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andrew Morton, Hugh Dickins,
	Linux Memory Management List, Benjamin Herrenschmidt

On Sun, 23 Dec 2007 09:22:17 -0800 (PST)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Sun, 23 Dec 2007, Nick Piggin wrote:
> > 
> > It's not actually increasing size by that much here... hmm, do you have
> > CONFIG_X86_PPRO_FENCE defined, by any chance?
> > 
> > It looks like this gets defined by default for i386, and also probably for
> > distro configs. Linus? This is a fairly heavy hammer for such an unlikely bug on
> > such a small number of systems (that admittedly doesn't even fix the bug in all
> > cases anyway). It's not only heavy for my proposed patch, but it also halves the
> > speed of spinlocks. Can we have some special config option for this instead? 
> 
> A special config option isn't great, since vendors would probably then 
> enable it for those old P6's.
> 
> But maybe an "alternative()" thing that depends on a CPU capability?
> Of course, it definitely *is* true that the number of CPU's that have that 
> bug _and_ are actually used in SMP environments is probably vanishingly 
> small. So maybe even vendors don't really care any more, and we could make 
> the PPRO_FENCE thing a thing of the past.

If the PPro fencing isn't built for SMP kernels for set for CPU of
Pentium II or greater then nobody is going to care. The only reasons to
build distro support for older processors is

	-	VIA C3/C5 to work around the gcc 686 cmov bug
	-	Geode (embedded and OLPC)

neither of which are exactly multiprocessor, and the VIA stuff can be
handled by beating up the gcc options. I also doubt PPro will be terribly
high on anyones enterprise product line list for the next generation of
enterprise distributions.

Alan

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch] i386: avoid expensive ppro ordering workaround for default 686 kernels
  2008-01-01 23:41           ` Alan Cox
@ 2008-01-02 11:02             ` Nick Piggin
  2008-01-02 13:44               ` Alan Cox
  0 siblings, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2008-01-02 11:02 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Andrew Morton, Hugh Dickins,
	Linux Memory Management List, Benjamin Herrenschmidt

On Tue, Jan 01, 2008 at 11:41:33PM +0000, Alan Cox wrote:
> On Sun, 23 Dec 2007 09:22:17 -0800 (PST)
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > 
> > 
> > On Sun, 23 Dec 2007, Nick Piggin wrote:
> > > 
> > > It's not actually increasing size by that much here... hmm, do you have
> > > CONFIG_X86_PPRO_FENCE defined, by any chance?
> > > 
> > > It looks like this gets defined by default for i386, and also probably for
> > > distro configs. Linus? This is a fairly heavy hammer for such an unlikely bug on
> > > such a small number of systems (that admittedly doesn't even fix the bug in all
> > > cases anyway). It's not only heavy for my proposed patch, but it also halves the
> > > speed of spinlocks. Can we have some special config option for this instead? 
> > 
> > A special config option isn't great, since vendors would probably then 
> > enable it for those old P6's.
> > 
> > But maybe an "alternative()" thing that depends on a CPU capability?
> > Of course, it definitely *is* true that the number of CPU's that have that 
> > bug _and_ are actually used in SMP environments is probably vanishingly 
> > small. So maybe even vendors don't really care any more, and we could make 
> > the PPRO_FENCE thing a thing of the past.
> 
> If the PPro fencing isn't built for SMP kernels for set for CPU of
> Pentium II or greater then nobody is going to care. The only reasons to
> build distro support for older processors is

SLES10 seems to build for M586 by default, and I think it would be very
reasonable to expect M686 builds to be common in future. Even for SMP
kernels -- maybe we'd see fewer special case UP kernels with better kernel
support and more common multicore chips.
 
What's more, it might give me a chance of getting this uninitialised-data
-leaking, data-corrupting bugfix past Andrew's allyesconfig "sanity" check ;)


> 	-	VIA C3/C5 to work around the gcc 686 cmov bug
> 	-	Geode (embedded and OLPC)
> 
> neither of which are exactly multiprocessor, and the VIA stuff can be
> handled by beating up the gcc options. I also doubt PPro will be terribly
> high on anyones enterprise product line list for the next generation of
> enterprise distributions.

And that's also exactly why this patch is also pretty reasonable. Ingo, please
apply? (have changed the info message to a warning).

--
The selection of many CPU architecture families causes pentium pro memory ordering
errata workarounds to be enabled. This causes memory barriers and spinlocks to become
much more expensive, just to provide a few hacks for a very rare (nowadays)
class of system.

Take a different approach: after this patch, we just disable all but one CPU on those
systems, and print a warning. Also printed is a suggestion for a new CONFIG option that
can be enabled for the previous behaviour.

This is a big hammer for those few smp ppro systems, but it removes the big hammer that
was there for everyone else. It is also arguably the most correct way to work around the
problems -- from the description of at least one errata, loss of cache coherency, could
cause problems when particular classes of lockless data accesses are used (even with the
existing workarounds in place).

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/arch/x86/Kconfig.cpu
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig.cpu
+++ linux-2.6/arch/x86/Kconfig.cpu
@@ -321,7 +321,7 @@ config X86_XADD
 	depends on X86_32 && !M386
 	default y
 
-config X86_PPRO_FENCE
+config X86_BROKEN_PPRO
 	bool
 	depends on M686 || M586MMX || M586TSC || M586 || M486 || M386 || MGEODEGX1
 	default y
Index: linux-2.6/include/asm-x86/io_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/io_32.h
+++ linux-2.6/include/asm-x86/io_32.h
@@ -235,7 +235,7 @@ memcpy_toio(volatile void __iomem *dst, 
  *	2. Accidentally out of order processors (PPro errata #51)
  */
  
-#if defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE)
+#if defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_BROKEN_PPRO)
 
 static inline void flush_write_buffers(void)
 {
Index: linux-2.6/include/asm-x86/spinlock_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/spinlock_32.h
+++ linux-2.6/include/asm-x86/spinlock_32.h
@@ -101,7 +101,7 @@ static inline int __raw_spin_trylock(raw
  * (PPro errata 66, 92)
  */
 
-#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_BROKEN_PPRO_SMP)
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 {
Index: linux-2.6/arch/x86/kernel/cpu/intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/intel.c
+++ linux-2.6/arch/x86/kernel/cpu/intel.c
@@ -108,6 +108,23 @@ static void __cpuinit trap_init_f00f_bug
 }
 #endif
 
+/*
+ * Errata #66, #92
+ */
+static void __cpuinit ppro_smp_fence_bug(void)
+{
+#if defined(CONFIG_SMP) && defined(CONFIG_X86_BROKEN_PPRO) && !defined(CONFIG_X86_BROKEN_PPRO_SMP)
+	extern unsigned int maxcpus;
+
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+	    boot_cpu_data.x86 == 6 &&
+	    boot_cpu_data.x86_model == 1) {
+		printk(KERN_WARNING "WARNING: Pentium Pro with Errata#66, #92 detected. Limiting maxcpus to 1. Enable CONFIG_X86_BROKEN_PPRO_SMP to run with multiple CPUs\n");
+		maxcpus = 1;
+	}
+#endif
+}
+
 static void __cpuinit init_intel(struct cpuinfo_x86 *c)
 {
 	unsigned int l2 = 0;
@@ -132,6 +149,8 @@ static void __cpuinit init_intel(struct 
 	}
 #endif
 
+	ppro_smp_fence_bug();
+
 	select_idle_routine(c);
 	l2 = init_intel_cacheinfo(c);
 	if (c->cpuid_level > 9 ) {
Index: linux-2.6/include/asm-x86/system_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/system_32.h
+++ linux-2.6/include/asm-x86/system_32.h
@@ -279,7 +279,7 @@ static inline unsigned long get_limit(un
 
 #ifdef CONFIG_SMP
 #define smp_mb()	mb()
-#ifdef CONFIG_X86_PPRO_FENCE
+#ifdef CONFIG_X86_BROKEN_PPRO_SMP
 # define smp_rmb()	rmb()
 #else
 # define smp_rmb()	barrier()
Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig
+++ linux-2.6/arch/x86/Kconfig
@@ -378,6 +378,22 @@ config ES7000_CLUSTERED_APIC
 
 source "arch/x86/Kconfig.cpu"
 
+config X86_BROKEN_PPRO_SMP
+	bool "PentiumPro memory ordering errata workaround"
+	depends on X86_BROKEN_PPRO && SMP
+	default n
+	help
+	  Old PentiumPro multiprocessor systems had errata that could cause memory
+	  operations to violate the x86 ordering standard in rare cases. Enabling this
+	  option will attempt to work around some (but not all) occurances of these
+	  problems, at the cost of much heavier spinlock and memory barrier operations.
+
+	  If you say N here, these systems will be detected and limited to a single CPU
+	  at boot time.
+
+	  If unsure, say N here. Even distro kernels should think twice before enabling
+	  this: there are few systems, and an unlikely bug.
+
 config HPET_TIMER
 	bool
 	prompt "HPET Timer Support" if X86_32


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] i386: avoid expensive ppro ordering workaround for default 686 kernels
  2008-01-02 11:02             ` [patch] i386: avoid expensive ppro ordering workaround for default 686 kernels Nick Piggin
@ 2008-01-02 13:44               ` Alan Cox
  2008-01-03  4:17                 ` Nick Piggin
  0 siblings, 1 reply; 29+ messages in thread
From: Alan Cox @ 2008-01-02 13:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Hugh Dickins,
	Linux Memory Management List, Benjamin Herrenschmidt

> Take a different approach: after this patch, we just disable all but one CPU on those
> systems, and print a warning. Also printed is a suggestion for a new CONFIG option that
> can be enabled for the previous behaviour.

How does that help. The processor isn't the only bus master.

Maybe this works as a SuSE specific convenience solution aligned to
your particular build pattern but it isn't the right solution for
upstream IMHO. 

We should either

- re-order the assumed processor generations supported to put VIA C3/C5
above Preventium Pro
- fix the gcc or gcc settings not to generate invalid cmov instructions
on 686. cmov is slower on all the modern processors anyway.

And you change the assumption that 586 < 686 < PPro < PII < PIII ...

to 586 < 686 < PPro < C3 < PII < ...

then selecting VIA C3 support will get you a kernel with the properties
all the distribution vendors want for their higher end mainstream kernel -
"runs on modern systems".

Alan

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2008-01-01 23:26               ` Nick Piggin
@ 2008-01-02 21:01                 ` Andi Kleen
  2008-01-03  3:32                   ` Nick Piggin
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2008-01-02 21:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Andrew Morton, Hugh Dickins,
	Linux Memory Management List, Benjamin Herrenschmidt,
	Linus Torvalds, Alan Cox

On Wednesday 02 January 2008 00:26:34 Nick Piggin wrote:
> On Sun, Dec 30, 2007 at 05:33:15PM +0100, Ingo Molnar wrote:
> > 
> > * Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > > Sounds worthwhile, if we can't do it via altinstructions.
> > > 
> > > Altinstructions means we still have code bloat, and sometimes extra 
> > > branches etc (an extra 900 bytes of icache in mm/ alone, even before 
> > > my fix). I'll let Linus or one of the x86 guys weigh in, though. It's 
> > > a really sad cost for distro kernels to carry.
> > 
> > hm, we should at minimum display a warning if the workaround is not 
> > enabled and such a kernel is booted on a true PPro that is affected by 
> > this.
> 
> The patch does have the warning:
>   printk(KERN_INFO "Pentium Pro with Errata#66, #92 detected. Limiting maxcpus to 1.
>          Enable CONFIG_X86_BROKEN_PPRO_SMP to run with multiple CPUs\n");

Haven't seen the full patch, but the printk suggest you're changing the max_cpus 
variable. That is not 100% safe because user space could hot plug CPUs later
using sysfs. The only safe way would be to limit cpu_possible_map

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2008-01-02 21:01                 ` Andi Kleen
@ 2008-01-03  3:32                   ` Nick Piggin
  2008-01-03 13:08                     ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2008-01-03  3:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Andrew Morton, Hugh Dickins,
	Linux Memory Management List, Benjamin Herrenschmidt,
	Linus Torvalds, Alan Cox

On Wed, Jan 02, 2008 at 10:01:27PM +0100, Andi Kleen wrote:
> On Wednesday 02 January 2008 00:26:34 Nick Piggin wrote:
> > On Sun, Dec 30, 2007 at 05:33:15PM +0100, Ingo Molnar wrote:
> > > 
> > > * Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > > > Sounds worthwhile, if we can't do it via altinstructions.
> > > > 
> > > > Altinstructions means we still have code bloat, and sometimes extra 
> > > > branches etc (an extra 900 bytes of icache in mm/ alone, even before 
> > > > my fix). I'll let Linus or one of the x86 guys weigh in, though. It's 
> > > > a really sad cost for distro kernels to carry.
> > > 
> > > hm, we should at minimum display a warning if the workaround is not 
> > > enabled and such a kernel is booted on a true PPro that is affected by 
> > > this.
> > 
> > The patch does have the warning:
> >   printk(KERN_INFO "Pentium Pro with Errata#66, #92 detected. Limiting maxcpus to 1.
> >          Enable CONFIG_X86_BROKEN_PPRO_SMP to run with multiple CPUs\n");
> 
> Haven't seen the full patch, but the printk suggest you're changing the max_cpus 
> variable. That is not 100% safe because user space could hot plug CPUs later
> using sysfs. The only safe way would be to limit cpu_possible_map

Hmm, but I did want to allow it to be overridden via maxcpus= command line (or
hotplug I guess, I hadn't thought of the hotplug case but it allows runtime
override).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] i386: avoid expensive ppro ordering workaround for default 686 kernels
  2008-01-02 13:44               ` Alan Cox
@ 2008-01-03  4:17                 ` Nick Piggin
  2008-01-03 14:23                   ` Alan Cox
  0 siblings, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2008-01-03  4:17 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Andrew Morton, Hugh Dickins,
	Linux Memory Management List, Benjamin Herrenschmidt

On Wed, Jan 02, 2008 at 01:44:33PM +0000, Alan Cox wrote:
> > Take a different approach: after this patch, we just disable all but one CPU on those
> > systems, and print a warning. Also printed is a suggestion for a new CONFIG option that
> > can be enabled for the previous behaviour.
> 
> How does that help. The processor isn't the only bus master.

Hmm, I don't understand what you mean. Obviously other busmasters aren't
participating in any locking or smp_*mb() ordering protocols.

The non-smp_-prefixed barriers are retained.

> Maybe this works as a SuSE specific convenience solution aligned to
> your particular build pattern but it isn't the right solution for
> upstream IMHO. 

Actually it is nothing to do with SUSE but I was using SLES as a counterexample
when you said nobody would care about M686 with SMP builds. The reason for the
patch is because I noticed the suboptimal configuration (because Andrew flagged
my patch).

> We should either
> 
> - re-order the assumed processor generations supported to put VIA C3/C5
> above Preventium Pro

Adrian Bunk's patch to make each CPU type explicitly selectable IMO is the
best way to do this.

> - fix the gcc or gcc settings not to generate invalid cmov instructions
> on 686. cmov is slower on all the modern processors anyway.

Not for unpredictable branches, but I agree gcc seems to use it too often
(in my version, it seems to even use it for likely/unlikely branches :( ).

> And you change the assumption that 586 < 686 < PPro < PII < PIII ...
> 
> to 586 < 686 < PPro < C3 < PII < ...
> 
> then selecting VIA C3 support will get you a kernel with the properties
> all the distribution vendors want for their higher end mainstream kernel -
> "runs on modern systems".

I don't agree. If we support those options, we should support them properly.
And if you build an SMP kernel for a 586 for example, you should not get lumped
with those pentiumpro workarounds.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm: fix PageUptodate memory ordering bug
  2008-01-03  3:32                   ` Nick Piggin
@ 2008-01-03 13:08                     ` Andi Kleen
  0 siblings, 0 replies; 29+ messages in thread
From: Andi Kleen @ 2008-01-03 13:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Andrew Morton, Hugh Dickins,
	Linux Memory Management List, Benjamin Herrenschmidt,
	Linus Torvalds, Alan Cox

> Hmm, but I did want to allow it to be overridden via maxcpus= command line (or
> hotplug I guess, I hadn't thought of the hotplug case but it allows runtime
> override).

In theory some user space could be set up to always boot with maxcpus=1 and then
only hotplug CPUs later (e.g. might make sense to get faster initial booting on systems
with a lot of CPUs) 

It would be better to use a separate option for such workarounds.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] i386: avoid expensive ppro ordering workaround for default 686 kernels
  2008-01-03  4:17                 ` Nick Piggin
@ 2008-01-03 14:23                   ` Alan Cox
  2008-01-03 20:20                     ` Benjamin Herrenschmidt
  2008-01-03 23:10                     ` Nick Piggin
  0 siblings, 2 replies; 29+ messages in thread
From: Alan Cox @ 2008-01-03 14:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Hugh Dickins,
	Linux Memory Management List, Benjamin Herrenschmidt

> Hmm, I don't understand what you mean. Obviously other busmasters aren't
> participating in any locking or smp_*mb() ordering protocols.

The unlock paths are visible to busmasters

		write to DMA buffer
		unlock
can turn into

		unlock
		write to DMA buffer

Whether that actually matters in any code we have I don't know.

> > - re-order the assumed processor generations supported to put VIA C3/C5
> > above Preventium Pro
> 
> Adrian Bunk's patch to make each CPU type explicitly selectable IMO is the
> best way to do this.

If you get to pick the combination you want then yes.

> > to 586 < 686 < PPro < C3 < PII < ...
> > 
> > then selecting VIA C3 support will get you a kernel with the properties
> > all the distribution vendors want for their higher end mainstream kernel -
> > "runs on modern systems".
> 
> I don't agree. If we support those options, we should support them properly.
> And if you build an SMP kernel for a 586 for example, you should not get lumped
> with those pentiumpro workarounds.

The cost on a 586 SMP box (ie pentium) is basically nil. All cross CPU
transactions are hideously slow anyway locked or not. On a PII or later I
agree entirely you don't want them by default.

Alan

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] i386: avoid expensive ppro ordering workaround for default 686 kernels
  2008-01-03 14:23                   ` Alan Cox
@ 2008-01-03 20:20                     ` Benjamin Herrenschmidt
  2008-01-03 22:23                       ` Alan Cox
  2008-01-03 23:10                     ` Nick Piggin
  1 sibling, 1 reply; 29+ messages in thread
From: Benjamin Herrenschmidt @ 2008-01-03 20:20 UTC (permalink / raw)
  To: Alan Cox
  Cc: Nick Piggin, Linus Torvalds, Andrew Morton, Hugh Dickins,
	Linux Memory Management List

On Thu, 2008-01-03 at 14:23 +0000, Alan Cox wrote:
> > Hmm, I don't understand what you mean. Obviously other busmasters aren't
> > participating in any locking or smp_*mb() ordering protocols.
> 
> The unlock paths are visible to busmasters
> 
> 		write to DMA buffer
> 		unlock
> can turn into
> 
> 		unlock
> 		write to DMA buffer
> 
> Whether that actually matters in any code we have I don't know.

It matters with writes to MMIO at least, I don't know if your PPro bug
can cause that to be re-ordered tho.

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] i386: avoid expensive ppro ordering workaround for default 686 kernels
  2008-01-03 20:20                     ` Benjamin Herrenschmidt
@ 2008-01-03 22:23                       ` Alan Cox
  0 siblings, 0 replies; 29+ messages in thread
From: Alan Cox @ 2008-01-03 22:23 UTC (permalink / raw)
  To: benh
  Cc: Nick Piggin, Linus Torvalds, Andrew Morton, Hugh Dickins,
	Linux Memory Management List

On Fri, 04 Jan 2008 07:20:43 +1100
> > Whether that actually matters in any code we have I don't know.
> 
> It matters with writes to MMIO at least, I don't know if your PPro bug
> can cause that to be re-ordered tho.

In certain cases yes - the 3Dfx Voodoo cards were particularly good at
triggering it as they map the command registers cleverly to get bursts.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] i386: avoid expensive ppro ordering workaround for default 686 kernels
  2008-01-03 14:23                   ` Alan Cox
  2008-01-03 20:20                     ` Benjamin Herrenschmidt
@ 2008-01-03 23:10                     ` Nick Piggin
  2008-01-04 16:27                       ` Alan Cox
  1 sibling, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2008-01-03 23:10 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Andrew Morton, Hugh Dickins,
	Linux Memory Management List, Benjamin Herrenschmidt

On Thu, Jan 03, 2008 at 02:23:30PM +0000, Alan Cox wrote:
> > Hmm, I don't understand what you mean. Obviously other busmasters aren't
> > participating in any locking or smp_*mb() ordering protocols.
> 
> The unlock paths are visible to busmasters
> 
> 		write to DMA buffer
> 		unlock
> can turn into
> 
> 		unlock
> 		write to DMA buffer
> 
> Whether that actually matters in any code we have I don't know.

These actually require wmb() anyway, because we only guarantee that spinlocks
contain operations to cacheable memory.

Observe that when !CONFIG_SMP, smp_*mb and spin_*lock are basically noops, even
when the existing pentiumpro workaround is in place.

Though wmb() will actually provide the correct ordering.


> > > - re-order the assumed processor generations supported to put VIA C3/C5
> > > above Preventium Pro
> > 
> > Adrian Bunk's patch to make each CPU type explicitly selectable IMO is the
> > best way to do this.
> 
> If you get to pick the combination you want then yes.
> 
> > > to 586 < 686 < PPro < C3 < PII < ...
> > > 
> > > then selecting VIA C3 support will get you a kernel with the properties
> > > all the distribution vendors want for their higher end mainstream kernel -
> > > "runs on modern systems".
> > 
> > I don't agree. If we support those options, we should support them properly.
> > And if you build an SMP kernel for a 586 for example, you should not get lumped
> > with those pentiumpro workarounds.
> 
> The cost on a 586 SMP box (ie pentium) is basically nil. All cross CPU
> transactions are hideously slow anyway locked or not.

That's wrong. Firstly, spinlocks are very often retaken by the same CPU, and they
are also often released while the lock word is still in the cache of the acquiring
CPU. Secondly, the extra code and lock ops in smp_* barriers is fairly expensive.

On 586 SMP systems, didn't lock ops actually go out on the bus, and hence are much
more expensive than they are today (although today they are still one or two orders
of magnitude more expensive than regular memory ops).


> On a PII or later I
> agree entirely you don't want them by default.

And if you build a kernel which is to support 686, PII, and later, you don't want
them by default either, most probably.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] i386: avoid expensive ppro ordering workaround for default 686 kernels
  2008-01-03 23:10                     ` Nick Piggin
@ 2008-01-04 16:27                       ` Alan Cox
  2008-01-07  0:12                         ` Nick Piggin
  0 siblings, 1 reply; 29+ messages in thread
From: Alan Cox @ 2008-01-04 16:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Hugh Dickins,
	Linux Memory Management List, Benjamin Herrenschmidt

> > The cost on a 586 SMP box (ie pentium) is basically nil. All cross CPU
> > transactions are hideously slow anyway locked or not.
> 
> That's wrong. Firstly, spinlocks are very often retaken by the same CPU, and they

Go time it. It certainly used to be the case.

> On 586 SMP systems, didn't lock ops actually go out on the bus, and hence are much
> more expensive than they are today (although today they are still one or two orders
> of magnitude more expensive than regular memory ops).

All cross CPU cache stuff goes that way, not that may P5 SMP boxes are
running today except in Mr Bottomley's residence.

> And if you build a kernel which is to support 686, PII, and later, you don't want
> them by default either, most probably.

I want a kernel for VIA C3 or later which happens to include K6 (still
common), PII+ and maybe not PPro.

So your patch should go in the bitbucket by the sound of it and Adrian's
get merged.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] i386: avoid expensive ppro ordering workaround for default 686 kernels
  2008-01-04 16:27                       ` Alan Cox
@ 2008-01-07  0:12                         ` Nick Piggin
  0 siblings, 0 replies; 29+ messages in thread
From: Nick Piggin @ 2008-01-07  0:12 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Andrew Morton, Hugh Dickins,
	Linux Memory Management List, Benjamin Herrenschmidt

On Fri, Jan 04, 2008 at 04:27:00PM +0000, Alan Cox wrote:
> > > The cost on a 586 SMP box (ie pentium) is basically nil. All cross CPU
> > > transactions are hideously slow anyway locked or not.
> > 
> > That's wrong. Firstly, spinlocks are very often retaken by the same CPU, and they
> 
> Go time it. It certainly used to be the case.

I don't understand what your point is. You bring up socket-to-socket transfers,
but we are talking about the cost of locked ops (which may or may not involve a
cacheline transfer).

For example, a spin lock + unlock (in cache) is around 40 cycles on my Core2. A
ppro style unlock makes it 70.

And it is quite common with a lot of locks for the cacheline *not* to be bouncing
(eg. scheduler runqueue locks).

> > On 586 SMP systems, didn't lock ops actually go out on the bus, and hence are much
> > more expensive than they are today (although today they are still one or two orders
> > of magnitude more expensive than regular memory ops).
> 
> All cross CPU cache stuff goes that way, not that may P5 SMP boxes are
> running today except in Mr Bottomley's residence.
> 
> > And if you build a kernel which is to support 686, PII, and later, you don't want
> > them by default either, most probably.
> 
> I want a kernel for VIA C3 or later which happens to include K6 (still
> common), PII+ and maybe not PPro.

I want a kernel for all those and also ppros, but I'm not willing to hammer
everyone else to work around some rare errata. There are probably quite fewer
SMP ppros than UP ones too.

You're just skirting the whole issue by saying you don't want to support ppro
at all. The issue is having a kernel that supports a wide range of CPUs
including ppro. *You* may not want that. Others may.

> So your patch should go in the bitbucket by the sound of it and Adrian's
> get merged.

Adrian's patch IMO should get merged, sure (although last time it came up, it
got rejected for some maybe not completely valid reasons). But Adrian's patch
does not attack the problem which mine does. If you select PPro + PII + PIII +
etc..., then you will still have SMP ppro errata workarounds compiled in. You
need another option specifically asking whether you want them or not. Ie. my
patch.

What is your exact objection to making this configurable?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2008-01-07  0:12 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-12-18  1:26 [patch] mm: fix PageUptodate memory ordering bug Nick Piggin
2007-12-22  8:57 ` Andrew Morton
2007-12-22 12:14   ` Hugh Dickins
2007-12-23  6:54     ` Nick Piggin
2007-12-23  5:57   ` Nick Piggin
2007-12-23  6:32     ` Andrew Morton
2007-12-23  7:15       ` Nick Piggin
2007-12-23  7:29         ` Andrew Morton
2007-12-23  9:14           ` Nick Piggin
2007-12-23  9:28             ` Andrew Morton
2007-12-23 16:02               ` Andi Kleen
2007-12-30 16:33             ` Ingo Molnar
2008-01-01 23:26               ` Nick Piggin
2008-01-02 21:01                 ` Andi Kleen
2008-01-03  3:32                   ` Nick Piggin
2008-01-03 13:08                     ` Andi Kleen
2007-12-23 17:22         ` Linus Torvalds
2007-12-23 21:35           ` Nick Piggin
2007-12-23 22:41           ` Nick Piggin
2008-01-01 23:41           ` Alan Cox
2008-01-02 11:02             ` [patch] i386: avoid expensive ppro ordering workaround for default 686 kernels Nick Piggin
2008-01-02 13:44               ` Alan Cox
2008-01-03  4:17                 ` Nick Piggin
2008-01-03 14:23                   ` Alan Cox
2008-01-03 20:20                     ` Benjamin Herrenschmidt
2008-01-03 22:23                       ` Alan Cox
2008-01-03 23:10                     ` Nick Piggin
2008-01-04 16:27                       ` Alan Cox
2008-01-07  0:12                         ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox