* [rfc] data race in page table setup/walking?
@ 2008-04-29 5:00 Nick Piggin
2008-04-29 5:08 ` Benjamin Herrenschmidt
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Nick Piggin @ 2008-04-29 5:00 UTC (permalink / raw)
To: Linus Torvalds, Hugh Dickins, linux-arch, Linux Memory Management List
Cc: Benjamin Herrenschmidt
Hi,
I *think* there is a possible data race in the page table walking code. After
the split ptlock patches, it actually seems to have been introduced to the core
code, but even before that I think it would have impacted some architectures.
The race is as follows:
The pte page is allocated, zeroed, and its struct page gets its spinlock
initialized. The mm-wide ptl is then taken, and then the pte page is inserted
into the pagetables.
At this point, the spinlock is not guaranteed to have ordered the previous
stores to initialize the pte page with the subsequent store to put it in the
page tables. So another Linux page table walker might be walking down (without
any locks, because we have split-leaf-ptls), and find that new pte we've
inserted. It might try to take the spinlock before the store from the other
CPU initializes it. And subsequently it might read a pte_t out before stores
from the other CPU have cleared the memory.
There seem to be similar races in higher levels of the page tables. They
obviously don't involve the spinlock, but one could still see uninitialized memory.
Arch code and hardware pagetable walkers that walk the pagetables without
locks could see similar uninitialized memory problems (regardless of whether
we have split ptes or not).
Fortunately, on x86 (except stupid OOSTORE), nothing needs to be done, because
stores are in order, and so are loads. Even on OOSTORE we wouldn't have to take
the smp_wmb hit, if only we had a smp_wmb_before/after_spin_lock function.
This isn't a complete patch yet, but a demonstration of the problem, and an
RFC really as to the form of the solution. I prefer to put the barriers in
core code, because that's where the higher level logic happens, but the page
table accessors are per-arch, and I don't think open-coding them everywhere
is an option.
So anyway... comments, please? Am I dreaming the whole thing up? I suspect
that if I'm not, then powerpc at least might have been impacted by the race,
but as far as I know, they haven't seen stability problems around there...
Might just be terribly rare, though. I'd like to try to make a test program
to reproduce the problem if I can get access to a box...
Thanks,
Nick
Index: linux-2.6/include/asm-x86/pgtable_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_32.h
+++ linux-2.6/include/asm-x86/pgtable_32.h
@@ -179,7 +179,10 @@ static inline int pud_large(pud_t pud) {
#define pte_index(address) \
(((address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
#define pte_offset_kernel(dir, address) \
- ((pte_t *)pmd_page_vaddr(*(dir)) + pte_index((address)))
+({ \
+ pte_t *ret = (pte_t *)pmd_page_vaddr(*(dir)) + pte_index((address)); \
+ smp_read_barrier_depends(); \
+ ret; })
#define pmd_page(pmd) (pfn_to_page(pmd_val((pmd)) >> PAGE_SHIFT))
@@ -188,16 +191,32 @@ static inline int pud_large(pud_t pud) {
#if defined(CONFIG_HIGHPTE)
#define pte_offset_map(dir, address) \
- ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE0) + \
- pte_index((address)))
+({ \
+ pte_t *ret = (pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE0) + \
+ pte_index((address)); \
+ smp_read_barrier_depends(); \
+ ret; \
+})
+
#define pte_offset_map_nested(dir, address) \
- ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE1) + \
- pte_index((address)))
+({ \
+ pte_t *ret = (pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE1) + \
+ pte_index((address)); \
+ smp_read_barrier_depends(); \
+ ret; \
+})
+
#define pte_unmap(pte) kunmap_atomic((pte), KM_PTE0)
#define pte_unmap_nested(pte) kunmap_atomic((pte), KM_PTE1)
#else
#define pte_offset_map(dir, address) \
- ((pte_t *)page_address(pmd_page(*(dir))) + pte_index((address)))
+({ \
+ pte_t *ret = (pte_t *)page_address(pmd_page(*(dir))) + \
+ pte_index((address)); \
+ smp_read_barrier_depends(); \
+ ret; \
+})
+
#define pte_offset_map_nested(dir, address) pte_offset_map((dir), (address))
#define pte_unmap(pte) do { } while (0)
#define pte_unmap_nested(pte) do { } while (0)
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -311,6 +311,13 @@ int __pte_alloc(struct mm_struct *mm, pm
if (!new)
return -ENOMEM;
+ /*
+ * Ensure all pte setup (eg. pte page lock and page clearing) are
+ * visible before the pte is made visible to other CPUs by being
+ * put into page tables.
+ */
+ smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
+
spin_lock(&mm->page_table_lock);
if (!pmd_present(*pmd)) { /* Has another populated it ? */
mm->nr_ptes++;
@@ -329,6 +336,8 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
if (!new)
return -ENOMEM;
+ smp_wmb(); /* See comment in __pte_alloc */
+
spin_lock(&init_mm.page_table_lock);
if (!pmd_present(*pmd)) { /* Has another populated it ? */
pmd_populate_kernel(&init_mm, pmd, new);
@@ -2546,6 +2555,8 @@ int __pud_alloc(struct mm_struct *mm, pg
if (!new)
return -ENOMEM;
+ smp_wmb(); /* See comment in __pte_alloc */
+
spin_lock(&mm->page_table_lock);
if (pgd_present(*pgd)) /* Another has populated it */
pud_free(mm, new);
@@ -2567,6 +2578,8 @@ int __pmd_alloc(struct mm_struct *mm, pu
if (!new)
return -ENOMEM;
+ smp_wmb(); /* See comment in __pte_alloc */
+
spin_lock(&mm->page_table_lock);
#ifndef __ARCH_HAS_4LEVEL_HACK
if (pud_present(*pud)) /* Another has populated it */
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [rfc] data race in page table setup/walking?
2008-04-29 5:00 [rfc] data race in page table setup/walking? Nick Piggin
@ 2008-04-29 5:08 ` Benjamin Herrenschmidt
2008-04-29 5:41 ` Nick Piggin
2008-04-29 10:56 ` David Miller
2008-04-29 12:36 ` Hugh Dickins
2 siblings, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2008-04-29 5:08 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Hugh Dickins, linux-arch, Linux Memory Management List
On Tue, 2008-04-29 at 07:00 +0200, Nick Piggin wrote:
>
> At this point, the spinlock is not guaranteed to have ordered the previous
> stores to initialize the pte page with the subsequent store to put it in the
> page tables. So another Linux page table walker might be walking down (without
> any locks, because we have split-leaf-ptls), and find that new pte we've
> inserted. It might try to take the spinlock before the store from the other
> CPU initializes it. And subsequently it might read a pte_t out before stores
> from the other CPU have cleared the memory.
Funny, we used to have a similar race where the zeros for clearing a
newly allocated anonymous page ended up reaching the coherency domain
after the new PTE in set_pte, causing memory corruption on threaded
apps. I think back then we fixed that with an explicit smp_wmb() before
a set_pte(). Maybe we need that also when setting the higher levels.
Ben.
* Re: [rfc] data race in page table setup/walking?
2008-04-29 5:08 ` Benjamin Herrenschmidt
@ 2008-04-29 5:41 ` Nick Piggin
0 siblings, 0 replies; 20+ messages in thread
From: Nick Piggin @ 2008-04-29 5:41 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Hugh Dickins, linux-arch, Linux Memory Management List
On Tue, Apr 29, 2008 at 03:08:44PM +1000, Benjamin Herrenschmidt wrote:
>
> On Tue, 2008-04-29 at 07:00 +0200, Nick Piggin wrote:
> >
> > At this point, the spinlock is not guaranteed to have ordered the previous
> > stores to initialize the pte page with the subsequent store to put it in the
> > page tables. So another Linux page table walker might be walking down (without
> > any locks, because we have split-leaf-ptls), and find that new pte we've
> > inserted. It might try to take the spinlock before the store from the other
> > CPU initializes it. And subsequently it might read a pte_t out before stores
> > from the other CPU have cleared the memory.
>
> Funny, we used to have a similar race where the zeros for clearing a
> newly allocated anonymous pages end up reaching the coherency domain
> after the new PTE in set_pte, causing memory corruption on threaded
> apps. I think back then we fixed that with an explicit smp_wmb() before
> a set_pte().
Yep, I remember that one. We had the same problem with inserting pages
into the pagecache radix-tree, so I recently changed the fix to encompass
both problems: the barriers are now in SetPageUptodate and (Test)PageUptodate.
> Maybe we need that also when setting the higher levels.
That is my reading of the situation, yes.
* Re: [rfc] data race in page table setup/walking?
2008-04-29 5:00 [rfc] data race in page table setup/walking? Nick Piggin
2008-04-29 5:08 ` Benjamin Herrenschmidt
@ 2008-04-29 10:56 ` David Miller
2008-04-29 12:36 ` Hugh Dickins
2 siblings, 0 replies; 20+ messages in thread
From: David Miller @ 2008-04-29 10:56 UTC (permalink / raw)
To: npiggin; +Cc: torvalds, hugh, linux-arch, linux-mm, benh
> So anyway... comments, please? Am I dreaming the whole thing up? I suspect
> that if I'm not, then powerpc at least might have been impacted by the race,
> but as far as I know of, they haven't seen stability problems around there...
> Might just be terribly rare, though. I'd like to try to make a test program
> to reproduce the problem if I can get access to a box...
This definitely does look like a real problem, albeit pretty hard to
trigger I would say. :-)
Thanks for looking into this.
* Re: [rfc] data race in page table setup/walking?
2008-04-29 5:00 [rfc] data race in page table setup/walking? Nick Piggin
2008-04-29 5:08 ` Benjamin Herrenschmidt
2008-04-29 10:56 ` David Miller
@ 2008-04-29 12:36 ` Hugh Dickins
2008-04-29 21:37 ` Benjamin Herrenschmidt
2008-04-30 6:03 ` Nick Piggin
2 siblings, 2 replies; 20+ messages in thread
From: Hugh Dickins @ 2008-04-29 12:36 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, linux-arch, Linux Memory Management List,
Benjamin Herrenschmidt
On Tue, 29 Apr 2008, Nick Piggin wrote:
> I *think* there is a possible data race in the page table walking code. After
> the split ptlock patches, it actually seems to have been introduced to the core
> code, but even before that I think it would have impacted some architectures.
>
> The race is as follows:
> The pte page is allocated, zeroed, and its struct page gets its spinlock
> initialized. The mm-wide ptl is then taken, and then the pte page is inserted
> into the pagetables.
>
> At this point, the spinlock is not guaranteed to have ordered the previous
> stores to initialize the pte page with the subsequent store to put it in the
> page tables. So another Linux page table walker might be walking down (without
> any locks, because we have split-leaf-ptls), and find that new pte we've
> inserted. It might try to take the spinlock before the store from the other
> CPU initializes it. And subsequently it might read a pte_t out before stores
> from the other CPU have cleared the memory.
>
> There seem to be similar races in higher levels of the page tables, but they
> obviously don't involve the spinlock, but one could see uninitialized memory.
It's sad, but I have to believe you're right. I'm slightly more barrier-
aware now than I was back when doing split ptlock (largely thanks to your
persistence); and looking back at it, I cannot now imagine how it could
be correct to remove a lock from that walkdown without adding barriers.
Ugh. It's just so irritating to introduce these blockages against
such a remote possibility (but there again, that's what so much of
kernel code has to be about). Is there any other way of handling it?
>
> Arch code and hardware pagetable walkers that walk the pagetables without
> locks could see similar uninitialized memory problems (regardless of whether
> we have split ptes or not).
The hardware walkers, hmm. Well, I guess each arch has its own rules
to protect against those, and all you can do is provide a macro for
each to fill in. You assume smp_read_barrier_depends versus smp_wmb
below: sure of those, or is it worth providing particular new stubs?
>
> Fortunately, on x86 (except stupid OOSTORE), nothing needs to be done, because
> stores are in order, and so are loads. Even on OOSTORE we wouldn't have to take
> the smp_wmb hit, if only we have a smp_wmb_before/after_spin_lock function.
>
> This isn't a complete patch yet, but a demonstration of the problem, and an
> RFC really as to the form of the solution. I prefer to put the barriers in
> core code, because that's where the higher level logic happens, but the page
> table accessors are per-arch, and open-coding them everywhere I don't think
> is an option.
If there's no better way (I think not), this looks about right to me;
though I leave all the hard thought to you ;)
While I'm in the confessional, something else you probably need to
worry about there: handle_pte_fault's "entry = *pte" without holding
the lock; several cases are self-righting, but there's pte_unmap_same
for a couple of cases where we need to make sure of the right decision
- presently it's only worrying about the PAE case, when it might have
got the top of one pte with the bottom of another, but now you need
some barrier thinking? Oh, perhaps this is already safely covered
by your pte_offset_map.
The pte_offset_kernel one (aside from the trivial of needing a ret):
I'm not convinced that needs to be changed at all. I still believe,
as I believed at split ptlock time, that the kernel walkdowns need
no locking (or barriers) of their own: that it's a separate kernel
bug if a kernel subsystem is making speculative accesses to addresses
it cannot be sure have been allocated. Counter-examples?
Ah, but perhaps naughty userspace (depending on architecture) could
make those speculative accesses into kernel address space, and have
a chance of striking lucky with the hardware walker, without proper
barriers at the kernel end?
>
> So anyway... comments, please? Am I dreaming the whole thing up? I suspect
> that if I'm not, then powerpc at least might have been impacted by the race,
> but as far as I know of, they haven't seen stability problems around there...
> Might just be terribly rare, though. I'd like to try to make a test program
> to reproduce the problem if I can get access to a box...
Please do, if you're feeling ingenious: it's tiresome adding overhead
without being able to show it's really achieved something.
>
> Thanks,
> Nick
>
> Index: linux-2.6/include/asm-x86/pgtable_32.h
> ===================================================================
> --- linux-2.6.orig/include/asm-x86/pgtable_32.h
> +++ linux-2.6/include/asm-x86/pgtable_32.h
> @@ -179,7 +179,10 @@ static inline int pud_large(pud_t pud) {
> #define pte_index(address) \
> (((address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
> #define pte_offset_kernel(dir, address) \
> - ((pte_t *)pmd_page_vaddr(*(dir)) + pte_index((address)))
> +{( \
> + (pte_t *)pmd_page_vaddr(*(dir)) + pte_index((address));\
> + smp_read_barrier_depends(); \
> +})
>
> #define pmd_page(pmd) (pfn_to_page(pmd_val((pmd)) >> PAGE_SHIFT))
>
> @@ -188,16 +191,32 @@ static inline int pud_large(pud_t pud) {
>
> #if defined(CONFIG_HIGHPTE)
> #define pte_offset_map(dir, address) \
> - ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE0) + \
> - pte_index((address)))
> +{( \
> + pte_t *ret = (pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE0) + \
> + pte_index((address)); \
> + smp_read_barrier_depends(); \
> + ret; \
> +)}
> +
> #define pte_offset_map_nested(dir, address) \
> - ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE1) + \
> - pte_index((address)))
> +{( \
> + pte_t *ret = (pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE1) + \
> + pte_index((address)); \
> + smp_read_barrier_depends(); \
> + ret; \
> +)}
> +
> #define pte_unmap(pte) kunmap_atomic((pte), KM_PTE0)
> #define pte_unmap_nested(pte) kunmap_atomic((pte), KM_PTE1)
> #else
> #define pte_offset_map(dir, address) \
> - ((pte_t *)page_address(pmd_page(*(dir))) + pte_index((address)))
> +{( \
> + pte_t *ret = (pte_t *)page_address(pmd_page(*(dir))) + \
> + pte_index((address)); \
> + smp_read_barrier_depends(); \
> + ret; \
> +)}
> +
> #define pte_offset_map_nested(dir, address) pte_offset_map((dir), (address))
> #define pte_unmap(pte) do { } while (0)
> #define pte_unmap_nested(pte) do { } while (0)
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -311,6 +311,13 @@ int __pte_alloc(struct mm_struct *mm, pm
> if (!new)
> return -ENOMEM;
>
> + /*
> + * Ensure all pte setup (eg. pte page lock and page clearing) are
> + * visible before the pte is made visible to other CPUs by being
> + * put into page tables.
> + */
> + smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
> +
> spin_lock(&mm->page_table_lock);
> if (!pmd_present(*pmd)) { /* Has another populated it ? */
> mm->nr_ptes++;
> @@ -329,6 +336,8 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
> if (!new)
> return -ENOMEM;
>
> + smp_wmb(); /* See comment in __pte_alloc */
> +
> spin_lock(&init_mm.page_table_lock);
> if (!pmd_present(*pmd)) { /* Has another populated it ? */
> pmd_populate_kernel(&init_mm, pmd, new);
> @@ -2546,6 +2555,8 @@ int __pud_alloc(struct mm_struct *mm, pg
> if (!new)
> return -ENOMEM;
>
> + smp_wmb(); /* See comment in __pte_alloc */
> +
> spin_lock(&mm->page_table_lock);
> if (pgd_present(*pgd)) /* Another has populated it */
> pud_free(mm, new);
> @@ -2567,6 +2578,8 @@ int __pmd_alloc(struct mm_struct *mm, pu
> if (!new)
> return -ENOMEM;
>
> + smp_wmb(); /* See comment in __pte_alloc */
> +
> spin_lock(&mm->page_table_lock);
> #ifndef __ARCH_HAS_4LEVEL_HACK
> if (pud_present(*pud)) /* Another has populated it */
* Re: [rfc] data race in page table setup/walking?
2008-04-29 12:36 ` Hugh Dickins
@ 2008-04-29 21:37 ` Benjamin Herrenschmidt
2008-04-29 22:47 ` Hugh Dickins
2008-04-30 6:03 ` Nick Piggin
1 sibling, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2008-04-29 21:37 UTC (permalink / raw)
To: Hugh Dickins
Cc: Nick Piggin, Linus Torvalds, linux-arch, Linux Memory Management List
On Tue, 2008-04-29 at 13:36 +0100, Hugh Dickins wrote:
>
> Ugh. It's just so irritating to introduce these blockages against
> such a remote possibility (but there again, that's what so much of
> kernel code has to be about). Is there any other way of handling it?
Not that much overhead... I think smp_read_barrier_depends() is a nop on
most archs, no? The data dependency between all the pointers takes care
of ordering in many cases. So it boils down to smp_wmb's when setting,
which is not that expensive.
Cheers,
Ben.
* Re: [rfc] data race in page table setup/walking?
2008-04-29 21:37 ` Benjamin Herrenschmidt
@ 2008-04-29 22:47 ` Hugh Dickins
2008-04-30 0:09 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 20+ messages in thread
From: Hugh Dickins @ 2008-04-29 22:47 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Nick Piggin, Linus Torvalds, linux-arch, Linux Memory Management List
On Wed, 30 Apr 2008, Benjamin Herrenschmidt wrote:
> On Tue, 2008-04-29 at 13:36 +0100, Hugh Dickins wrote:
> >
> > Ugh. It's just so irritating to introduce these blockages against
> > such a remote possibility (but there again, that's what so much of
> > kernel code has to be about). Is there any other way of handling it?
>
> Not that much overhead... I think smp_read_barrier_depends() is a nop on
> most archs no ? The data dependency between all the pointers takes care
> of ordering in many cases.
Ah, you're right, I was automatically thinking smp_rmb, whereas this
is the only_does_something_on_alpha_mb (nice to see those impressively
long comments on a "do { } while (0)" in some of the other arches ;)
(Well, frv says "barrier()" for it - does it actually need that?)
Yes, that's not bad at all; though in that case,
I am surprised it's enough to patch up the issue.
> So it boils down to smp_wmb's when setting
> which is not that expensive.
Yes, I wasn't worried about the much less common and anyway heavier
write (allocate) path, I don't begrudge the smp_wmb's there.
Thanks for calming me down!
Hugh
* Re: [rfc] data race in page table setup/walking?
2008-04-29 22:47 ` Hugh Dickins
@ 2008-04-30 0:09 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2008-04-30 0:09 UTC (permalink / raw)
To: Hugh Dickins
Cc: Nick Piggin, Linus Torvalds, linux-arch, Linux Memory Management List
On Tue, 2008-04-29 at 23:47 +0100, Hugh Dickins wrote:
> I am surprised it's enough to patch up the issue.
Well, we get lucky here because there's a data dependency between all
the loads... the last one needs the result from the previous one etc...
Only alpha is crazy enough to require barriers in that case as far as I
know :-)
Cheers,
Ben.
* Re: [rfc] data race in page table setup/walking?
2008-04-29 12:36 ` Hugh Dickins
2008-04-29 21:37 ` Benjamin Herrenschmidt
@ 2008-04-30 6:03 ` Nick Piggin
2008-04-30 6:05 ` David Miller, Nick Piggin
` (2 more replies)
1 sibling, 3 replies; 20+ messages in thread
From: Nick Piggin @ 2008-04-30 6:03 UTC (permalink / raw)
To: Hugh Dickins
Cc: Linus Torvalds, linux-arch, Linux Memory Management List,
Benjamin Herrenschmidt
On Tue, Apr 29, 2008 at 01:36:41PM +0100, Hugh Dickins wrote:
> On Tue, 29 Apr 2008, Nick Piggin wrote:
> > I *think* there is a possible data race in the page table walking code. After
> > the split ptlock patches, it actually seems to have been introduced to the core
> > code, but even before that I think it would have impacted some architectures.
> >
> > The race is as follows:
> > The pte page is allocated, zeroed, and its struct page gets its spinlock
> > initialized. The mm-wide ptl is then taken, and then the pte page is inserted
> > into the pagetables.
> >
> > At this point, the spinlock is not guaranteed to have ordered the previous
> > stores to initialize the pte page with the subsequent store to put it in the
> > page tables. So another Linux page table walker might be walking down (without
> > any locks, because we have split-leaf-ptls), and find that new pte we've
> > inserted. It might try to take the spinlock before the store from the other
> > CPU initializes it. And subsequently it might read a pte_t out before stores
> > from the other CPU have cleared the memory.
> >
> > There seem to be similar races in higher levels of the page tables, but they
> > obviously don't involve the spinlock, but one could see uninitialized memory.
>
> It's sad, but I have to believe you're right. I'm slightly more barrier-
> aware now than I was back when doing split ptlock (largely thanks to your
> persistence); and looking back at it, I cannot now imagine how it could
> be correct to remove a lock from that walkdown without adding barriers.
Well don't worry too much, I was one of the reviewers of that code too :P In
our defence, there were pre-existing counter-examples of lockless page table
walking in arch code... but it is sometimes just really hard to spot these
ordering races. We've had many, many others in mm/, I'm afraid to say.
> Ugh. It's just so irritating to introduce these blockages against
> such a remote possibility (but there again, that's what so much of
> kernel code has to be about). Is there any other way of handling it?
As Ben pointed out, the overhead is not too bad. On the read path, only
Alpha would care (and if Alpha was more than a curiosity at this point,
I guess they could introduce a lighter barrier, or detect if a specific
implementation doesn't require data dep barriers).
> > Arch code and hardware pagetable walkers that walk the pagetables without
> > locks could see similar uninitialized memory problems (regardless of whether
> > we have split ptes or not).
>
> The hardware walkers, hmm. Well, I guess each arch has its own rules
> to protect against those, and all you can do is provide a macro for
> each to fill in. You assume smp_read_barrier_depends versus smp_wmb
> below: sure of those, or is it worth providing particular new stubs?
Yes, it definitely is a data dependency barrier: the load of the pte page
spinlock or the ptes out of the page itself depends on the load of the
pointer to the pte page.
Hardware walkers, I shouldn't worry too much about, except as a thought
exercise to realise that we have lockless readers. I think(?) alpha can
walk the linux ptes in hardware on TLB miss, but surely they will have
to do the requisite barriers in hardware too (otherwise things get
really messy)
Powerpc's find_linux_pte is one of the software walked lockless ones.
That's basically how I imagine hardware walkers essentially should operate.
> > This isn't a complete patch yet, but a demonstration of the problem, and an
> > RFC really as to the form of the solution. I prefer to put the barriers in
> > core code, because that's where the higher level logic happens, but the page
> > table accessors are per-arch, and open-coding them everywhere I don't think
> > is an option.
>
> If there's no better way (I think not), this looks about right to me;
> though I leave all the hard thought to you ;)
I'll work on it ;) Thanks for the comments.
> While I'm in the confessional, something else you probably need to
> worry about there: handle_pte_fault's "entry = *pte" without holding
> the lock; several cases are self-righting, but there's pte_unmap_same
> for a couple of cases where we need to make sure of the right decision
> - presently it's only worrying about the PAE case, when it might have
> got the top of one pte with the bottom of another, but now you need
> some barrier thinking? Oh, perhaps this is already safely covered
> by your pte_offset_map.
Yes I think it should be OK to dereference it because we came to it
from pte_alloc_map.
The issue of taking the top or bottom of the pte I think is a different
data race, and yes I think we don't have to worry about it (although
it would be nice to wrap _all_ page table dereferences in functions, so
we can audit and modify them more easily).
Actually, aside, all those smp_wmb() things in pgtable-3level.h can
probably go away if we cared: because we could be sneaky and leverage
the assumption that top and bottom will always be in the same cacheline
and thus should be shielded from memory consistency problems :)
> The pte_offset_kernel one (aside from the trivial of needing a ret):
> I'm not convinced that needs to be changed at all. I still believe,
> as I believed at split ptlock time, that the kernel walkdowns need
> no locking (or barriers) of their own: that it's a separate kernel
> bug if a kernel subsystem is making speculative accesses to addresses
> it cannot be sure have been allocated. Counter-examples?
>
> Ah, but perhaps naughty userspace (depending on architecture) could
> make those speculative accesses into kernel address space, and have
> a chance of striking lucky with the hardware walker, without proper
> barriers at the kernel end?
I'm not sure about that. Apparently the hardware prefetcher can do
pretty wild things on some CPUs including setting up TLBs. As far
as userspace access goes, I'm not completely sure, either.
My thinking is that it might be better not to take any chances even
in the kernel path. I guess I should comment my thinking, so that it
can be easier to understand/dispute in future.
> > So anyway... comments, please? Am I dreaming the whole thing up? I suspect
> > that if I'm not, then powerpc at least might have been impacted by the race,
> > but as far as I know of, they haven't seen stability problems around there...
> > Might just be terribly rare, though. I'd like to try to make a test program
> > to reproduce the problem if I can get access to a box...
>
> Please do, if you're feeling ingenious: it's tiresome adding overhead
> without being able to show it's really achieved something.
Heh ;) I'll try to kick some grey cells into action and think up something!
I'd still like to demonstrate it even if everyone agrees that it is a
problem.
Thanks,
Nick
* Re: [rfc] data race in page table setup/walking?
2008-04-30 6:03 ` Nick Piggin
@ 2008-04-30 6:05 ` David Miller, Nick Piggin
2008-04-30 6:17 ` Nick Piggin
2008-04-30 11:14 ` Hugh Dickins
2008-04-30 15:53 ` Linus Torvalds
2 siblings, 1 reply; 20+ messages in thread
From: David Miller @ 2008-04-30 6:05 UTC (permalink / raw)
To: npiggin; +Cc: hugh, torvalds, linux-arch, linux-mm, benh
> Hardware walkers, I shouldn't worry too much about, except as a thought
> exercise to realise that we have lockless readers. I think(?) alpha can
> walk the linux ptes in hardware on TLB miss, but surely they will have
> to do the requisite barriers in hardware too (otherwise things get
> really messy)
My understanding is that all Alpha implementations walk the
page tables in PAL code.
> Powerpc's find_linux_pte is one of the software walked lockless ones.
> That's basically how I imagine hardware walkers essentially should operate.
Sparc64 walks the page tables lockless in its TLB hash table miss
handling.
MIPS does something similar.
* Re: [rfc] data race in page table setup/walking?
2008-04-30 6:05 ` David Miller, Nick Piggin
@ 2008-04-30 6:17 ` Nick Piggin
0 siblings, 0 replies; 20+ messages in thread
From: Nick Piggin @ 2008-04-30 6:17 UTC (permalink / raw)
To: David Miller; +Cc: hugh, torvalds, linux-arch, linux-mm, benh
On Tue, Apr 29, 2008 at 11:05:43PM -0700, David Miller wrote:
> From: Nick Piggin <npiggin@suse.de>
> Date: Wed, 30 Apr 2008 08:03:40 +0200
>
> > Hardware walkers, I shouldn't worry too much about, except as a thought
> > exercise to realise that we have lockless readers. I think(?) alpha can
> > walk the linux ptes in hardware on TLB miss, but surely they will have
> > to do the requisite barriers in hardware too (otherwise things get
> > really messy)
>
> My understanding is that all Alpha implementations walk the
> page tables in PAL code.
Ah OK. I guess that's effectively "hardware" as far as Linux is concerned.
I guess even x86 really walks the page tables in microcode as well. Basically
I just mean something that is invisible to, and oblivious of, Linux's
locking.
> > Powerpc's find_linux_pte is one of the software walked lockless ones.
> > That's basically how I imagine hardware walkers essentially should operate.
>
> Sparc64 walks the page tables lockless in its TLB hash table miss
> handling.
>
> MIPS does something similar.
Interesting, thanks.
* Re: [rfc] data race in page table setup/walking?
2008-04-30 6:03 ` Nick Piggin
2008-04-30 6:05 ` David Miller, Nick Piggin
@ 2008-04-30 11:14 ` Hugh Dickins
2008-05-01 0:35 ` Nick Piggin
2008-04-30 15:53 ` Linus Torvalds
2 siblings, 1 reply; 20+ messages in thread
From: Hugh Dickins @ 2008-04-30 11:14 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, linux-arch, Linux Memory Management List,
Benjamin Herrenschmidt
On Wed, 30 Apr 2008, Nick Piggin wrote:
>
> Actually, aside, all those smp_wmb() things in pgtable-3level.h can
> probably go away if we cared: because we could be sneaky and leverage
> the assumption that top and bottom will always be in the same cacheline
> and thus should be shielded from memory consistency problems :)
I've sometimes wondered along those lines. But it would need
interrupts disabled, wouldn't it? And could SMM mess it up?
And what about another CPU taking the cacheline to modify it
in between our two accesses?
I don't think we do care in that x86 PAE case, but as a general
principle, if it can be safely assumed on all architectures (or
more messily, just on some) under certain conditions, then shouldn't
we be looking to use that technique (relying on a consistent view of
separate variables clustered into the same cacheline) in critical
places, rather than regarding it as sneaky?
But I suspect this is a chimaera, that there's actually no
safe use to be made of it. I'd be glad to be shown wrong.
Hugh
* Re: [rfc] data race in page table setup/walking?
2008-04-30 6:03 ` Nick Piggin
2008-04-30 6:05 ` David Miller, Nick Piggin
2008-04-30 11:14 ` Hugh Dickins
@ 2008-04-30 15:53 ` Linus Torvalds
2008-05-01 0:29 ` Nick Piggin
2 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2008-04-30 15:53 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Benjamin Herrenschmidt
On Wed, 30 Apr 2008, Nick Piggin wrote:
>
> Actually, aside, all those smp_wmb() things in pgtable-3level.h can
> probably go away if we cared: because we could be sneaky and leverage
> the assumption that top and bottom will always be in the same cacheline
> and thus should be shielded from memory consistency problems :)
Umm.
Why would we care, since smp_wmb() is a no-op? (Yea, it's a compiler
barrier, big deal, it's not going to cost us anything).
Also, write barriers are not about cacheline access order, they tend to be
more about the write *buffer*, ie before the write even hits the cache
line. And a write could easily pass another write in the write buffer if
there is (for example) a dependency on the address.
So even if they are in the same cacheline, if the first write needs an
offset addition, and the second one does not, it could easily be that the
second one hits the write buffer first (together with some alias
detection that re-does the things if they alias).
Of course, on x86, the write ordering is strictly defined, and even if the
CPU reorders writes they are guaranteed to never show up re-ordered, so
this is not an issue. But I wanted to point out that memory ordering is
*not* just about cachelines, and being in the same cacheline is no
guarantee of anything, even if it can have *some* effects.
Linus
* Re: [rfc] data race in page table setup/walking?
2008-04-30 15:53 ` Linus Torvalds
@ 2008-05-01 0:29 ` Nick Piggin
2008-05-01 3:24 ` Linus Torvalds
0 siblings, 1 reply; 20+ messages in thread
From: Nick Piggin @ 2008-05-01 0:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Benjamin Herrenschmidt
On Wed, Apr 30, 2008 at 08:53:44AM -0700, Linus Torvalds wrote:
>
>
> On Wed, 30 Apr 2008, Nick Piggin wrote:
> >
> > Actually, aside, all those smp_wmb() things in pgtable-3level.h can
> > probably go away if we cared: because we could be sneaky and leverage
> > the assumption that top and bottom will always be in the same cacheline
> > and thus should be shielded from memory consistency problems :)
>
> Umm.
>
> Why would we care, since smp_wmb() is a no-op? (Yea, it's a compiler
> barrier, big deal, it's not going to cost us anything).
Oh there needs to be a compiler barrier there. I was just saying...
I don't actually think we care (whether or not I'm right).
> Also, write barriers are not about cacheline access order, they tend to be
> more about the write *buffer*, ie before the write even hits the cache
> line. And a write could easily pass another write in the write buffer if
> there is (for example) a dependency on the address.
>
> So even if they are in the same cacheline, if the first write needs an
> offset addition, and the second one does not, it could easily be that the
> second one hits the write buffer first (together with some alias
> detection that re-does the things if they alias).
>
> Of course, on x86, the write ordering is strictly defined, and even if the
> CPU reorders writes they are guaranteed to never show up re-ordered, so
> this is not an issue. But I wanted to point out that memory ordering is
> *not* just about cachelines, and being in the same cacheline is no
> guarantee of anything, even if it can have *some* effects.
Well it is a guarantee about cache coherency presumably, but I guess
you're taking that for granted.
But I'm surprised that two writes to the same cacheline (different
words) can be reordered. Of course write buffers are technically outside
the coherency domain, but I would have thought any implementation will
actually treat writes to the same line as aliasing. Is there a counter
example?
* Re: [rfc] data race in page table setup/walking?
2008-04-30 11:14 ` Hugh Dickins
@ 2008-05-01 0:35 ` Nick Piggin
2008-05-01 12:45 ` Hugh Dickins
0 siblings, 1 reply; 20+ messages in thread
From: Nick Piggin @ 2008-05-01 0:35 UTC (permalink / raw)
To: Hugh Dickins
Cc: Linus Torvalds, linux-arch, Linux Memory Management List,
Benjamin Herrenschmidt
On Wed, Apr 30, 2008 at 12:14:51PM +0100, Hugh Dickins wrote:
> On Wed, 30 Apr 2008, Nick Piggin wrote:
> >
> > Actually, aside, all those smp_wmb() things in pgtable-3level.h can
> > probably go away if we cared: because we could be sneaky and leverage
> > the assumption that top and bottom will always be in the same cacheline
> > and thus should be shielded from memory consistency problems :)
>
> I've sometimes wondered along those lines. But it would need
> interrupts disabled, wouldn't it? And could SMM mess it up?
> And what about another CPU taking the cacheline to modify it
> in between our two accesses?
Nothing that could not already happen with the smp_wmb in there,
AFAIKS.
> I don't think we do care in that x86 PAE case, but as a general
> principal, if it can be safely assumed on all architectures (or
> more messily, just on some) under certain conditions, then shouldn't
> we be looking to use that technique (relying on a consistent view of
> separate variables clustered into the same cacheline) in critical
> places, rather than regarding it as sneaky?
>
> But I suspect this is a chimaera, that there's actually no
> safe use to be made of it. I'd be glad to be shown wrong.
Well Linus put a dampener on it... but if it actually did work, then
yeah I guess there are some places it could be used. I suspect that
on some implementations, being in the same cacheline would actually
fully order all transactions of a CPU, so if it did make a big
difference anywhere, we could have smp_*mb_cacheline() or something ;)
* Re: [rfc] data race in page table setup/walking?
2008-05-01 0:29 ` Nick Piggin
@ 2008-05-01 3:24 ` Linus Torvalds
2008-05-02 1:20 ` Nick Piggin
0 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2008-05-01 3:24 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Benjamin Herrenschmidt
On Thu, 1 May 2008, Nick Piggin wrote:
> >
> > Of course, on x86, the write ordering is strictly defined, and even if the
> > CPU reorders writes they are guaranteed to never show up re-ordered, so
> > this is not an issue. But I wanted to point out that memory ordering is
> > *not* just about cachelines, and being in the same cacheline is no
> > guarantee of anything, even if it can have *some* effects.
>
> Well it is a guarantee about cache coherency presumably, but I guess
> you're taking that for granted.
Yes, I'm taking cache coherency for granted, I don't think it's worth even
worrying about non-coherent cases.
> But I'm surprised that two writes to the same cacheline (different
> words) can be reordered. Of course write buffers are technically outside
> the coherency domain, but I would have thought any implementation will
> actually treat writes to the same line as aliasing. Is there a counter
> example?
I don't know if anybody does it, but no, normally I would *not* expect any
alias logic to have anything to do with cachelines. Aliasing within a
cacheline is so common (spills to the stack, if nothing else) that if the
CPU has some write buffer alias logic, I'd expect it to be byte or perhaps
word-granular.
So I think that at least in theory it is quite possible that a later write
hits the same cacheline first, just because the write data or address got
resolved first and the architecture allows out-of-order memory accesses.
Whether you'll ever see it in practice, I don't know. Never on x86, of
course.
Linus
* Re: [rfc] data race in page table setup/walking?
2008-05-01 0:35 ` Nick Piggin
@ 2008-05-01 12:45 ` Hugh Dickins
0 siblings, 0 replies; 20+ messages in thread
From: Hugh Dickins @ 2008-05-01 12:45 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, linux-arch, Linux Memory Management List,
Benjamin Herrenschmidt
On Thu, 1 May 2008, Nick Piggin wrote:
> On Wed, Apr 30, 2008 at 12:14:51PM +0100, Hugh Dickins wrote:
> > On Wed, 30 Apr 2008, Nick Piggin wrote:
> > >
> > > Actually, aside, all those smp_wmb() things in pgtable-3level.h can
> > > probably go away if we cared: because we could be sneaky and leverage
> > > the assumption that top and bottom will always be in the same cacheline
> > > and thus should be shielded from memory consistency problems :)
> >
> > I've sometimes wondered along those lines. But it would need
> > interrupts disabled, wouldn't it? And could SMM mess it up?
> > And what about another CPU taking the cacheline to modify it
> > in between our two accesses?
>
> Nothing that could not already happen with the smp_wmb in there,
> AFAIKS.
Yes, one does wonder just what I was wondering ;)
Hugh
* Re: [rfc] data race in page table setup/walking?
2008-05-01 3:24 ` Linus Torvalds
@ 2008-05-02 1:20 ` Nick Piggin
2008-05-02 1:33 ` Linus Torvalds
0 siblings, 1 reply; 20+ messages in thread
From: Nick Piggin @ 2008-05-02 1:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Benjamin Herrenschmidt
On Wed, Apr 30, 2008 at 08:24:48PM -0700, Linus Torvalds wrote:
>
> On Thu, 1 May 2008, Nick Piggin wrote:
> > But I'm surprised that two writes to the same cacheline (different
> > words) can be reordered. Of course write buffers are technically outside
> > the coherency domain, but I would have thought any implementation will
> > actually treat writes to the same line as aliasing. Is there a counter
> > example?
>
> I don't know if anybody does it, but no, normally I would *not* expect any
> alias logic to have anything to do with cachelines. Aliasing within a
> cacheline is so common (spills to the stack, if nothing else) that if the
> CPU has some write buffer alias logic, I'd expect it to be byte or perhaps
> word-granular.
>
> So I think that at least in theory it is quite possible that a later write
> hits the same cacheline first, just because the write data or address got
> resolved first and the architecture allows out-of-order memory accesses.
I guess it is possible. But at least in the case of the write address, you'd
have to wait for the later store anyway in order to do the alias detection,
which might be the most common case.
For other dependencies yes, although I would have thought that you'd be
better off waiting for the earlier write so the two can be combined into
a single cache transaction. The easy part of stores is queueing them;
the hard part is moving them out to cache.
Anyway I'm speculating at this point. You do raise a valid issue, so
obviously we can't make any such assumptions without verifying it on a
per-arch basis ;) I'm just interested to know whether this happens on
any CPU we run on.
* Re: [rfc] data race in page table setup/walking?
2008-05-02 1:20 ` Nick Piggin
@ 2008-05-02 1:33 ` Linus Torvalds
2008-05-02 1:43 ` Nick Piggin
0 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2008-05-02 1:33 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Benjamin Herrenschmidt
On Fri, 2 May 2008, Nick Piggin wrote:
>
> I guess it is possible. But at least in the case of write address, you'd
> have to wait for later stores anyway in order to do the alias detection,
> which might be the most common case.
No, just the *address*. The data for the second store may not be ready,
but the address may have been resolved (and checked that it doesn't fault
etc) and the previous store may complete.
Linus
* Re: [rfc] data race in page table setup/walking?
2008-05-02 1:33 ` Linus Torvalds
@ 2008-05-02 1:43 ` Nick Piggin
0 siblings, 0 replies; 20+ messages in thread
From: Nick Piggin @ 2008-05-02 1:43 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, linux-arch, Linux Memory Management List,
Benjamin Herrenschmidt
On Thu, May 01, 2008 at 06:33:45PM -0700, Linus Torvalds wrote:
>
>
> On Fri, 2 May 2008, Nick Piggin wrote:
> >
> > I guess it is possible. But at least in the case of write address, you'd
> > have to wait for later stores anyway in order to do the alias detection,
> > which might be the most common case.
>
> No, just the *address*. The data for the second store may not be ready,
> but the address may have been resolved (and checked that it doesn't fault
> etc) and the previous store may complete.
Yes, in the case of other dependencies I agreed that it would be possible.
In the case of just the address it doesn't really make sense.
end of thread [~2008-05-02 1:43 UTC | newest]
Thread overview: 20+ messages
2008-04-29 5:00 [rfc] data race in page table setup/walking? Nick Piggin
2008-04-29 5:08 ` Benjamin Herrenschmidt
2008-04-29 5:41 ` Nick Piggin
2008-04-29 10:56 ` David Miller, Nick Piggin
2008-04-29 12:36 ` Hugh Dickins
2008-04-29 21:37 ` Benjamin Herrenschmidt
2008-04-29 22:47 ` Hugh Dickins
2008-04-30 0:09 ` Benjamin Herrenschmidt
2008-04-30 6:03 ` Nick Piggin
2008-04-30 6:05 ` David Miller, Nick Piggin
2008-04-30 6:17 ` Nick Piggin
2008-04-30 11:14 ` Hugh Dickins
2008-05-01 0:35 ` Nick Piggin
2008-05-01 12:45 ` Hugh Dickins
2008-04-30 15:53 ` Linus Torvalds
2008-05-01 0:29 ` Nick Piggin
2008-05-01 3:24 ` Linus Torvalds
2008-05-02 1:20 ` Nick Piggin
2008-05-02 1:33 ` Linus Torvalds
2008-05-02 1:43 ` Nick Piggin