* [PATCH] workaround for lost dirty bits on x86 SMP
@ 2000-09-12 0:43 Ben LaHaise
2000-09-12 0:59 ` Kanoj Sarcar
2000-09-15 19:56 ` Linus Torvalds
0 siblings, 2 replies; 10+ messages in thread
From: Ben LaHaise @ 2000-09-12 0:43 UTC (permalink / raw)
To: linux-mm; +Cc: torvalds
The patch below is one means of working around the lost dirty bit problem
on x86 SMP. If possible, I'd like to see this tested in 2.4 as it would
be the least intrusive fix for 2.2. The idea is simple and comes from the
way RISC processors deal with ptes in Linux: maintain the writable flag in
one of the system bits in the pte and only set the hardware writable bit
when the dirty bit is set. This way we get a page fault when the processor
wants to set the dirty bit, which causes the needed serialization to
occur. Without the patch, dirty state can be lost in places like
filemap.c:filemap_sync_pte, where pte_clear() can race with another CPU
setting the dirty bit, leading to invalid data in the page cache.
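For a pte that is logically writable but has only the soft bit set, the
resulting write fault just has to turn the hardware bits back on; roughly
(a sketch of the shape of 2.4's handle_pte_fault() for a present pte, not
the exact code):

        /* Sketch only: the write fault taken because _PAGE_W is clear.
         * pte_write() tests the soft bit; pte_mkdirty() sets both the
         * hardware dirty and hardware RW bits, as in the patch below.
         */
        if (write_access && pte_write(pte)) {
                set_pte(page_table, pte_mkyoung(pte_mkdirty(pte)));
                flush_tlb_page(vma, address);
        }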
-ben (2.2 patch to follow)
-----snip: v2_4_0_test8__x86_smp_dirty.diff---------
diff -urN kernels/v2.4.0-test8/arch/i386/kernel/process.c work-v2.4.0-test8/arch/i386/kernel/process.c
--- kernels/v2.4.0-test8/arch/i386/kernel/process.c Mon Sep 11 13:13:50 2000
+++ work-v2.4.0-test8/arch/i386/kernel/process.c Mon Sep 11 19:54:24 2000
@@ -281,7 +281,7 @@
/* Make sure the first page is mapped to the start of physical memory.
It is normally not mapped, to trap kernel NULL pointer dereferences. */
- pg0[0] = _PAGE_RW | _PAGE_PRESENT;
+ pg0[0] = _PAGE_RW | _PAGE_W | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_PRESENT;
/*
* Use `swapper_pg_dir' as our page directory.
diff -urN kernels/v2.4.0-test8/arch/i386/mm/ioremap.c work-v2.4.0-test8/arch/i386/mm/ioremap.c
--- kernels/v2.4.0-test8/arch/i386/mm/ioremap.c Tue Aug 8 00:02:27 2000
+++ work-v2.4.0-test8/arch/i386/mm/ioremap.c Mon Sep 11 19:55:35 2000
@@ -28,7 +28,7 @@
printk("remap_area_pte: page already exists\n");
BUG();
}
- set_pte(pte, mk_pte_phys(phys_addr, __pgprot(_PAGE_PRESENT | _PAGE_RW |
+ set_pte(pte, mk_pte_phys(phys_addr, __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_W |
_PAGE_DIRTY | _PAGE_ACCESSED | flags)));
address += PAGE_SIZE;
phys_addr += PAGE_SIZE;
diff -urN kernels/v2.4.0-test8/drivers/char/drm/vm.c work-v2.4.0-test8/drivers/char/drm/vm.c
--- kernels/v2.4.0-test8/drivers/char/drm/vm.c Fri Aug 11 22:14:46 2000
+++ work-v2.4.0-test8/drivers/char/drm/vm.c Mon Sep 11 19:57:15 2000
@@ -302,15 +302,11 @@
if (!capable(CAP_SYS_ADMIN) && (map->flags & _DRM_READ_ONLY)) {
vma->vm_flags &= VM_MAYWRITE;
-#if defined(__i386__)
- pgprot_val(vma->vm_page_prot) &= ~_PAGE_RW;
-#else
/* Ye gads this is ugly. With more thought
we could move this up higher and use
`protection_map' instead. */
vma->vm_page_prot = __pgprot(pte_val(pte_wrprotect(
__pte(pgprot_val(vma->vm_page_prot)))));
-#endif
}
switch (map->type) {
Binary files kernels/v2.4.0-test8/include/asm-i386/.pgtable.h.swp and work-v2.4.0-test8/include/asm-i386/.pgtable.h.swp differ
diff -urN kernels/v2.4.0-test8/include/asm-i386/pgtable.h work-v2.4.0-test8/include/asm-i386/pgtable.h
--- kernels/v2.4.0-test8/include/asm-i386/pgtable.h Wed Aug 23 14:35:07 2000
+++ work-v2.4.0-test8/include/asm-i386/pgtable.h Mon Sep 11 20:07:39 2000
@@ -146,7 +146,7 @@
* memory.
*/
#define _PAGE_PRESENT 0x001
-#define _PAGE_RW 0x002
+#define _PAGE_PHY_RW 0x002
#define _PAGE_USER 0x004
#define _PAGE_PWT 0x008
#define _PAGE_PCD 0x010
@@ -155,11 +155,30 @@
#define _PAGE_PSE 0x080 /* 4 MB (or 2MB) page, Pentium+, if present.. */
#define _PAGE_GLOBAL 0x100 /* Global TLB entry PPro+ */
+#if defined(CONFIG_SMP)
+/* To work around an SMP race which would require us to use
+ * atomic operations to clear *all* page tables which might
+ * have the dirty bit set, we do the following: treat the
+ * physical dirty and writable bits as one entity -- if one
+ * is set, the other *must* be set. That way, if the dirty
+ * bit is cleared, write access is taken away and a fault
+ * with proper serialization (via mmap_sem) will take place.
+ * This is the same thing done on most RISC processors. -ben
+ */
+#define _PAGE_VIR_RW 0x200
+#define _PAGE_W _PAGE_PHY_RW
+#define _PAGE_RW _PAGE_VIR_RW
+
+#else
+#define _PAGE_W 0x000
+#define _PAGE_RW _PAGE_PHY_RW
+#endif
+
#define _PAGE_PROTNONE 0x080 /* If not present */
-#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | _PAGE_DIRTY)
-#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
-#define _PAGE_CHG_MASK (PTE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY)
+#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_W | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_USER)
+#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_W | _PAGE_ACCESSED | _PAGE_DIRTY)
+#define _PAGE_CHG_MASK (PTE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_W)
#define PAGE_NONE __pgprot(_PAGE_PROTNONE | _PAGE_ACCESSED)
#define PAGE_SHARED __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED)
@@ -167,9 +186,9 @@
#define PAGE_READONLY __pgprot(_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED)
#define __PAGE_KERNEL \
- (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED)
+ (_PAGE_PRESENT | _PAGE_RW | _PAGE_W | _PAGE_DIRTY | _PAGE_ACCESSED)
#define __PAGE_KERNEL_NOCACHE \
- (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_PCD | _PAGE_ACCESSED)
+ (_PAGE_PRESENT | _PAGE_RW | _PAGE_W | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_PCD)
#define __PAGE_KERNEL_RO \
(_PAGE_PRESENT | _PAGE_DIRTY | _PAGE_ACCESSED)
@@ -260,12 +279,12 @@
extern inline pte_t pte_rdprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; }
extern inline pte_t pte_exprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; }
-extern inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; }
+extern inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~(_PAGE_DIRTY | _PAGE_W))); return pte; }
extern inline pte_t pte_mkold(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_ACCESSED)); return pte; }
-extern inline pte_t pte_wrprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_RW)); return pte; }
+extern inline pte_t pte_wrprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~(_PAGE_RW | _PAGE_W))); return pte; }
extern inline pte_t pte_mkread(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_USER)); return pte; }
extern inline pte_t pte_mkexec(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_USER)); return pte; }
-extern inline pte_t pte_mkdirty(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY)); return pte; }
+extern inline pte_t pte_mkdirty(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY | _PAGE_W)); return pte; }
extern inline pte_t pte_mkyoung(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; }
extern inline pte_t pte_mkwrite(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; }
* Re: [PATCH] workaround for lost dirty bits on x86 SMP
2000-09-12 0:43 [PATCH] workaround for lost dirty bits on x86 SMP Ben LaHaise
@ 2000-09-12 0:59 ` Kanoj Sarcar
2000-09-12 1:36 ` bcrl
2000-09-15 19:56 ` Linus Torvalds
1 sibling, 1 reply; 10+ messages in thread
From: Kanoj Sarcar @ 2000-09-12 0:59 UTC (permalink / raw)
To: Ben LaHaise; +Cc: linux-mm, torvalds
>
> The patch below is one means of working around the lost dirty bit problem
> on x86 SMP. If possible, I'd like to see this tested in 2.4 as it would
> be the least intrusive fix for 2.2. The idea is simple and comes from the
> way RISC processors deal with ptes in Linux: maintain the writable flag in
> one of the system bits in the pte and only set the hardware writable bit
> when the dirty bit is set. This way we get a page fault when the processor
> wants to set the dirty bit, which causes the needed serialization to
> occur. Without the patch, dirty state can be lost in places like
> filemap.c:filemap_sync_pte, where pte_clear() can race with another CPU
> setting the dirty bit, leading to invalid data in the page cache.
>
> -ben (2.2 patch to follow)
Ben,
Could you describe the race you are trying to fix?
One of the worst races is in the page stealing path, when the stealer
thread checks whether the page is dirty, decides to pte_clear(), and
right then, the user dirties the pte, before the stealer thread has done
the flush_tlb. Are you trying to handle this situation?
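In code terms, the window is roughly this (a sketch, not the exact
vmscan/filemap code):

        pte_t pte = *ptep;      /* CPU 0 reads the pte; dirty bit clear     */
        if (pte_dirty(pte))
                /* not taken */ ;
                                /* CPU 1 writes to the page here: its MMU
                                 * sets the dirty bit directly in *ptep,
                                 * without faulting, since the pte/TLB
                                 * still say read-write                     */
        pte_clear(ptep);        /* CPU 0 overwrites the pte: dirty bit lost */
        flush_tlb_page(vma, address);   /* too late, the write already hit  */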
FWIW, previous notes/patches on this topic can be found at
http://reality.sgi.com/kanoj_engr/smppte.patch
and this also tries to handle the filemap cases.
I _think_ that with your patch, the page fault rate would go up, so
it would be appropriate to generate some benchmark numbers.
I would be willing to port the patch on the web page to 2.4, but
thus far, my impression is that Linus is not happy with its
implementation ...
Kanoj
>
> -----snip: v2_4_0_test8__x86_smp_dirty.diff---------
> diff -urN kernels/v2.4.0-test8/arch/i386/kernel/process.c work-v2.4.0-test8/arch/i386/kernel/process.c
> --- kernels/v2.4.0-test8/arch/i386/kernel/process.c Mon Sep 11 13:13:50 2000
> +++ work-v2.4.0-test8/arch/i386/kernel/process.c Mon Sep 11 19:54:24 2000
> @@ -281,7 +281,7 @@
> /* Make sure the first page is mapped to the start of physical memory.
> It is normally not mapped, to trap kernel NULL pointer dereferences. */
>
> - pg0[0] = _PAGE_RW | _PAGE_PRESENT;
> + pg0[0] = _PAGE_RW | _PAGE_W | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_PRESENT;
>
> /*
> * Use `swapper_pg_dir' as our page directory.
> diff -urN kernels/v2.4.0-test8/arch/i386/mm/ioremap.c work-v2.4.0-test8/arch/i386/mm/ioremap.c
> --- kernels/v2.4.0-test8/arch/i386/mm/ioremap.c Tue Aug 8 00:02:27 2000
> +++ work-v2.4.0-test8/arch/i386/mm/ioremap.c Mon Sep 11 19:55:35 2000
> @@ -28,7 +28,7 @@
> printk("remap_area_pte: page already exists\n");
> BUG();
> }
> - set_pte(pte, mk_pte_phys(phys_addr, __pgprot(_PAGE_PRESENT | _PAGE_RW |
> + set_pte(pte, mk_pte_phys(phys_addr, __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_W |
> _PAGE_DIRTY | _PAGE_ACCESSED | flags)));
> address += PAGE_SIZE;
> phys_addr += PAGE_SIZE;
> diff -urN kernels/v2.4.0-test8/drivers/char/drm/vm.c work-v2.4.0-test8/drivers/char/drm/vm.c
> --- kernels/v2.4.0-test8/drivers/char/drm/vm.c Fri Aug 11 22:14:46 2000
> +++ work-v2.4.0-test8/drivers/char/drm/vm.c Mon Sep 11 19:57:15 2000
> @@ -302,15 +302,11 @@
>
> if (!capable(CAP_SYS_ADMIN) && (map->flags & _DRM_READ_ONLY)) {
> vma->vm_flags &= VM_MAYWRITE;
> -#if defined(__i386__)
> - pgprot_val(vma->vm_page_prot) &= ~_PAGE_RW;
> -#else
> /* Ye gads this is ugly. With more thought
> we could move this up higher and use
> `protection_map' instead. */
> vma->vm_page_prot = __pgprot(pte_val(pte_wrprotect(
> __pte(pgprot_val(vma->vm_page_prot)))));
> -#endif
> }
>
> switch (map->type) {
> Binary files kernels/v2.4.0-test8/include/asm-i386/.pgtable.h.swp and work-v2.4.0-test8/include/asm-i386/.pgtable.h.swp differ
> diff -urN kernels/v2.4.0-test8/include/asm-i386/pgtable.h work-v2.4.0-test8/include/asm-i386/pgtable.h
> --- kernels/v2.4.0-test8/include/asm-i386/pgtable.h Wed Aug 23 14:35:07 2000
> +++ work-v2.4.0-test8/include/asm-i386/pgtable.h Mon Sep 11 20:07:39 2000
> @@ -146,7 +146,7 @@
> * memory.
> */
> #define _PAGE_PRESENT 0x001
> -#define _PAGE_RW 0x002
> +#define _PAGE_PHY_RW 0x002
> #define _PAGE_USER 0x004
> #define _PAGE_PWT 0x008
> #define _PAGE_PCD 0x010
> @@ -155,11 +155,30 @@
> #define _PAGE_PSE 0x080 /* 4 MB (or 2MB) page, Pentium+, if present.. */
> #define _PAGE_GLOBAL 0x100 /* Global TLB entry PPro+ */
>
> +#if defined(CONFIG_SMP)
> +/* To work around an SMP race which would require us to use
> + * atomic operations to clear *all* page tables which might
> + * have the dirty bit set, we do the following: treat the
> + * physical dirty and writable bits as one entity -- if one
> + * is set, the other *must* be set. That way, if the dirty
> + * bit is cleared, write access is taken away and a fault
> + * with proper serialization (via mmap_sem) will take place.
> + * This is the same thing done on most RISC processors. -ben
> + */
> +#define _PAGE_VIR_RW 0x200
> +#define _PAGE_W _PAGE_PHY_RW
> +#define _PAGE_RW _PAGE_VIR_RW
> +
> +#else
> +#define _PAGE_W 0x000
> +#define _PAGE_RW _PAGE_PHY_RW
> +#endif
> +
> #define _PAGE_PROTNONE 0x080 /* If not present */
>
> -#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | _PAGE_DIRTY)
> -#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
> -#define _PAGE_CHG_MASK (PTE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY)
> +#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_W | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_USER)
> +#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_W | _PAGE_ACCESSED | _PAGE_DIRTY)
> +#define _PAGE_CHG_MASK (PTE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_W)
>
> #define PAGE_NONE __pgprot(_PAGE_PROTNONE | _PAGE_ACCESSED)
> #define PAGE_SHARED __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED)
> @@ -167,9 +186,9 @@
> #define PAGE_READONLY __pgprot(_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED)
>
> #define __PAGE_KERNEL \
> - (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED)
> + (_PAGE_PRESENT | _PAGE_RW | _PAGE_W | _PAGE_DIRTY | _PAGE_ACCESSED)
> #define __PAGE_KERNEL_NOCACHE \
> - (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_PCD | _PAGE_ACCESSED)
> + (_PAGE_PRESENT | _PAGE_RW | _PAGE_W | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_PCD)
> #define __PAGE_KERNEL_RO \
> (_PAGE_PRESENT | _PAGE_DIRTY | _PAGE_ACCESSED)
>
> @@ -260,12 +279,12 @@
>
> extern inline pte_t pte_rdprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; }
> extern inline pte_t pte_exprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; }
> -extern inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; }
> +extern inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~(_PAGE_DIRTY | _PAGE_W))); return pte; }
> extern inline pte_t pte_mkold(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_ACCESSED)); return pte; }
> -extern inline pte_t pte_wrprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_RW)); return pte; }
> +extern inline pte_t pte_wrprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~(_PAGE_RW | _PAGE_W))); return pte; }
> extern inline pte_t pte_mkread(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_USER)); return pte; }
> extern inline pte_t pte_mkexec(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_USER)); return pte; }
> -extern inline pte_t pte_mkdirty(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY)); return pte; }
> +extern inline pte_t pte_mkdirty(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY | _PAGE_W)); return pte; }
> extern inline pte_t pte_mkyoung(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; }
> extern inline pte_t pte_mkwrite(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; }
>
>
* Re: [PATCH] workaround for lost dirty bits on x86 SMP
2000-09-12 0:59 ` Kanoj Sarcar
@ 2000-09-12 1:36 ` bcrl
2000-09-12 1:56 ` Rik van Riel
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: bcrl @ 2000-09-12 1:36 UTC (permalink / raw)
To: Kanoj Sarcar; +Cc: linux-mm, torvalds
On Mon, 11 Sep 2000, Kanoj Sarcar wrote:
> One of the worst races is in the page stealing path, when the stealer
> thread checks whether the page is dirty, decides to pte_clear(), and
> right then, the user dirties the pte, before the stealer thread has done
> the flush_tlb. Are you trying to handle this situation?
That's the one. It also crops up in msync, munmap and such.
> FWIW, previous notes/patches on this topic can be found at
>
> http://reality.sgi.com/kanoj_engr/smppte.patch
>
> and this also tries to handle the filemap cases.
Right, that doesn't look so good -- it walks over the memory an extra pass
or two, which is not good.
> I _think_ that with your patch, the page fault rate would go up, so
> it would be appropriate to generate some benchmark numbers.
Yes, the fault rate will go up, but only for clean-but-writable pages that
are written to and only on SMP kernels. We already do this on SPARC
(which the _PAGE_W trick was modeled after) and MIPS, so the overhead
should be reasonably low. Note that we may wish to do this anyways to
keep track of the number of pinned pages in the system.
> I would be willing to port the patch on the web page to 2.4, but
> thus far, my impression is that Linus is not happy with its
> implementation ...
The alternative is to replace pte_clear on a clean but writable pte with
an xchg to get the old pte value. For non-threaded programs the
locked bus cycles will slow things down for no real gain. Another
possibility is to implement real tlb shootdown.
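For the xchg variant, something along these lines would do (a hypothetical
helper for the non-PAE case, not part of the posted patch):

        /* Atomically fetch and clear the pte, so a dirty bit set by
         * another cpu between the read and the clear cannot be lost.
         */
        static inline pte_t pte_get_and_clear(pte_t *ptep)
        {
                pte_t pte;

                pte_val(pte) = xchg(&pte_val(*ptep), 0);
                return pte;
        }

        ...
        pte = pte_get_and_clear(ptep);
        if (pte_dirty(pte)) {
                /* transfer the dirty state to the struct page here */
        }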
Fwiw, with the patch, running a make -j bzImage on a 4 way box does not
seem to have made a difference. A patched run:
bcrl@toolbox linux-v2.4.0-test8]$ time make -j -s bzImage
init.c:74: warning: `get_bad_pmd_table' defined but not used
Root device is (8, 1)
Boot sector 512 bytes.
Setup is 4522 bytes.
System is 873 kB
294.04user 23.81system 1:26.78elapsed 366%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (382013major+542948minor)pagefaults 0swaps
[bcrl@toolbox linux-v2.4.0-test8]$
vs unpatched:
[bcrl@toolbox linux-v2.4.0-test8]$ time make -j -s bzImage
init.c:74: warning: `get_bad_pmd_table' defined but not used
Root device is (8, 1)
Boot sector 512 bytes.
Setup is 4522 bytes.
System is 873 kB
294.19user 23.94system 1:26.88elapsed 366%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (382013major+542947minor)pagefaults 0swaps
[bcrl@toolbox linux-v2.4.0-test8]$
So it's in the noise for this case (hot cache for both runs). It probably
makes a bigger difference for threaded programs in the presence of lots of
msync/memory pressure, but the overhead in tlb shootdown on cleaning the
dirty bit probably dwarfs the extra page fault, not to mention the actual
io taking place. There are other areas that need improving before
write-faults on clean-writable pages make much of a difference.
-ben
* Re: [PATCH] workaround for lost dirty bits on x86 SMP
2000-09-12 1:36 ` bcrl
@ 2000-09-12 1:56 ` Rik van Riel
2000-09-12 2:34 ` Kanoj Sarcar
2000-09-12 10:24 ` Stephen C. Tweedie
2 siblings, 0 replies; 10+ messages in thread
From: Rik van Riel @ 2000-09-12 1:56 UTC (permalink / raw)
To: bcrl; +Cc: Kanoj Sarcar, linux-mm, torvalds
On Mon, 11 Sep 2000 bcrl@redhat.com wrote:
> On Mon, 11 Sep 2000, Kanoj Sarcar wrote:
>
> > One of the worst races is in the page stealing path, when the stealer
> > thread checks whether the page is dirty, decides to pte_clear(), and
> > right then, the user dirties the pte, before the stealer thread has done
> > the flush_tlb. Are you trying to handle this situation?
>
> That's the one. It also crops up in msync, munmap and such.
And (IMHO the worst one) in try_to_swap_out...
 55             if (pte_young(pte)) {
 56                     /*
 57                      * Transfer the "accessed" bit from the page
 58                      * tables to the global page map.
 59                      */
 60                     set_pte(page_table, pte_mkold(pte));
Imagine what would happen if the CPU would mark a page
dirty while we are here...
CPU#0:                                  CPU#1:
(kswapd)                                (user process)

read pte
...
                                        replace TLB entry / write dirty bit
set_pte()

And we've lost the dirty bit ...
regards,
Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000
http://www.conectiva.com/ http://www.surriel.com/
* Re: [PATCH] workaround for lost dirty bits on x86 SMP
2000-09-12 1:36 ` bcrl
2000-09-12 1:56 ` Rik van Riel
@ 2000-09-12 2:34 ` Kanoj Sarcar
2000-09-12 3:39 ` Benjamin C.R. LaHaise
2000-09-12 10:24 ` Stephen C. Tweedie
2 siblings, 1 reply; 10+ messages in thread
From: Kanoj Sarcar @ 2000-09-12 2:34 UTC (permalink / raw)
To: bcrl; +Cc: linux-mm, torvalds
>
> On Mon, 11 Sep 2000, Kanoj Sarcar wrote:
>
> > One of the worst races is in the page stealing path, when the stealer
> > thread checks whether the page is dirty, decides to pte_clear(), and
> > right then, the user dirties the pte, before the stealer thread has done
> > the flush_tlb. Are you trying to handle this situation?
>
> That's the one. It also crops up in msync, munmap and such.
>
> > FWIW, previous notes/patches on this topic can be found at
> >
> > http://reality.sgi.com/kanoj_engr/smppte.patch
> >
> > and this also tries to handle the filemap cases.
>
> Right, that doesn't look so good -- it walks over the memory an extra pass
> or two, which is not good.
Not really, I thought I had it in a state where the bare minimum
was being done. Of course, this was the non-PAE version ...
>
> > I _think_ that with your patch, the page fault rate would go up, so
> > it would be appropriate to generate some benchmark numbers.
>
> Yes, the fault rate will go up, but only for clean-but-writable pages that
> are written to and only on SMP kernels. We already do this on SPARC
> (which the _PAGE_W trick was modeled after) and MIPS, so the overhead
> should be reasonably low. Note that we may wish to do this anyways to
> keep track of the number of pinned pages in the system.
Yes, this approach is mentioned on the web page, along with the
types of apps that will see the greatest hit:
"Note that an alternate solution to the ia32 SMP pte race is to change
PAGE_SHARED in include/asm-i386/pgtable.h to not drop in _PAGE_RW. On a
write fault, mark the pte dirty, as well as drop in the _PAGE_RW bit into
the pte. Basically, behave as if the processor does not have a hardware
dirty bit, and adopt the same strategy as in the alpha/mips code.
Disadvantage - take an extra fault on shm/file mmap'ed pages when a
program accesses the page, *then* dirties it (no change if the first access
is a write)."
>
> > I would be willing to port the patch on the web page to 2.4, but
> > thus far, my impression is that Linus is not happy with its
> > implementation ...
>
> The alternative is to replace pte_clear on a clean but writable pte with
> an xchg to get the old pte value. For non-threaded programs the
> locked bus cycles will slow things down for no real gain. Another
> possibility is to implement real tlb shootdown.
>
> Fwiw, with the patch, running a make -j bzImage on a 4 way box does not
> seem to have made a difference. A patched run:
>
> bcrl@toolbox linux-v2.4.0-test8]$ time make -j -s bzImage
> init.c:74: warning: `get_bad_pmd_table' defined but not used
> Root device is (8, 1)
> Boot sector 512 bytes.
> Setup is 4522 bytes.
> System is 873 kB
> 294.04user 23.81system 1:26.78elapsed 366%CPU (0avgtext+0avgdata
> 0maxresident)k
> 0inputs+0outputs (382013major+542948minor)pagefaults 0swaps
> [bcrl@toolbox linux-v2.4.0-test8]$
>
> vs unpatched:
>
> [bcrl@toolbox linux-v2.4.0-test8]$ time make -j -s bzImage
> init.c:74: warning: `get_bad_pmd_table' defined but not used
> Root device is (8, 1)
> Boot sector 512 bytes.
> Setup is 4522 bytes.
> System is 873 kB
> 294.19user 23.94system 1:26.88elapsed 366%CPU (0avgtext+0avgdata
> 0maxresident)k
> 0inputs+0outputs (382013major+542947minor)pagefaults 0swaps
> [bcrl@toolbox linux-v2.4.0-test8]$
>
> So it's in the noise for this case (hot cache for both runs). It probably
> makes a bigger difference for threaded programs in the presence of lots of
> msync/memory pressure, but the overhead in tlb shootdown on cleaning the
With the above type of apps, you do not need too high a _memory_ pressure
to trigger this, just compute pressure. Each time you come in to drop in
the "dirty" bit, you would need to do establish_pte(), which does a
flush_tlb_page(), which gets costly when you have higher cpu counts.
This of course depends on how smart flush_tlb_page is, and the processor
involved.
Kanoj
> dirty bit probably dwarfs the extra page fault, not to mention the actual
> io taking place. There are other areas that need improving before
> write-faults on clean-writable pages make much of a difference.
>
> -ben
>
>
* Re: [PATCH] workaround for lost dirty bits on x86 SMP
2000-09-12 2:34 ` Kanoj Sarcar
@ 2000-09-12 3:39 ` Benjamin C.R. LaHaise
2000-09-12 6:13 ` Kanoj Sarcar
0 siblings, 1 reply; 10+ messages in thread
From: Benjamin C.R. LaHaise @ 2000-09-12 3:39 UTC (permalink / raw)
To: Kanoj Sarcar; +Cc: linux-mm, torvalds
On Mon, 11 Sep 2000, Kanoj Sarcar wrote:
> Not really, I thought I had it in a state where the bare minimum
> was being done. Of course, this was the non-PAE version ...
PAE already has to do the atomic updates, and could possibly withstand a
couple of tweaks: e.g. pte_clear should be okay if it is a locked clear of
the present bit and non-atomic clear of the high bits.
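Roughly (a sketch only; pte_low/pte_high follow the PAE pte layout, and
the helper name is made up):

        /* Clear the low word first with a locked exchange -- that clears
         * the present bit, so the hardware can no longer set the dirty
         * bit behind our back -- then the high word can be cleared with
         * a plain store.
         */
        static inline pte_t pae_pte_get_and_clear(pte_t *ptep)
        {
                pte_t pte;

                pte.pte_low = xchg(&ptep->pte_low, 0);
                pte.pte_high = ptep->pte_high;
                ptep->pte_high = 0;
                return pte;
        }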
[...]
> With the above type of apps, you do not need too high a _memory_ pressure
> to trigger this, just compute pressure. Each time you come in to drop in
> the "dirty" bit, you would need to do establish_pte(), which does a
> flush_tlb_page(), which gets costly when you have higher cpu counts.
If the process is active, it will be updating the accessed bit on a
regular basis, so there will be no need to clear the dirty bit. Also, we
mark mappings dirty early, so there is no extra fault on the typical case
of writing to data. If the process was sharing the page read only (ie
zero page), it would still have had to do the tlb flush anyways. As for
the tlb flushing in establish_pte, we should be avoiding the cross cpu tlb
flush since any other processor would take a write fault, at which it
would update its tlb. establish_pte is only called from 3 places, two of
which are passing in the same page with either the accessed, dirty or
write bits enabled.
> This of course depends on how smart flush_tlb_page is, and the processor
> involved.
We're only talking about x86 SMP. =) I think it and m68k SMP (which isn't
implemented afair) are the only vulnerable platforms to this problem
(since all others implement dirty bits in software).
-ben
* Re: [PATCH] workaround for lost dirty bits on x86 SMP
2000-09-12 3:39 ` Benjamin C.R. LaHaise
@ 2000-09-12 6:13 ` Kanoj Sarcar
0 siblings, 0 replies; 10+ messages in thread
From: Kanoj Sarcar @ 2000-09-12 6:13 UTC (permalink / raw)
To: Benjamin C.R. LaHaise; +Cc: linux-mm, torvalds
>
> On Mon, 11 Sep 2000, Kanoj Sarcar wrote:
>
> > Not really, I thought I had it in a state where the bare minimum
> > was being done. Of course, this was the non-PAE version ...
>
> PAE already has to do the atomic updates, and could possibly withstand a
> couple of tweaks: e.g. pte_clear should be okay if it is a locked clear of
> the present bit and non-atomic clear of the high bits.
>
> [...]
As I mentioned previously, here is the disadvantage of your
scheme:
"Disadvantage - take an extra fault on shm/file mmap'ed pages when a
program accesses the page, *then* dirties it (no change if the first access
is a write)."
As you mention, I am not sure whether this is the "typical" case; all
I am saying is that we should find out how much degradation this kind
of app would see ... and remember, there are /dev/zero pages and MAP_SHARED|
MAP_ANON pages too. I am not worried that much about what happens when
page stealing starts, I am more worried about incurring costs under normal
circumstances. But I do agree that your patch is functionally correct,
and fixing corruption problems is more important than getting a 1 or 2%
performance edge.
> > With the above type of apps, you do not need too high a _memory_ pressure
> > to trigger this, just compute pressure. Each time you come in to drop in
> > the "dirty" bit, you would need to do establish_pte(), which does a
> > flush_tlb_page(), which gets costly when you have higher cpu counts.
>
> If the process is active, it will be updating the accessed bit on a
> regular basis, so there will be no need to clear the dirty bit. Also, we
> mark mappings dirty early, so there is no extra fault on the typical case
> of writing to data. If the process was sharing the page read only (ie
> zero page), it would still have had to do the tlb flush anyways. As for
> the tlb flushing in establish_pte, we should be avoiding the cross cpu tlb
> flush since any other processor would take a write fault, at which it
> would update its tlb. establish_pte is only called from 3 places, two of
> which are passing in the same page with either the accessed, dirty or
> write bits enabled.
establish_pte is broken anyway, so I would rather not intermix any changes
to establish_pte with another patch. In any case, the pieces of code that
invoke establish_pte are generic; they have to invoke establish_pte so that
things mostly work on all platforms ... or come up with more processor
specific primitives.
Kanoj
>
> > This of course depends on how smart flush_tlb_page is, and the processor
> > involved.
>
> We're only talking about x86 SMP. =) I think it and m68k SMP (which isn't
> implemented afair) are the only vulnerable platforms to this problem
> (since all others implement dirty bits in software).
>
>
* Re: [PATCH] workaround for lost dirty bits on x86 SMP
2000-09-12 1:36 ` bcrl
2000-09-12 1:56 ` Rik van Riel
2000-09-12 2:34 ` Kanoj Sarcar
@ 2000-09-12 10:24 ` Stephen C. Tweedie
2000-09-12 16:54 ` Ben LaHaise
2 siblings, 1 reply; 10+ messages in thread
From: Stephen C. Tweedie @ 2000-09-12 10:24 UTC (permalink / raw)
To: bcrl; +Cc: Kanoj Sarcar, linux-mm, torvalds
Hi,
On Mon, Sep 11, 2000 at 09:36:35PM -0400, bcrl@redhat.com wrote:
> Fwiw, with the patch, running a make -j bzImage on a 4 way box does not
> seem to have made a difference.
Of course it won't, because you aren't testing the new behaviour!
Anonymous pages are always dirty, and shared mmaped pages in
MAP_PRIVATE regions are always clean. The only place where you need
to track the dirty bit dynamically is when you use shared writeable
mmaps --- can you measure a performance change there?
Cheers,
Stephen
* Re: [PATCH] workaround for lost dirty bits on x86 SMP
2000-09-12 10:24 ` Stephen C. Tweedie
@ 2000-09-12 16:54 ` Ben LaHaise
0 siblings, 0 replies; 10+ messages in thread
From: Ben LaHaise @ 2000-09-12 16:54 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Kanoj Sarcar, linux-mm, torvalds
On Tue, 12 Sep 2000, Stephen C. Tweedie wrote:
> Of course it won't, because you aren't testing the new behaviour!
> Anonymous pages are always dirty, and shared mmaped pages in
> MAP_PRIVATE regions are always clean. The only place where you need
> to track the dirty bit dynamically is when you use shared writeable
> mmaps --- can you measure a performance change there?
Here's a more realistic test, and yes it does exact a heavy performance
hit -- reading, then writing, 1 byte of each page in 1GB of pages. Unpatched:
size is 1073741825
addr = 0x4010f000
read fault test: start=29670620301256 stop=29670945360465, elapsed=325059209
write fault test: start=29670945400339 stop=29671024199394, elapsed=78799055
patched:
size is 1073741825
addr = 0x4010f000
read fault test: start=135059091263 stop=135383514415, elapsed=324423152
write fault test: start=135383569394 stop=135664481836, elapsed=280912442
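The test itself is nothing fancy -- roughly this shape (a sketch, assuming
a pre-existing 1GB file mapped MAP_SHARED and rdtsc for the timestamps;
not the exact program used above, and with all error checking omitted):

        #include <stdio.h>
        #include <fcntl.h>
        #include <sys/mman.h>

        static inline unsigned long long rdtsc(void)
        {
                unsigned long long t;
                __asm__ __volatile__("rdtsc" : "=A" (t));
                return t;
        }

        int main(void)
        {
                unsigned long size = 1024UL * 1024 * 1024 + 1;
                int fd = open("bigfile", O_RDWR);       /* placeholder name */
                volatile char *addr = mmap(0, size, PROT_READ | PROT_WRITE,
                                           MAP_SHARED, fd, 0);
                unsigned long long start, stop;
                unsigned long i;
                char sum = 0;

                printf("size is %lu\naddr = %p\n", size, (void *) addr);

                start = rdtsc();
                for (i = 0; i < size; i += 4096)
                        sum += addr[i];         /* read fault on each page */
                stop = rdtsc();
                printf("read fault test: start=%llu stop=%llu, elapsed=%llu\n",
                       start, stop, stop - start);

                start = rdtsc();
                for (i = 0; i < size; i += 4096)
                        addr[i] = sum;          /* clean->dirty write fault */
                stop = rdtsc();
                printf("write fault test: start=%llu stop=%llu, elapsed=%llu\n",
                       start, stop, stop - start);
                return 0;
        }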
So, let's see what can be done to speed up the clean to dirty fault
path... (more later)
-ben
* Re: [PATCH] workaround for lost dirty bits on x86 SMP
2000-09-12 0:43 [PATCH] workaround for lost dirty bits on x86 SMP Ben LaHaise
2000-09-12 0:59 ` Kanoj Sarcar
@ 2000-09-15 19:56 ` Linus Torvalds
1 sibling, 0 replies; 10+ messages in thread
From: Linus Torvalds @ 2000-09-15 19:56 UTC (permalink / raw)
To: Ben LaHaise; +Cc: linux-mm
On Mon, 11 Sep 2000, Ben LaHaise wrote:
>
> The patch below is one means of working around the lost dirty bit problem
> on x86 SMP. If possible, I'd like to see this tested in 2.4 as it would
> be the least intrusive fix for 2.2.
Yes, I think this is the right fix.
I've seriously considered making this an architecturally visible feature:
the reason for that is that we're probably better off knowing how many
dirty pages the user has _anyway_ - just to be able to balance things off
a bit.
So it might be a good idea to make this stuff happen for the UP case too,
instead of trying to optimize it away there..
Have you done any statistics on how many of these "clean->dirty" faults we
get? Because obviously we _will_ fault more. I don't think it happens all
that often (most of them will probably have been COW-faults anyway, the
way Linux handles anonymous pages), but it would be good to see actual
numbers like "increases system page faults by 1%" in order to get more
than just an intuitive feel for it..
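Even something crude would do -- say a global counter bumped in the fault
path at the point where the soft RW bit gets turned into the real one
(hypothetical instrumentation, obviously not part of the patch):

        extern unsigned long nr_clean_dirty_faults;     /* made-up counter */

        /* in the write-fault path, when the pte is logically writable
         * (_PAGE_VIR_RW set) but not yet hardware-writable (_PAGE_W clear):
         */
        if (write_access && pte_write(entry) && !pte_dirty(entry))
                nr_clean_dirty_faults++;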
Linus
Thread overview: 10+ messages
2000-09-12 0:43 [PATCH] workaround for lost dirty bits on x86 SMP Ben LaHaise
2000-09-12 0:59 ` Kanoj Sarcar
2000-09-12 1:36 ` bcrl
2000-09-12 1:56 ` Rik van Riel
2000-09-12 2:34 ` Kanoj Sarcar
2000-09-12 3:39 ` Benjamin C.R. LaHaise
2000-09-12 6:13 ` Kanoj Sarcar
2000-09-12 10:24 ` Stephen C. Tweedie
2000-09-12 16:54 ` Ben LaHaise
2000-09-15 19:56 ` Linus Torvalds