Re: Updated 2.4 TODO List

linux-mm.kvack.org archive mirror

* Re: Updated 2.4 TODO List
       [not found] <200010090419.e994JQT09775@trampoline.thunk.org>
@ 2000-10-10 20:53 ` Rik van Riel
  2000-10-11  0:06   ` 2.4.0test9 vm: disappointing streaming i/o under load Chris Evans
  2000-10-11 18:38   ` Updated 2.4 TODO List tytso
  0 siblings, 2 replies; 20+ messages in thread
From: Rik van Riel @ 2000-10-10 20:53 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, linux-mm

On Mon, 9 Oct 2000 tytso@mit.edu wrote:

> 2. Capable Of Corrupting Your FS/data
> 
>      * Non-atomic page-map operations can cause loss of dirty bit on
>        pages (sct, alan)

Is anybody looking into fixing this bug ?

> 9. To Do
> 
>      * mm->rss is modified in some places without holding the
>        page_table_lock (sct)

Probably not a show-stopper, but we're looking for
volunteers to fix this one anyway ;)

>      * VM: Out of Memory handling {CRITICAL}

Seems to work now, except for the fact that it is possible
to end up with a heavily thrashing system that /just/ didn't
run out of memory and doesn't get anything killed.

Then again, you can end up with a heavily thrashing system
where you can't get anything done without running out of swap
anyway ... the proper fix for this is probably some form of
thrashing control...

>      * VM: Fix the highmem deadlock, where the swapper cannot create low
>        memory bounce buffers OR swap out low memory because it has
>        consumed all resources {CRITICAL} (old bug, already reported in
>        2.4.0test6)

Haven't been able to reproduce it on my 1GB test machine,
but it might still be there. Can anyone confirm if this
bug is still present ?

>      * VM: page->mapping->flush() callback in page_launder() for easier
>        integration with journaling filesystem and maybe the network
>        filesystems

Possibly a 2.5 issue, or something to merge later in 2.4,
since we don't have journaling filesystems in the kernel
anyway. I guess we'll want it for the network filesystems
though.

But this is a fairly simple thing to integrate:
1) have an appropriate function in the filesystems
2) insert function pointer in the right struct
3) call the function from vmscan.c::page_launder()
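
A minimal user-space sketch of the three steps above, assuming the
callback hangs off the page's mapping as the TODO item suggests. The
names address_space_sketch, examplefs_flush and launder_one_page are
hypothetical stand-ins, not kernel code; the real page_launder() call
site would look different.

```c
#include <assert.h>
#include <stddef.h>

struct page;

/* Hypothetical stand-in for the kernel's struct address_space. */
struct address_space_sketch {
	/* Step 2: function pointer the filesystem fills in (may be NULL). */
	int (*flush)(struct page *page);
};

/* Hypothetical stand-in for struct page, reduced to what the sketch needs. */
struct page {
	struct address_space_sketch *mapping;
	int dirty;
};

/* Step 1: a filesystem-side implementation (illustrative only). */
static int examplefs_flush(struct page *page)
{
	page->dirty = 0;	/* pretend the data was written out in order */
	return 0;
}

/* Step 3: roughly what a call from page_launder() could boil down to. */
static int launder_one_page(struct page *page)
{
	assert(page != NULL);
	if (!page->dirty)
		return 0;	/* nothing to write */
	if (page->mapping && page->mapping->flush)
		return page->mapping->flush(page);
	return -1;		/* no callback: caller uses the generic path */
}
```

A filesystem without special ordering needs would simply leave the
pointer NULL and keep the existing writeout path.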

>      * VM: maybe rebalance the swapper a bit... we do page aging now so
>        maybe refill_inactive_scan() / shm_swap() and swap_out() need to
>        be rebalanced a bit

I'll try to look into this (3 days to go before I have to
leave for Miami) and see how things can be improved here.

> 11. To Check
> 
>      * VFS/VM - mmap/write deadlock (demo code seems to show lock is
>        there)

Does anyone have the demo code at hand so we can verify if this
still happens ?

>      * Stressing the VM (IOPS SPEC SFS) with HIGHMEM turned on can hang
>        system (linux-2.4.0test5, Ying Chen, Rik van Riel)

Ditto. Can this still be reproduced with the latest VM or was
it simply a side effect of something else in the VM that got
fixed recently ?

(the highmem code itself looks ok so the bug might well have
been caused by a side effect of something else)

> 12. Probably Post 2.4
> 
>      * address_space needs a VM pressure/flush callback (Ingo)

[duplicate item?]

We may want this to better support the journaling filesystems
in 2.4 .... but I agree that it should probably be post 2.4.0.

>      * VM: physical->virtual reverse mapping, so we can do much better
>        page aging with less CPU usage spikes
>      * VM: better IO clustering for swap (and filesystem) IO
>      * VM: move all the global VM variables, lists, etc. into the pgdat
>        struct for better NUMA scalability
>      * VM: (maybe) some QoS things, as far as they are major improvements
>        with minor intrusion

These 4 seem /definite/ 2.5 issues, though I hope to have them
(except maybe QoS?) ready in a patch before 2.5.0 is split off.

>      * VM: thrashing control, maybe process suspension with some forced
>        swapping ?
>      * VM: include Ben LaHaise's code, which moves readahead to the VMA
>        level, this way we can do streaming swap IO, complete with
>        drop_behind()

These two are fairly simple and may well be done in the next
few weeks. If no bug reports about the current 2.4 VM pop up,
I'll probably look into some of the issues above...

FYI, my personal VM TODO list:
- see if refill_inactive_scan(), shm_swap(), swap_out(), etc...
  need rebalancing
- anti-thrashing code  (if no hidden nasties are present)
- better IO clustering + readahead at VMA level

AFAIK Juan Quintela is already looking into the ->flush()
callback for journaling filesystems.

And one more TODO item:

* pinned page reservation system for journaling filesystems

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


* 2.4.0test9 vm: disappointing streaming i/o under load
  2000-10-10 20:53 ` Updated 2.4 TODO List Rik van Riel
@ 2000-10-11  0:06   ` Chris Evans
  2000-10-11 11:38     ` Eric Lowe
  2000-10-11 18:38   ` Updated 2.4 TODO List tytso
  1 sibling, 1 reply; 20+ messages in thread
From: Chris Evans @ 2000-10-11  0:06 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

Hi,

Finally got round to checking out 2.4.0test9.

Unfortunately, 2.4.0test9 exhibits poor streaming i/o performance when
under a bit of memory pressure.

The test is this: boot with mem=32M, log onto GNOME and start xmms playing
a big .wav ripped from a CD (this requires 100-200k read i/o per second).

Then I started and killed netscape. After that I started a find / and
fired up gnumeric at the same time.

Results
=======

2.2 RH7.0: the music skipped maybe twice briefly during the test.

2.4.0test9: music stuttered repeatedly while netscape started. Worse, when
firing up gnumeric with the find / on the go, there were big pauses in
sound output. One pause was over 5 seconds!!!

So not so hot.

Could this perhaps be related to the drop_behind magic penalizing
streaming i/o pages too much? Perhaps the greater age on the i/o pages
means that when there is a little memory pressure, they are getting thrown
out of the page cache before the app (xmms) gets a chance to use them!

Might it be useful for me to try pre10-1? I note it has more "balancing
fixes".

Cheers
Chris


* Re: 2.4.0test9 vm: disappointing streaming i/o under load
  2000-10-11  0:06   ` 2.4.0test9 vm: disappointing streaming i/o under load Chris Evans
@ 2000-10-11 11:38     ` Eric Lowe
  2000-10-11 20:59       ` Chris Evans
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Lowe @ 2000-10-11 11:38 UTC (permalink / raw)
  To: Chris Evans; +Cc: linux-mm

Hello,

> Finally got round to checking out 2.4.0test9.
> 
> Unfortunately, 2.4.0test9 exhibits poor streaming i/o performance when
> under a bit of memory pressure.
> 
> The test is this: boot with mem=32M, log onto GNOME and start xmms playing
> a big .wav ripped from a CD (this requires 100-200k read i/o per second).
> 
> Then I started and killed netscape. After that I started a find / and
> fired up gnumeric at the same time.

Would you try setting /proc/sys/vm/page-cluster to 8 or 16 and let
me know the results?  I think one _part_ of the problem is that
when the swapper isn't aggressive enough, it causes too much disk
thrashing which gets in the way of normal I/O... my experience
has been that with modern disks with 512K+ cache you have to
write in 64K clusters to get optimum throughput.
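
For reference, a small sketch of the arithmetic behind page-cluster,
assuming (as the 2.4 sysctl documentation describes it) that the value
is the base-2 logarithm of the number of pages handled at once, and
assuming 4 KB pages. cluster_bytes and SKETCH_PAGE_SIZE are
illustrative names, not kernel symbols.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: i386 with 4 KB pages */

/* Bytes per cluster for a given page-cluster value (log2 of pages). */
static unsigned long cluster_bytes(unsigned int page_cluster)
{
	assert(page_cluster < 20);	/* keep the result within 32 bits */
	return (1UL << page_cluster) * SKETCH_PAGE_SIZE;
}
```

Under those assumptions a 64K cluster corresponds to page-cluster = 4,
while 8 gives 1 MB and 16 gives 256 MB per cluster.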

Eric



* Re: 2.4.0test9 vm: disappointing streaming i/o under load
  2000-10-11 11:38     ` Eric Lowe
@ 2000-10-11 20:59       ` Chris Evans
  2000-10-11 22:10         ` Roger Larsson
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Evans @ 2000-10-11 20:59 UTC (permalink / raw)
  To: Eric Lowe; +Cc: linux-mm

On Wed, 11 Oct 2000, Eric Lowe wrote:

> > Unfortunately, 2.4.0test9 exhibits poor streaming i/o performance when
> > under a bit of memory pressure.

[...]

> Would you try setting /proc/sys/vm/page-cluster to 8 or 16 and let
> me know the results?  I think one _part_ of the problem is that
> when the swapper isn't aggressive enough, it causes too much disk
> thrashing which gets in the way of normal I/O... my experience
> has been that with modern disks with 512K+ cache you have to
> write in 64K clusters to get optimum throughput.

Raising the cluster size didn't seem to do much apart from generally slow
down interactive response. Lowering it, however, seemed to make playback
less jittery. I guess that's to be expected; faulting in large chunks of
sequential i/o won't help much when under memory pressure because the
pages will get thrown out again before they get a chance to be
used. Especially with drop_behind.

Rik, what do you think?

Cheers
Chris


* Re: 2.4.0test9 vm: disappointing streaming i/o under load
  2000-10-11 20:59       ` Chris Evans
@ 2000-10-11 22:10         ` Roger Larsson
  2000-10-11 22:46           ` Chris Evans
  0 siblings, 1 reply; 20+ messages in thread
From: Roger Larsson @ 2000-10-11 22:10 UTC (permalink / raw)
  To: Chris Evans; +Cc: linux-mm

Hi,

(you do have DMA enabled...)

I have tested throughput - new kernels are rather good.

I have also tested latency stuff in test9 - I have not
seen anything as bad as your results.
But my audio apps run with high priority...

To help determine the cause, try to renice your audio
daemon (and audio clients):
 renice -10 <pid>


Did it become better?


/RogerL


Chris Evans wrote:
> 
> On Wed, 11 Oct 2000, Eric Lowe wrote:
> 
> > > Unfortunately, 2.4.0test9 exhibits poor streaming i/o performance when
> > > under a bit of memory pressure.
> 
> [...]
> 
> > Would you try setting /proc/sys/vm/page-cluster to 8 or 16 and let
> > me know the results?  I think one _part_ of the problem is that
> > when the swapper isn't aggressive enough, it causes too much disk
> > thrashing which gets in the way of normal I/O... my experience
> > has been that with modern disks with 512K+ cache you have to
> > write in 64K clusters to get optimum throughput.
> 
> Raising the cluster size didn't seem to do much apart from generally slow
> down interactive response. Lowering it, however, seemed to make playback
> less jittery. I guess that's to be expected; faulting in large chunks of
> sequential i/o won't help much when under memory pressure because the
> pages will get thrown out again before they get a chance to be
> used. Especially with drop_behind.
> 
> Rik, what do you think?
> 
> Cheers
> Chris
> 

--
Home page:
  http://www.norran.net/nra02596/

* Re: 2.4.0test9 vm: disappointing streaming i/o under load
  2000-10-11 22:10         ` Roger Larsson
@ 2000-10-11 22:46           ` Chris Evans
  2000-10-13 16:57             ` Rik van Riel
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Evans @ 2000-10-11 22:46 UTC (permalink / raw)
  To: Roger Larsson; +Cc: linux-mm

On Thu, 12 Oct 2000, Roger Larsson wrote:

> Hi,
> 
> (you do have DMA enabled...)

Oh yes (discovering that in fact my chipset is only UDMA33 in the
process).

> I have tested throughput - new kernels are rather good.

I don't doubt it. I'll try and post some numbers on this later.

> I have also tested latency stuff in test9 - I have not
> seen anything as bad as your results.
> But my audio apps run with high priority...
> 
> To help determine the cause, try to renice your audio
> daemon (and audio clients):
>  renice -10 <pid>
> 
> 
> Did it become better?

Not noticeably :-(

Perhaps I'm just asking too much, booting with mem=32M. No point testing
the new VM with a 128MB desktop, though; it wouldn't break a
sweat!

2.2 (RH7.0 kernel) does skip less, though, and the duration of skip is
less.

Perhaps the two kernels have different elevator settings?

Cheers
Chris


* Re: 2.4.0test9 vm: disappointing streaming i/o under load
  2000-10-11 22:46           ` Chris Evans
@ 2000-10-13 16:57             ` Rik van Riel
  0 siblings, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2000-10-13 16:57 UTC (permalink / raw)
  To: Chris Evans; +Cc: Roger Larsson, linux-mm

On Wed, 11 Oct 2000, Chris Evans wrote:

> Perhaps I'm just asking too much, booting with mem=32M. No point
> testing the new VM with a 128MB desktop, though; it wouldn't
> break a sweat!
> 
> 2.2 (RH7.0 kernel) does skip less, though, and the duration of
> skip is less.
> 
> Perhaps the two kernels have different elevator settings?

That too. And you just -might- be catching a boundary condition
of the drop-behind code (if the audio isn't kept mapped by any
of the processes, but is left to sit in a file which is write()n
to by one process and is read() by the other).

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


* Re: Updated 2.4 TODO List
  2000-10-10 20:53 ` Updated 2.4 TODO List Rik van Riel
  2000-10-11  0:06   ` 2.4.0test9 vm: disappointing streaming i/o under load Chris Evans
@ 2000-10-11 18:38   ` tytso
  2000-10-11 23:52     ` [RFC] atomic pte updates for x86 smp Ben LaHaise
  1 sibling, 1 reply; 20+ messages in thread
From: tytso @ 2000-10-11 18:38 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, linux-mm

   > 2. Capable Of Corrupting Your FS/data
   > 
   >      * Non-atomic page-map operations can cause loss of dirty bit on
   >        pages (sct, alan)

   Is anybody looking into fixing this bug ?

According to sct (who's sitting next to me in my hotel room at ALS) Ben
LaHaise has a bugfix for this, but it hasn't been merged.

   >      * VM: Fix the highmem deadlock, where the swapper cannot create low
   >        memory bounce buffers OR swap out low memory because it has
   >        consumed all resources {CRITICAL} (old bug, already reported in
   >        2.4.0test6)

   Haven't been able to reproduce it on my 1GB test machine,
   but it might still be there. Can anyone confirm if this
   bug is still present ?

Note: all of the issues on the TODO list with the "VM:" prefix are from
a VM todo list you posted a week or two ago; so I'm assuming that you
know more about those issues than I do.....  (feel free to send me an
updated list and I'll merge it into the 2.4 TODO list.)

						- Ted


* [RFC] atomic pte updates for x86 smp
  2000-10-11 18:38   ` Updated 2.4 TODO List tytso
@ 2000-10-11 23:52     ` Ben LaHaise
  2000-10-12  0:09       ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Ben LaHaise @ 2000-10-11 23:52 UTC (permalink / raw)
  To: torvalds, tytso; +Cc: linux-kernel, linux-mm

On Wed, 11 Oct 2000 tytso@mit.edu wrote:

>    > 2. Capable Of Corrupting Your FS/data
>    > 
>    >      * Non-atomic page-map operations can cause loss of dirty bit on
>    >        pages (sct, alan)
> 
>    Is anybody looking into fixing this bug ?
> 
> According to sct (who's sitting next to me in my hotel room at ALS) Ben
> LaHaise has a bugfix for this, but it hasn't been merged.

Here's an updated version of the patch that doesn't do the funky RISC-like
dirty bit updates.  It doesn't incur the additional overhead of page
faults on dirty, which actually happens a lot on SHM attaches
(during Oracle runs this is quite noticeable due to their use of
hundreds of MB of SHM).  Ted: Note that there are a couple of other SMP
races that still need fixing: list them under VM threading bug under SMP
(different bug).

		-ben
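
To illustrate the kind of race the patch targets, here is a
single-threaded user-space sketch using C11 atomics rather than the
kernel primitives. other_cpu_writes() stands in for a second CPU's MMU
setting the dirty bit; all SK_ names and both wrprotect_ helpers are
hypothetical. The bit positions follow the _PAGE_BIT_ values in the
patch below.

```c
#include <assert.h>
#include <stdatomic.h>

#define SK_PAGE_RW    (1UL << 1)	/* mirrors _PAGE_BIT_RW */
#define SK_PAGE_DIRTY (1UL << 6)	/* mirrors _PAGE_BIT_DIRTY */

/* Stand-in for another CPU's MMU marking the page dirty on a write. */
static void other_cpu_writes(_Atomic unsigned long *pte)
{
	atomic_fetch_or(pte, SK_PAGE_DIRTY);
}

/* Old pattern: load, modify, store. The other CPU runs in the window. */
static unsigned long wrprotect_nonatomic(_Atomic unsigned long *pte)
{
	unsigned long old;

	assert(pte);
	old = atomic_load(pte);
	other_cpu_writes(pte);			/* race window */
	atomic_store(pte, old & ~SK_PAGE_RW);	/* stale value wins */
	return atomic_load(pte);
}

/* Patched pattern: a single atomic read-modify-write, like clear_bit(). */
static unsigned long wrprotect_atomic(_Atomic unsigned long *pte)
{
	assert(pte);
	other_cpu_writes(pte);			/* before or after, never inside */
	atomic_fetch_and(pte, ~SK_PAGE_RW);	/* only the RW bit changes */
	return atomic_load(pte);
}
```

In the non-atomic variant the dirty bit set in the window is
overwritten by the stale value; in the atomic variant it survives
because only the RW bit is touched.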

# v2.4.0-test10-1-smp_pte_fix.diff
diff -ur v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h work-v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h
--- v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h	Fri Dec  3 14:12:23 1999
+++ work-v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h	Wed Oct 11 16:08:08 2000
@@ -55,4 +55,7 @@
 	return (pmd_t *) dir;
 }
 
+#define __HAVE_ARCH_pte_xchg_clear
+#define pte_xchg_clear(xp)	__pte(xchg(&(xp)->pte, 0))
+
 #endif /* _I386_PGTABLE_2LEVEL_H */
diff -ur v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h work-v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h
--- v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h	Mon Dec  6 19:19:13 1999
+++ work-v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h	Wed Oct 11 16:14:40 2000
@@ -76,4 +76,17 @@
 #define pmd_offset(dir, address) ((pmd_t *) pgd_page(*(dir)) + \
 			__pmd_offset(address))
 
+#define __HAVE_ARCH_pte_xchg_clear
+extern inline pte_t pte_xchg_clear(pte_t *ptep)
+{
+	long long res = pte_val(*ptep);
+__asm__ __volatile__ (
+        "1: cmpxchg8b (%1);
+                jnz 1b"
+        : "=A" (res)
+	:"D"(ptep), "0" (res), "b"(0), "c"(0)
+        : "memory");
+	return (pte_t){ res };
+}
+
 #endif /* _I386_PGTABLE_3LEVEL_H */
diff -ur v2.4.0-test10-pre1/include/asm-i386/pgtable.h work-v2.4.0-test10-pre1/include/asm-i386/pgtable.h
--- v2.4.0-test10-pre1/include/asm-i386/pgtable.h	Mon Oct  2 14:06:43 2000
+++ work-v2.4.0-test10-pre1/include/asm-i386/pgtable.h	Wed Oct 11 17:44:04 2000
@@ -17,6 +17,10 @@
 #include <asm/fixmap.h>
 #include <linux/threads.h>
 
+#ifndef _I386_BITOPS_H
+#include <asm/bitops.h>
+#endif
+
 extern pgd_t swapper_pg_dir[1024];
 extern void paging_init(void);
 
@@ -145,6 +149,16 @@
  * the page directory entry points directly to a 4MB-aligned block of
  * memory. 
  */
+#define _PAGE_BIT_PRESENT	0
+#define _PAGE_BIT_RW		1
+#define _PAGE_BIT_USER		2
+#define _PAGE_BIT_PWT		3
+#define _PAGE_BIT_PCD		4
+#define _PAGE_BIT_ACCESSED	5
+#define _PAGE_BIT_DIRTY		6
+#define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page, Pentium+, if present.. */
+#define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
+
 #define _PAGE_PRESENT	0x001
 #define _PAGE_RW	0x002
 #define _PAGE_USER	0x004
@@ -234,6 +248,24 @@
 #define pte_none(x)	(!pte_val(x))
 #define pte_present(x)	(pte_val(x) & (_PAGE_PRESENT | _PAGE_PROTNONE))
 #define pte_clear(xp)	do { set_pte(xp, __pte(0)); } while (0)
+
+#define __HAVE_ARCH_pte_test_and_clear_dirty
+static inline int pte_test_and_clear_dirty(pte_t *page_table, pte_t pte)
+{
+	return test_and_clear_bit(_PAGE_BIT_DIRTY, page_table);
+}
+
+#define __HAVE_ARCH_pte_test_and_clear_young
+static inline int pte_test_and_clear_young(pte_t *page_table, pte_t pte)
+{
+	return test_and_clear_bit(_PAGE_BIT_ACCESSED, page_table);
+}
+
+#define __HAVE_ARCH_atomic_pte_wrprotect
+static inline void atomic_pte_wrprotect(pte_t *page_table, pte_t old_pte)
+{
+	clear_bit(_PAGE_BIT_RW, page_table);
+}
 
 #define pmd_none(x)	(!pmd_val(x))
 #define pmd_present(x)	(pmd_val(x) & _PAGE_PRESENT)
diff -ur v2.4.0-test10-pre1/include/linux/mm.h work-v2.4.0-test10-pre1/include/linux/mm.h
--- v2.4.0-test10-pre1/include/linux/mm.h	Tue Oct  3 13:40:38 2000
+++ work-v2.4.0-test10-pre1/include/linux/mm.h	Wed Oct 11 17:44:38 2000
@@ -532,6 +532,42 @@
 #define vmlist_modify_lock(mm)		vmlist_access_lock(mm)
 #define vmlist_modify_unlock(mm)	vmlist_access_unlock(mm)
 
+#ifndef __HAVE_ARCH_pte_test_and_clear_young
+static inline int pte_test_and_clear_young(pte_t *page_table, pte_t pte)
+{
+	if (!pte_young(pte))
+		return 0;
+	set_pte(page_table, pte_mkold(pte));
+	return 1;
+}
+#endif
+
+#ifndef __HAVE_ARCH_pte_test_and_clear_dirty
+static inline int pte_test_and_clear_dirty(pte_t *page_table, pte_t pte)
+{
+	if (!pte_dirty(pte))
+		return 0;
+	set_pte(page_table, pte_mkclean(pte));
+	return 1;
+}
+#endif
+
+#ifndef __HAVE_ARCH_pte_xchg_clear
+static pte_t pte_xchg_clear(pte_t *page_table)
+{
+	pte_t pte = *page_table;
+	pte_clear(page_table);
+	return pte;
+}
+#endif
+
+#ifndef __HAVE_ARCH_atomic_pte_wrprotect
+static inline void atomic_pte_wrprotect(pte_t *page_table, pte_t old_pte)
+{
+	set_pte(page_table, pte_wrprotect(old_pte));
+}
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif
diff -ur v2.4.0-test10-pre1/mm/filemap.c work-v2.4.0-test10-pre1/mm/filemap.c
--- v2.4.0-test10-pre1/mm/filemap.c	Tue Oct  3 13:40:38 2000
+++ work-v2.4.0-test10-pre1/mm/filemap.c	Wed Oct 11 18:26:35 2000
@@ -1475,39 +1475,47 @@
 	return retval;
 }
 
+/* Called with mm->page_table_lock held to protect against other
+ * threads/the swapper from ripping pte's out from under us.
+ */
 static inline int filemap_sync_pte(pte_t * ptep, struct vm_area_struct *vma,
 	unsigned long address, unsigned int flags)
 {
 	unsigned long pgoff;
-	pte_t pte = *ptep;
+	pte_t pte;
 	struct page *page;
 	int error;
 
+	pte = *ptep;
+
 	if (!(flags & MS_INVALIDATE)) {
 		if (!pte_present(pte))
-			return 0;
-		if (!pte_dirty(pte))
-			return 0;
+			goto out;
+		if (!pte_test_and_clear_dirty(ptep, pte))
+			goto out;
 		flush_page_to_ram(pte_page(pte));
 		flush_cache_page(vma, address);
-		set_pte(ptep, pte_mkclean(pte));
 		flush_tlb_page(vma, address);
 		page = pte_page(pte);
 		page_cache_get(page);
 	} else {
 		if (pte_none(pte))
-			return 0;
+			goto out;
 		flush_cache_page(vma, address);
-		pte_clear(ptep);
+
+		pte = pte_xchg_clear(ptep);
 		flush_tlb_page(vma, address);
+
 		if (!pte_present(pte)) {
+			spin_unlock(&vma->vm_mm->page_table_lock);
 			swap_free(pte_to_swp_entry(pte));
-			return 0;
+			spin_lock(&vma->vm_mm->page_table_lock);
+			goto out;
 		}
 		page = pte_page(pte);
 		if (!pte_dirty(pte) || flags == MS_INVALIDATE) {
 			page_cache_free(page);
-			return 0;
+			goto out;
 		}
 	}
 	pgoff = (address - vma->vm_start) >> PAGE_CACHE_SHIFT;
@@ -1516,11 +1524,18 @@
 		printk("weirdness: pgoff=%lu index=%lu address=%lu vm_start=%lu vm_pgoff=%lu\n",
 			pgoff, page->index, address, vma->vm_start, vma->vm_pgoff);
 	}
+
+	spin_unlock(&vma->vm_mm->page_table_lock);
 	lock_page(page);
 	error = filemap_write_page(vma->vm_file, page, 1);
 	UnlockPage(page);
 	page_cache_free(page);
+
+	spin_lock(&vma->vm_mm->page_table_lock);
 	return error;
+
+out:
+	return 0;
 }
 
 static inline int filemap_sync_pte_range(pmd_t * pmd,
@@ -1590,6 +1605,11 @@
 	unsigned long end = address + size;
 	int error = 0;
 
+	/* Acquire the lock early; it may be possible to avoid dropping
+	 * and reacquiring it repeatedly.
+	 */
+	spin_lock(&vma->vm_mm->page_table_lock);
+
 	dir = pgd_offset(vma->vm_mm, address);
 	flush_cache_range(vma->vm_mm, end - size, end);
 	if (address >= end)
@@ -1600,6 +1620,9 @@
 		dir++;
 	} while (address && (address < end));
 	flush_tlb_range(vma->vm_mm, end - size, end);
+
+	spin_unlock(&vma->vm_mm->page_table_lock);
+
 	return error;
 }
 
diff -ur v2.4.0-test10-pre1/mm/highmem.c work-v2.4.0-test10-pre1/mm/highmem.c
--- v2.4.0-test10-pre1/mm/highmem.c	Tue Oct 10 16:57:31 2000
+++ work-v2.4.0-test10-pre1/mm/highmem.c	Tue Oct 10 18:13:44 2000
@@ -130,10 +130,10 @@
 		if (pkmap_count[i] != 1)
 			continue;
 		pkmap_count[i] = 0;
-		pte = pkmap_page_table[i];
+		//pte = pkmap_page_table[i]; pte_clear(pkmap_page_table+i);
+		pte = pte_xchg_clear(pkmap_page_table+i);
 		if (pte_none(pte))
 			BUG();
-		pte_clear(pkmap_page_table+i);
 		page = pte_page(pte);
 		page->virtual = NULL;
 	}
diff -ur v2.4.0-test10-pre1/mm/memory.c work-v2.4.0-test10-pre1/mm/memory.c
--- v2.4.0-test10-pre1/mm/memory.c	Tue Oct  3 13:40:38 2000
+++ work-v2.4.0-test10-pre1/mm/memory.c	Wed Oct 11 18:30:17 2000
@@ -215,30 +215,30 @@
 				/* copy_one_pte */
 
 				if (pte_none(pte))
-					goto cont_copy_pte_range;
+					goto cont_copy_pte_range_noset;
 				if (!pte_present(pte)) {
 					swap_duplicate(pte_to_swp_entry(pte));
-					set_pte(dst_pte, pte);
 					goto cont_copy_pte_range;
 				}
 				ptepage = pte_page(pte);
 				if ((!VALID_PAGE(ptepage)) || 
-				    PageReserved(ptepage)) {
-					set_pte(dst_pte, pte);
+				    PageReserved(ptepage))
 					goto cont_copy_pte_range;
-				}
+
 				/* If it's a COW mapping, write protect it both in the parent and the child */
 				if (cow) {
-					pte = pte_wrprotect(pte);
-					set_pte(src_pte, pte);
+					atomic_pte_wrprotect(src_pte, pte);
+					pte = *src_pte;
 				}
+
 				/* If it's a shared mapping, mark it clean in the child */
 				if (vma->vm_flags & VM_SHARED)
 					pte = pte_mkclean(pte);
-				set_pte(dst_pte, pte_mkold(pte));
+				pte = pte_mkold(pte);
 				get_page(ptepage);
-			
-cont_copy_pte_range:		address += PAGE_SIZE;
+
+cont_copy_pte_range:		set_pte(dst_pte, pte);
+cont_copy_pte_range_noset:	address += PAGE_SIZE;
 				if (address >= end)
 					goto out;
 				src_pte++;
@@ -306,10 +306,9 @@
 		pte_t page;
 		if (!size)
 			break;
-		page = *pte;
+		page = pte_xchg_clear(pte);
 		pte++;
 		size--;
-		pte_clear(pte-1);
 		if (pte_none(page))
 			continue;
 		freed += free_pte(page);
@@ -712,8 +711,8 @@
 		end = PMD_SIZE;
 	do {
 		struct page *page;
-		pte_t oldpage = *pte;
-		pte_clear(pte);
+		pte_t oldpage;
+		oldpage = pte_xchg_clear(pte);
 
 		page = virt_to_page(__va(phys_addr));
 		if ((!VALID_PAGE(page)) || PageReserved(page))
@@ -746,6 +745,7 @@
 	return 0;
 }
 
+/*  Note: this is only safe if the mm semaphore is held when called. */
 int remap_page_range(unsigned long from, unsigned long phys_addr, unsigned long size, pgprot_t prot)
 {
 	int error = 0;
diff -ur v2.4.0-test10-pre1/mm/mremap.c work-v2.4.0-test10-pre1/mm/mremap.c
--- v2.4.0-test10-pre1/mm/mremap.c	Tue Oct  3 13:40:38 2000
+++ work-v2.4.0-test10-pre1/mm/mremap.c	Wed Oct 11 02:38:41 2000
@@ -63,14 +63,14 @@
 	pte_t pte;
 
 	spin_lock(&mm->page_table_lock);
-	pte = *src;
+	pte = pte_xchg_clear(src);
 	if (!pte_none(pte)) {
-		error++;
-		if (dst) {
-			pte_clear(src);
-			set_pte(dst, pte);
-			error--;
+		if (!dst) {
+			/* No dest?  We must put it back. */
+			dst = src;
+			error++;
 		}
+		set_pte(dst, pte);
 	}
 	spin_unlock(&mm->page_table_lock);
 	return error;
diff -ur v2.4.0-test10-pre1/mm/vmalloc.c work-v2.4.0-test10-pre1/mm/vmalloc.c
--- v2.4.0-test10-pre1/mm/vmalloc.c	Tue Oct  3 13:40:38 2000
+++ work-v2.4.0-test10-pre1/mm/vmalloc.c	Wed Oct 11 16:38:21 2000
@@ -34,14 +34,15 @@
 	if (end > PMD_SIZE)
 		end = PMD_SIZE;
 	do {
-		pte_t page = *pte;
-		pte_clear(pte);
+		pte_t page;
+		page = pte_xchg_clear(pte);
 		address += PAGE_SIZE;
 		pte++;
 		if (pte_none(page))
 			continue;
 		if (pte_present(page)) {
 			struct page *ptpage = pte_page(page);
+			/* FIXME: i am an ugly little race condition */
 			if (VALID_PAGE(ptpage) && (!PageReserved(ptpage)))
 				__free_page(ptpage);
 			continue;
diff -ur v2.4.0-test10-pre1/mm/vmscan.c work-v2.4.0-test10-pre1/mm/vmscan.c
--- v2.4.0-test10-pre1/mm/vmscan.c	Tue Oct 10 16:57:31 2000
+++ work-v2.4.0-test10-pre1/mm/vmscan.c	Wed Oct 11 18:17:17 2000
@@ -55,8 +55,7 @@
 
 	onlist = PageActive(page);
 	/* Don't look at this pte if it's been accessed recently. */
-	if (pte_young(pte)) {
-		set_pte(page_table, pte_mkold(pte));
+	if (pte_test_and_clear_young(page_table, pte)) {
 		if (onlist) {
 			/*
 			 * Transfer the "accessed" bit from the page
@@ -99,6 +98,10 @@
 	if (PageSwapCache(page)) {
 		entry.val = page->index;
 		swap_duplicate(entry);
+		if (pte_dirty(pte))
+			BUG();
+		if (pte_write(pte))
+			BUG();
 		set_pte(page_table, swp_entry_to_pte(entry));
 drop_pte:
 		UnlockPage(page);
@@ -109,6 +112,13 @@
 		goto out_failed;
 	}
 
+	/* From this point on, the odds are that we're going to
+	 * nuke this pte, so read and clear the pte.  This hook
+	 * is needed on CPUs which update the accessed and dirty
+	 * bits in hardware.
+	 */
+	pte = pte_xchg_clear(page_table);
+
 	/*
 	 * Is it a clean page? Then it must be recoverable
 	 * by just paging it in again, and we can just drop
@@ -124,7 +134,6 @@
 	 */
 	if (!pte_dirty(pte)) {
 		flush_cache_page(vma, address);
-		pte_clear(page_table);
 		goto drop_pte;
 	}
 
@@ -134,7 +143,7 @@
 	 * locks etc.
 	 */
 	if (!(gfp_mask & __GFP_IO))
-		goto out_unlock;
+		goto out_unlock_restore;
 
 	/*
 	 * Don't do any of the expensive stuff if
@@ -143,7 +152,7 @@
 	if (page->zone->free_pages + page->zone->inactive_clean_pages
 					+ page->zone->inactive_dirty_pages
 		      	> page->zone->pages_high + inactive_target)
-		goto out_unlock;
+		goto out_unlock_restore;
 
 	/*
 	 * Ok, it's really dirty. That means that
@@ -169,7 +178,7 @@
 		int error;
 		struct file *file = vma->vm_file;
 		if (file) get_file(file);
-		pte_clear(page_table);
+
 		mm->rss--;
 		flush_tlb_page(vma, address);
 		vmlist_access_unlock(mm);
@@ -191,10 +200,12 @@
 	 */
 	entry = get_swap_page();
 	if (!entry.val)
-		goto out_unlock; /* No swap space left */
+		goto out_unlock_restore; /* No swap space left */
 
-	if (!(page = prepare_highmem_swapout(page)))
+	if (!(page = prepare_highmem_swapout(page))) {
+		set_pte(page_table, pte);
 		goto out_swap_free;
+	}
 
 	swap_duplicate(entry);	/* One for the process, one for the swap cache */
 
@@ -218,7 +229,8 @@
 	swap_free(entry);
 out_failed:
 	return 0;
-out_unlock:
+out_unlock_restore:
+	set_pte(page_table, pte);
 	UnlockPage(page);
 	return 0;
 }


* Re: [RFC] atomic pte updates for x86 smp
  2000-10-11 23:52     ` [RFC] atomic pte updates for x86 smp Ben LaHaise
@ 2000-10-12  0:09       ` Linus Torvalds
  2000-10-12  4:03         ` Benjamin C.R. LaHaise
  0 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2000-10-12  0:09 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: tytso, linux-kernel, linux-mm


On Wed, 11 Oct 2000, Ben LaHaise wrote:
> 
> Here's an updated version of the patch that doesn't do the funky RISC-like
> dirty bit updates.  It doesn't incur the additional overhead of page
> faults on dirty, which actually happens a lot on SHM attaches
> (during Oracle runs this is quite noticeable due to their use of
> hundreds of MB of SHM).

I much preferred the dirty fault version.

What does "quite noticeable" mean? Does it mean that you can see page
faults (no big deal), or does it mean that you can actually measure the
performance degradation objectively?

Also, this version doesn't seem to fix the bug.

> diff -ur v2.4.0-test10-pre1/mm/vmscan.c work-v2.4.0-test10-pre1/mm/vmscan.c
> --- v2.4.0-test10-pre1/mm/vmscan.c	Tue Oct 10 16:57:31 2000
> +++ work-v2.4.0-test10-pre1/mm/vmscan.c	Wed Oct 11 18:17:17 2000
> @@ -134,7 +143,7 @@
>  	 * locks etc.
>  	 */
>  	if (!(gfp_mask & __GFP_IO))
> -		goto out_unlock;
> +		goto out_unlock_restore;
>  
>  	/*
>  	 * Don't do any of the expensive stuff if
> @@ -143,7 +152,7 @@
>  	if (page->zone->free_pages + page->zone->inactive_clean_pages
>  					+ page->zone->inactive_dirty_pages
>  		      	> page->zone->pages_high + inactive_target)
> -		goto out_unlock;
> +		goto out_unlock_restore;
>  
>  	/*
>  	 * Ok, it's really dirty. That means that

Both of the above paths can cause the dirty bit to be dropped again, as
far as I can see.

In fact, you seem to have _added_ those drops in this patch. What's up?

I'm not going to apply a patch that I don't see will even fix the problem
at this point.

I _will_ apply the "exception on dirty" version, if you remove the SMP
special case (ie you do it unconditionally). At least that one I believe
really fixes the problem.

		Linus


* Re: [RFC] atomic pte updates for x86 smp
  2000-10-12  0:09       ` Linus Torvalds
@ 2000-10-12  4:03         ` Benjamin C.R. LaHaise
  2000-10-12  4:06           ` David S. Miller
  2000-10-12  6:42           ` Linus Torvalds
  0 siblings, 2 replies; 20+ messages in thread
From: Benjamin C.R. LaHaise @ 2000-10-12  4:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: tytso, linux-kernel, linux-mm

Hello Linus,

On Wed, 11 Oct 2000, Linus Torvalds wrote:

> I much preferred the dirty fault version.

> What does "quite noticeable" mean? Does it mean that you can see page
> faults (no big deal), or does it mean that you can actually measure the
> performance degradation objectively?

It's a factor of 4 difference in execution time on the filemap rewrite
test on a 1GB file (including all those cache misses that should have
dwarfed the page fault handler). Moving the writable test and mkdirty
early on in the page fault handler made no measurable difference in
execution time; the bulk of the overhead appears to be in handling the
page fault itself.

> Also, this version doesn't seem to fix the bug.
...
> Both of the above paths can cause the dirty bit to be dropped again, as
> far as I can see.

Note the fragment above those portions of the patch where the
pte_xchg_clear is done on the page table: this results in a page fault
for any other cpu that looks at the pte while it is unavailable.

> In fact, you seem to have _added_ those drops in this patch. What's up?

It's safe because of how x86 hardware works when it encounters the
cleared pte.  According to one of the manuals I've got here (the old 386
book is the only one that states it outright, sigh), the access and dirty
bits are updated with a locked memory cycle only if the entry is marked
present.  If you want test code demonstrating that x86 does a reread of
the pte on a dirty fault, I'll gladly share it.
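
A minimal user-space sketch of the get-and-clear idea described above,
written with C11 atomics rather than kernel primitives.  The bit values
and the names pte_slot_t, pte_get_and_clear and pte_restore are
illustrative assumptions, not the code from the patch.  The point is only
that the exchange hands back whatever dirty bit was set at the instant the
entry became not-present, so restoring the saved value cannot lose it:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flag layout, chosen to mirror the x86 PTE bit positions. */
#define PTE_PRESENT 0x001u
#define PTE_RW      0x002u
#define PTE_DIRTY   0x040u

typedef _Atomic uint32_t pte_slot_t;

/*
 * Atomically fetch the old entry and leave the slot not-present.
 * A dirty bit set before the exchange is returned to the caller;
 * after the exchange there is no present entry left to update.
 */
static uint32_t pte_get_and_clear(pte_slot_t *slot)
{
	return atomic_exchange(slot, 0);
}

/* Put the saved entry back, for example on a bail-out path. */
static void pte_restore(pte_slot_t *slot, uint32_t saved)
{
	atomic_store(slot, saved);
}
```

A caller would save the return value of pte_get_and_clear(), decide what
to do with the page, and call pte_restore() with the saved value on any
path that gives up.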

> I'm not going to apply a patch that I don't see will even fix the problem
> at this point.
> 
> I _will_ apply the "exception on dirty" version, if you remove the SMP
> special case (ie you do it unconditionally). At least that one I believe
> really fixes the problem.

I'd rather not lose the use of a hardware feature that makes a difference
during the most important time: when the system is under heavy load and
the page table scanner is active.  If there's a way the atomic updates can
be cleaned up acceptably, then I want to do so.  Cheers,

		-ben


* Re: [RFC] atomic pte updates for x86 smp
  2000-10-12  4:03         ` Benjamin C.R. LaHaise
@ 2000-10-12  4:06           ` David S. Miller
  2000-10-12  4:31             ` Cort Dougan
  2000-10-12  4:37             ` Benjamin C.R. LaHaise
  2000-10-12  6:42           ` Linus Torvalds
  1 sibling, 2 replies; 20+ messages in thread
From: David S. Miller @ 2000-10-12  4:06 UTC (permalink / raw)
  To: blah; +Cc: torvalds, tytso, linux-kernel, linux-mm

   It's safe because of how x86 hardware works

What about other platforms?

Later,
David S. Miller
davem@redhat.com

* Re: [RFC] atomic pte updates for x86 smp
  2000-10-12  4:06           ` David S. Miller
@ 2000-10-12  4:31             ` Cort Dougan
  2000-10-12  4:37             ` Benjamin C.R. LaHaise
  1 sibling, 0 replies; 20+ messages in thread
From: Cort Dougan @ 2000-10-12  4:31 UTC (permalink / raw)
  To: David S. Miller; +Cc: blah, torvalds, tytso, linux-kernel, linux-mm

}    Date: 	Thu, 12 Oct 2000 00:03:31 -0400 (EDT)
}    From: "Benjamin C.R. LaHaise" <blah@kvack.org>
} 
}    It's safe because of how x86 hardware works
} 
} What about other platforms?

On the PPCs that don't do a hardware walk we do a normal write to the
hash table (with a spinlock).  On the hardware-walk PPCs I'm told this is
done with a lwarx/stwcx pair (conditional load/store on exclusive
access).
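
A rough sketch of the load-reserve/store-conditional pattern, assuming C11
atomics in user space.  On PowerPC a compare-exchange loop like this is
typically compiled to a lwarx/stwcx. retry loop.  The flag values and the
name pte_mark_dirty_if_present are hypothetical and do not come from any
PPC kernel code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for illustration only. */
#define PTE_PRESENT 0x001u
#define PTE_DIRTY   0x040u

/*
 * Set the dirty bit only if the entry is still present.  If another
 * CPU cleared the entry between our load and our store, the
 * compare-exchange fails, 'old' is refreshed with the current value,
 * and the present check is repeated.
 */
static bool pte_mark_dirty_if_present(_Atomic uint32_t *slot)
{
	uint32_t old = atomic_load(slot);

	do {
		if (!(old & PTE_PRESENT))
			return false;	/* entry went away: do not touch it */
	} while (!atomic_compare_exchange_weak(slot, &old, old | PTE_DIRTY));

	return true;
}
```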

Any comments on how this would affect PPC?

* Re: [RFC] atomic pte updates for x86 smp
  2000-10-12  4:06           ` David S. Miller
  2000-10-12  4:31             ` Cort Dougan
@ 2000-10-12  4:37             ` Benjamin C.R. LaHaise
  1 sibling, 0 replies; 20+ messages in thread
From: Benjamin C.R. LaHaise @ 2000-10-12  4:37 UTC (permalink / raw)
  To: David S. Miller; +Cc: torvalds, tytso, linux-kernel, linux-mm

On Wed, 11 Oct 2000, David S. Miller wrote:

>    It's safe because of how x86 hardware works
> 
> What about other platforms?

If atomic ops don't work, then software dirty bits are still an option
(read as: it shouldn't break RISC CPUs).
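
A small sketch of what software dirty bits amount to, under assumed names
and bit positions (PTE_HW_WRITE, PTE_SW_WRITE, PTE_SW_DIRTY and the helper
functions are all hypothetical).  A writable mapping is first installed
without the hardware write bit, so the first store faults and the fault
handler records the dirty state itself.  The sketch assumes the caller
already holds whatever lock protects the page table:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments for illustration only. */
#define PTE_PRESENT  0x001u
#define PTE_HW_WRITE 0x002u	/* the bit the MMU actually checks */
#define PTE_SW_WRITE 0x200u	/* software: mapping may be written */
#define PTE_SW_DIRTY 0x400u	/* software: page has been written */

/* Install a writable mapping in its clean state: stores will fault. */
static uint32_t make_clean_writable(uint32_t frame_bits)
{
	return frame_bits | PTE_PRESENT | PTE_SW_WRITE;
}

/*
 * Write-fault path.  Returns true if the fault was only the dirty
 * bookkeeping trap and the store can be retried.
 */
static bool handle_write_fault(uint32_t *pte)
{
	if (!(*pte & PTE_PRESENT) || !(*pte & PTE_SW_WRITE))
		return false;	/* a genuine protection fault */
	*pte |= PTE_SW_DIRTY | PTE_HW_WRITE;
	return true;
}

/* After writeback: drop dirty and re-arm the trap for the next store. */
static uint32_t mark_clean(uint32_t pte)
{
	return pte & ~(PTE_SW_DIRTY | PTE_HW_WRITE);
}
```

The cost is one extra fault per clean-to-dirty transition.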

		-ben


* Re: [RFC] atomic pte updates for x86 smp
  2000-10-12  4:03         ` Benjamin C.R. LaHaise
  2000-10-12  4:06           ` David S. Miller
@ 2000-10-12  6:42           ` Linus Torvalds
  2000-10-12  8:13             ` Ingo Molnar
  2000-10-12 15:10             ` Benjamin C.R. LaHaise
  1 sibling, 2 replies; 20+ messages in thread
From: Linus Torvalds @ 2000-10-12  6:42 UTC (permalink / raw)
  To: Benjamin C.R. LaHaise; +Cc: tytso, linux-kernel, linux-mm, MOLNAR Ingo

On Thu, 12 Oct 2000, Benjamin C.R. LaHaise wrote:
> 
> Note the fragment above those portions of the patch where the
> pte_xchg_clear is done on the page table: this results in a page fault
> for any other cpu that looks at the pte while it is unavailable.

Ok, I see..

Hmm.. That's a singularly ugly interface, though - it all looks very
x86-specific. Things like "pte_xchg_clear()" look just a bit too obviously
like the name only makes sense due to the x86 implementation. So I'd like
to change the naming to be more about the design and less about the
implementation..

(It also doesn't make sense to me that you call the "clear the write bit"
thing "atomic_pte_wrprotect()", but you call the "clear the dirty bit"
"pte_test_and_clear_dirty()" - why not the same naming scheme for the two 
things?). 

I also have this suspicion that if this was done right, we should be able
to clean up the 64-bit atomic stuff for the x86 PAE case - which does a
cmpxchg8b right now on PAE entries exactly because of atomicity reasons.

With your patch as it stands now, we'd end up basically always doing two
of them.

And looking at the patch I get this nagging feeling that if this was
really done right, we could get rid of that PAE special case for
set_pte(), because the issue with atomic updates on PAE really boils down
to pretty much the same thing as the issue of one atomic bit.

(Instead of doing an atomic 64-bit memory write, we would be doing the
atomic "pte_xchg_clear()" followed by two _non_atomic 32-bit writes where
the second write would set the present bit. Although maybe the erratum
about the PAE pgd entry not honoring the P bit correctly makes this be
unworkable).
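
A user-space sketch of that sequence, assuming C11 atomics and a
hypothetical struct pae_pte split into two 32-bit halves (the names
pae_get_and_clear and pae_set are made up for illustration).  The release
store on the low half stands in for the ordering that x86 gives to plain
stores, and pae_set() assumes the slot was cleared beforehand:

```c
#include <stdatomic.h>
#include <stdint.h>

#define PTE_PRESENT 0x001u	/* P bit lives in the low half */

/* Hypothetical split view of a 64-bit PAE entry. */
struct pae_pte {
	_Atomic uint32_t low;	/* flags (including P) and low address bits */
	_Atomic uint32_t high;	/* upper physical address bits */
};

/* Take the entry away: one atomic exchange on the half holding P. */
static uint64_t pae_get_and_clear(struct pae_pte *p)
{
	uint32_t low = atomic_exchange(&p->low, 0);
	uint32_t high = atomic_load_explicit(&p->high, memory_order_relaxed);

	return ((uint64_t)high << 32) | low;
}

/*
 * Install an entry into a slot that is currently clear.  The high
 * half is written first; the entry only becomes valid when the
 * second write publishes the low half with the present bit.
 */
static void pae_set(struct pae_pte *p, uint64_t val)
{
	atomic_store_explicit(&p->high, (uint32_t)(val >> 32),
			      memory_order_relaxed);
	atomic_store_explicit(&p->low, (uint32_t)val,
			      memory_order_release);
}
```

Because the slot is not-present for the whole window between the two
writes, no reader can observe a half-written entry as valid.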

Ingo? I'd really like you to take a long look at this patch for sanity,
especially wrt PAE.

After this patch, are there any cases where we do a "set_pte()" where the
PTE wasn't clear before? That might be a good sanity-test to add, just to
make sure. And I'd really like to speed up the PAE set_pte() - as far as I
can tell both set_pte and set_pmd really should be safe without the atomic
64-bit crap with your changes.

Why do I care?

Basically, I'd be a lot happier about this patch if it also solves another
problem - if the "lost dirty bits" patch automagically also solves the
"64-bit atomic PTE" issue for the PAE case, then I will just feel a lot
happier about the fact that the solution is not just a specific hack for
handling "dirty", but a real change that makes conceptual sense for two
unrelated problems.

Because this, as always, is my final test for a "GoodDesign(tm)" patch: if
it solves just one problem it's a bug-fix, but if it solves two problems
it is the "RightThing(tm)" to do. And bug-fixes are a dime a dozen. Good
design is something to be admired.

What do you say, Ben? Do you think your approach really would solve the
PAE atomicity issue too, or am I just expecting too much?

		Linus


* Re: [RFC] atomic pte updates for x86 smp
  2000-10-12  6:42           ` Linus Torvalds
@ 2000-10-12  8:13             ` Ingo Molnar
  2000-10-12  8:56               ` David S. Miller
  2000-10-12 15:10             ` Benjamin C.R. LaHaise
  1 sibling, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2000-10-12  8:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin C.R. LaHaise, Theodore Y. Ts'o, linux-kernel,
	MM mailing list

On Wed, 11 Oct 2000, Linus Torvalds wrote:

> (Instead of doing an atomic 64-bit memory write, we would be doing the
> atomic "pte_xchg_clear()" followed by two _non_atomic 32-bit writes where
> the second write would set the present bit. Although maybe the erratum
> about the PAE pgd entry not honoring the P bit correctly makes this be
> unworkable).
> 
> Ingo? I'd really like you to take a long look at this patch for sanity,
> especially wrt PAE.

the PAE pgd 'anomaly' should not affect this case, because we never clear
either user-space pgds or user-space pmds in PAE mode. Unless we start
swapping pagetables i don't think this will ever happen in the future. The
PAE anomaly only affects the four top-level pgds, so even if we started
swapping pagetables, we'll never have to swap the pgds themselves.

i completely agree with the need to clean the pte-setting atomicity
interface up. And getting rid of cmpxchg8b will be a definite performance
(and GCC-optimization) improvement.

> After this patch, are there any cases where we do a "set_pte()" where
> the PTE wasn't clear before? That might be a good sanity-test to add,
> just to make sure. And I'd really like to speed up the PAE set_pte() -
> as far as I can tell both set_pte and set_pmd really should be safe
> without the atomic 64-bit crap with your changes.

yep, the two 32-bit writes idea is very nice - this should be safe - and
there isn't even any need for any barriers (except optimization barrier),
given that writes are strongly ordered on x86.

my gut feeling is that all these things will only benefit PAE support, and
the risk of those changes is low, none of those should bite us in the
future, design-wise. And it's also a nice speedup. And after this we could
finally get rid of the 'unsigned long long' as well and just define two
32-bit fields in pte.

	Ingo


* Re: [RFC] atomic pte updates for x86 smp
  2000-10-12  8:13             ` Ingo Molnar
@ 2000-10-12  8:56               ` David S. Miller
  2000-10-12 10:05                 ` Ingo Molnar
  0 siblings, 1 reply; 20+ messages in thread
From: David S. Miller @ 2000-10-12  8:56 UTC (permalink / raw)
  To: mingo; +Cc: torvalds, blah, tytso, linux-kernel, linux-mm

   the PAE pgd 'anomaly' should not affect this case, because we never
   clear either user-space pgds or user-space pmds in PAE mode

Eh?

munmap() --> clear_page_tables() --> free_one_pgd() --> pgd_clear

Later,
David S. Miller
davem@redhat.com

* Re: [RFC] atomic pte updates for x86 smp
  2000-10-12  8:56               ` David S. Miller
@ 2000-10-12 10:05                 ` Ingo Molnar
  2000-10-12 11:10                   ` Ingo Molnar
  0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2000-10-12 10:05 UTC (permalink / raw)
  To: David S. Miller
  Cc: Linus Torvalds, blah, Theodore Y. Ts'o, linux-kernel,
	MM mailing list

On Thu, 12 Oct 2000, David S. Miller wrote:

>    clear either user-space pgds or user-space pmds in PAE mode
> 
> Eh?
> 
> munmap() --> clear_page_tables() --> free_one_pgd() --> pgd_clear

you are right, i was focused too much on the swapping case. I don't think
munmap() is a problem in the PAE case. pgd_clear() should stay a 64-bit
operation (like in Ben's patch) because we could get a legitimate TLB
flush between two 32-bit writes. (the 4 pgd entries are otherwise cached
in the CPU core, only TLB flushes reload them.)

	Ingo




* Re: [RFC] atomic pte updates for x86 smp
  2000-10-12 10:05                 ` Ingo Molnar
@ 2000-10-12 11:10                   ` Ingo Molnar
  0 siblings, 0 replies; 20+ messages in thread
From: Ingo Molnar @ 2000-10-12 11:10 UTC (permalink / raw)
  To: David S. Miller
  Cc: Linus Torvalds, blah, Theodore Y. Ts'o, linux-kernel,
	MM mailing list

On Thu, 12 Oct 2000, Ingo Molnar wrote:

> [...] pgd_clear() should stay a 64-bit operation [...]

even this isn't strictly necessary - pgds and pmds are allocated in 'low
memory', and thus a simple 32-bit write to the lower 32 bits of the pgd
entry is enough to clear a PAE pgd. But it still must be a special case
due to the pgd present-bit restriction.

	Ingo


* Re: [RFC] atomic pte updates for x86 smp
  2000-10-12  6:42           ` Linus Torvalds
  2000-10-12  8:13             ` Ingo Molnar
@ 2000-10-12 15:10             ` Benjamin C.R. LaHaise
  1 sibling, 0 replies; 20+ messages in thread
From: Benjamin C.R. LaHaise @ 2000-10-12 15:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: tytso, linux-kernel, linux-mm, MOLNAR Ingo

On Wed, 11 Oct 2000, Linus Torvalds wrote:

> 
> On Thu, 12 Oct 2000, Benjamin C.R. LaHaise wrote:
> > 
> > Note the fragment above those portions of the patch where the
> > pte_xchg_clear is done on the page table: this results in a page fault
> > for any other cpu that looks at the pte while it is unavailable.
> 
> Ok, I see..
> 
> Hmm.. That's a singularly ugly interface, though - it all looks very
> x86-specific. Things like "pte_xchg_clear()" look just a bit too obviously
> like the name only makes sense due to the x86 implementation. So I'd like
> to change the naming to be more about the design and less about the
> implementation..

How about pte_get_and_clear?

> (It also doesn't make sense to me that you call the "clear the write bit"
> thing "atomic_pte_wrprotect()", but you call the "clear the dirty bit"
> "pte_test_and_clear_dirty()" - why not the same naming scheme for the two 
> things?). 

*nod*

> I also have this suspicion that if this was done right, we should be able
> to clean up the 64-bit atomic stuff for the x86 PAE case - which does a
> cmpxchg8b right now on PAE entries exactly because of atomicity reasons.
> 
> With your patch as it stands now, we'd end up basically always doing two
> of them.
> 
> And looking at the patch I get this nagging feeling that if this was
> really done right, we could get rid of that PAE special case for
> set_pte(), because the issue with atomic updates on PAE really boils down
> to pretty much the same thing as the issue of one atomic bit.

> (Instead of doing an atomic 64-bit memory write, we would be doing the
> atomic "pte_xchg_clear()" followed by two _non_atomic 32-bit writes where
> the second write would set the present bit. Although maybe the erratum
> about the PAE pgd entry not honoring the P bit correctly makes this be
> unworkable).

As Ingo pointed out, this is only a problem for the pgd; we're safe so
long as atomic operations are used on the present bit for pte's.  I think
we can completely eliminate the cmpxchg8b for ptes by using xchg on the
low 32-bit word containing the P bit and non-atomic ops on the high 32-bit
word.  This should be much better!

...
> What do you say, Ben? Do you think your approach really would solve the
> PAE atomicity issue too, or am I just expecting too much?

These are good ideas.  I'll go back and rework the patch for PAE stuff and
see what kind of results turn out.

		-ben


Thread overview: 20+ messages
     [not found] <200010090419.e994JQT09775@trampoline.thunk.org>
2000-10-10 20:53 ` Updated 2.4 TODO List Rik van Riel
2000-10-11  0:06   ` 2.4.0test9 vm: disappointing streaming i/o under load Chris Evans
2000-10-11 11:38     ` Eric Lowe
2000-10-11 20:59       ` Chris Evans
2000-10-11 22:10         ` Roger Larsson
2000-10-11 22:46           ` Chris Evans
2000-10-13 16:57             ` Rik van Riel
2000-10-11 18:38   ` Updated 2.4 TODO List tytso
2000-10-11 23:52     ` [RFC] atomic pte updates for x86 smp Ben LaHaise
2000-10-12  0:09       ` Linus Torvalds
2000-10-12  4:03         ` Benjamin C.R. LaHaise
2000-10-12  4:06           ` David S. Miller
2000-10-12  4:31             ` Cort Dougan
2000-10-12  4:37             ` Benjamin C.R. LaHaise
2000-10-12  6:42           ` Linus Torvalds
2000-10-12  8:13             ` Ingo Molnar
2000-10-12  8:56               ` David S. Miller
2000-10-12 10:05                 ` Ingo Molnar
2000-10-12 11:10                   ` Ingo Molnar
2000-10-12 15:10             ` Benjamin C.R. LaHaise
