* 3rd version of R/W mmap_sem patch available [not found] <Pine.LNX.4.33.0103191802330.2076-100000@mikeg.weiden.de> @ 2001-03-20 1:56 ` Rik van Riel 2001-03-19 22:46 ` Linus Torvalds 2001-03-20 2:46 ` Linus Torvalds 0 siblings, 2 replies; 24+ messages in thread From: Rik van Riel @ 2001-03-20 1:56 UTC (permalink / raw) To: Mike Galbraith; +Cc: Linus Torvalds, linux-mm, linux-kernel On Mon, 19 Mar 2001, Mike Galbraith wrote: > @@ -1135,6 +1170,7 @@ [large patch] I've been finding small bugs in both my late-night code and in Mike's code and have redone the changes in do_anonymous_page(), do_no_page() and do_swap_page() much more carefully... Now the code is beautiful and it might even be bugfree ;) If you feel particularly adventurous, please help me test the patch; it is available from: http://www.surriel.com/patches/2.4/2.4.2-ac20-rwmmap_sem3 regards, Rik -- Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 1:56 ` 3rd version of R/W mmap_sem patch available Rik van Riel @ 2001-03-19 22:46 ` Linus Torvalds 2001-03-20 2:46 ` Linus Torvalds 1 sibling, 0 replies; 24+ messages in thread From: Linus Torvalds @ 2001-03-19 22:46 UTC (permalink / raw) To: Rik van Riel; +Cc: Mike Galbraith, linux-mm, linux-kernel > Now the code is beautiful and it might even be bugfree ;) I'm applying this to my tree - I'm not exactly comfortable with this during the 2.4.x timeframe, but at the same time I'm even less comfortable with the current alternative, which is to make the regular semaphores fairer (we tried it once, and the implementation had problems, I'm not going to try that again during 2.4.x). Besides, the fair semaphores would potentially slow things down, while this potentially speeds things up. So.. It looks obvious enough. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 1:56 ` 3rd version of R/W mmap_sem patch available Rik van Riel 2001-03-19 22:46 ` Linus Torvalds @ 2001-03-20 2:46 ` Linus Torvalds 2001-03-20 4:15 ` Marcelo Tosatti ` (2 more replies) 1 sibling, 3 replies; 24+ messages in thread From: Linus Torvalds @ 2001-03-20 2:46 UTC (permalink / raw) To: Rik van Riel Cc: Mike Galbraith, linux-mm, linux-kernel, Manfred Spraul, MOLNAR Ingo

There is a 2.4.3-pre5 in the test-directory on ftp.kernel.org.

The complete changelog is appended, but the biggest recent change is the mmap_sem change, which I updated with new locking rules for pte/pmd_alloc to avoid the race on the actual page table build.

This has only been tested on i386 without PAE, and is known to break other architectures. Ingo, mind checking what PAE needs? Generally, the changes are simple, and really only imply changing the pte/pmd allocation functions to _only_ allocate (ie removing the stuff that actually modifies the page tables, as that is now handled by generic code), and to make sure that the "pgd/pmd_populate()" functions do the right thing.

I have also removed the xxx_kernel() functions - for architectures that need them, I suspect that the right approach is to just make the "populate" functions notice when "mm" is "init_mm", the kernel context. That removed a lot of duplicate code that had little good reason to exist.

This pre-release is meant mainly as a synchronization point for mm developers, not for generic use.

Thanks,

Linus

-----
-pre5:
- Rik van Riel and others: mm rw-semaphore (ps/top ok when swapping)
- IDE: 256 sectors at a time is legal, but apparently confuses some drives. Max out at 255 sectors instead.
- Petko Manolov: USB pegasus driver update
- make the boottime memory map printout at least almost readable.
- USB driver updates
- pte_alloc()/pmd_alloc() need page_table_lock.
-pre4:
- Petr Vandrovec, Al Viro: dentry revalidation fixes
- Stephen Tweedie / Manfred Spraul: kswapd and ptrace race
- Neil Brown: nfsd/rpc/raid cleanups and fixes

-pre3:
- Alan Cox: continued merging
- Urban Widmark: smbfs fix (d_add on already hashed dentry - no-no).
- Andrew Morton: 3c59x update
- Jeff Garzik: network driver cleanups and fixes
- Gerard Roudier: sym-ncr drivers update
- Jens Axboe: more loop cleanups and fixes
- David Miller: sparc update, some networking fixes

-pre2:
- Jens Axboe: fix loop device deadlocks
- Greg KH: USB updates
- Alan Cox: continued merging
- Tim Waugh: parport and documentation updates
- Cort Dougan: PowerPC merge
- Jeff Garzik: network driver updates
- Justin Gibbs: new and much improved aic7xxx driver 6.1.5

-pre1:
- Chris Mason: reiserfs, another null bytes bug
- Andrea Arcangeli: make SMP Athlon build
- Alexander Zarochentcev: reiserfs directory fsync SMP locking fix
- Jeff Garzik: PCI network driver updates
- Alan Cox: continued merging
- Ingo Molnar: fix RAID AUTORUN ioctl, scheduling improvements
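The locking rule described in the announcement above - allocation functions that only allocate, with the generic code taking page_table_lock to populate the tables - closes the window where two threads fault on the same missing page table at once. Here is a user-space sketch of that pattern, with a pthread mutex and calloc() standing in for page_table_lock and pmd_alloc_one(); the names are illustrative, not the kernel API:

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static void *pmd_slot;          /* stands in for one pgd entry */
static int duplicates_freed;    /* losers of the race */

/* One "page fault": allocate outside the lock, then take the lock,
 * re-check the slot, and either install our table or free it. */
static void *fault_thread(void *unused)
{
    void *table = calloc(1, 64);        /* "pmd_alloc_one()": allocate only */

    (void)unused;
    pthread_mutex_lock(&table_lock);
    if (pmd_slot == NULL) {
        pmd_slot = table;               /* "pgd_populate()": we won */
    } else {
        free(table);                    /* lost the race: drop our copy */
        duplicates_freed++;
    }
    pthread_mutex_unlock(&table_lock);
    return NULL;
}

/* Run n concurrent faults on the same slot; returns how many duplicate
 * allocations had to be thrown away (always n - 1). */
int race_demo(int n)
{
    pthread_t t[16];
    int i;

    if (n > 16)
        n = 16;
    pmd_slot = NULL;
    duplicates_freed = 0;
    for (i = 0; i < n; i++)
        pthread_create(&t[i], NULL, fault_thread, NULL);
    for (i = 0; i < n; i++)
        pthread_join(t[i], NULL);
    free(pmd_slot);
    pmd_slot = NULL;
    return duplicates_freed;
}
```

Whichever thread populates the slot first wins; every other thread notices under the lock and frees its duplicate. That is what lets the allocation itself (which may sleep) happen without the spinlock held.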
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 2:46 ` Linus Torvalds @ 2001-03-20 4:15 ` Marcelo Tosatti 2001-03-20 6:07 ` Linus Torvalds 2001-03-20 15:11 ` Andrew Morton 2001-03-25 14:53 ` [patch] pae-2.4.3-A4 Ingo Molnar 2 siblings, 1 reply; 24+ messages in thread From: Marcelo Tosatti @ 2001-03-20 4:15 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Mike Galbraith, linux-mm, linux-kernel, Manfred Spraul, MOLNAR Ingo

On Mon, 19 Mar 2001, Linus Torvalds wrote:
>
> There is a 2.4.3-pre5 in the test-directory on ftp.kernel.org.
>
> The complete changelog is appended, but the biggest recent change is the
> mmap_sem change, which I updated with new locking rules for pte/pmd_alloc
> to avoid the race on the actual page table build.
>
> This has only been tested on i386 without PAE, and is known to break other
> architectures. Ingo, mind checking what PAE needs? Generally, the changes
> are simple, and really only imply changing the pte/pmd allocation
> functions to _only_ allocate (ie removing the stuff that actually modifies
> the page tables, as that is now handled by generic code), and to make sure
> that the "pgd/pmd_populate()" functions do the right thing.
>
> I have also removed the xxx_kernel() functions - for architectures that
> need them, I suspect that the right approach is to just make the
> "populate" functions notice when "mm" is "init_mm", the kernel context.
> That removed a lot of duplicate code that had little good reason.
>
> This pre-release is meant mainly as a synchronization point for mm
> developers, not for generic use.
>
> Thanks,
>
> Linus
>
> -----
> -pre5:
> - Rik van Riel and others: mm rw-semaphore (ps/top ok when swapping)
> - IDE: 256 sectors at a time is legal, but apparently confuses some
>   drives. Max out at 255 sectors instead.

Could the IDE one cause corruption ?

EXT2-fs error (device ide0(3,1)): ext2_free_blocks: bit already cleared for block 6211

Just hit this now with pre3.
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 4:15 ` Marcelo Tosatti @ 2001-03-20 6:07 ` Linus Torvalds 2001-03-20 4:29 ` Marcelo Tosatti 0 siblings, 1 reply; 24+ messages in thread From: Linus Torvalds @ 2001-03-20 6:07 UTC (permalink / raw) To: Marcelo Tosatti Cc: Rik van Riel, Mike Galbraith, linux-mm, linux-kernel, Manfred Spraul, MOLNAR Ingo

On Tue, 20 Mar 2001, Marcelo Tosatti wrote:
>
> Could the IDE one cause corruption ?

Only with broken disks, as far as we know right now. There's been so far just one report of this problem, and nobody has heard back about which disk this was. And it should be noisy about it when it happens - complaining about lost interrupts and resetting the IDE controller.

So unlikely.

Linus
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 6:07 ` Linus Torvalds @ 2001-03-20 4:29 ` Marcelo Tosatti 2001-03-20 6:36 ` Linus Torvalds 0 siblings, 1 reply; 24+ messages in thread From: Marcelo Tosatti @ 2001-03-20 4:29 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Mike Galbraith, linux-mm, lkml, Manfred Spraul, MOLNAR Ingo

On Mon, 19 Mar 2001, Linus Torvalds wrote:
>
> On Tue, 20 Mar 2001, Marcelo Tosatti wrote:
> >
> > Could the IDE one cause corruption ?
>
> Only with broken disks, as far as we know right now. There's been so far
> just one report of this problem, and nobody has heard back about which
> disk this was.. And it should be noisy about it when it happens -
> complaining about lost interrupts and resetting the IDE controller.
>
> So unlikely.

Ok, so I think we have a problem. The disk is OK -- no lost interrupts or resets. Just this message on syslog, and pgbench complaining about corruption of the database.

I'll put pre5 in and try to reproduce the problem (I hit it while running pgbench + shmtest).

Damn.
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 4:29 ` Marcelo Tosatti @ 2001-03-20 6:36 ` Linus Torvalds 2001-03-20 7:03 ` Linus Torvalds 0 siblings, 1 reply; 24+ messages in thread From: Linus Torvalds @ 2001-03-20 6:36 UTC (permalink / raw) To: Marcelo Tosatti Cc: Rik van Riel, Mike Galbraith, linux-mm, lkml, Manfred Spraul, MOLNAR Ingo On Tue, 20 Mar 2001, Marcelo Tosatti wrote: > > I'll put pre5 in and try to reproduce the problem (I hitted it while > running pgbench + shmtest). I found a case where pre5 will forget to unlock the page_table_lock (in copy_page_range()), and one place where I had missed the lock altogether (in ioremap()), so I'll make a pre6 (neither is a problem on UP, though, so pre5 is not unusable - even on SMP it works really well until you hit the case where it forgets to unlock ;). Although I'd prefer to see somebody check out the other architectures, to do the (pretty trivial) changes to make them support properly threaded page faults. I'd hate to have two pre-patches without any input from other architectures.. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 6:36 ` Linus Torvalds @ 2001-03-20 7:03 ` Linus Torvalds 2001-03-20 8:19 ` Eric W. Biederman 0 siblings, 1 reply; 24+ messages in thread From: Linus Torvalds @ 2001-03-20 7:03 UTC (permalink / raw) To: Marcelo Tosatti Cc: Rik van Riel, Mike Galbraith, linux-mm, Manfred Spraul, MOLNAR Ingo On Mon, 19 Mar 2001, Linus Torvalds wrote: > > Although I'd prefer to see somebody check out the other architectures, > to do the (pretty trivial) changes to make them support properly > threaded page faults. I'd hate to have two pre-patches without any > input from other architectures.. These are the trivial fixes to make -pre5 be spinlock-debugging-clean and fix the missing unlock in copy_page_range(). I'd really like to hear from architecture maintainers if possible. Linus ---- diff -u --recursive --new-file pre5/linux/arch/i386/mm/ioremap.c linux/arch/i386/mm/ioremap.c --- pre5/linux/arch/i386/mm/ioremap.c Mon Mar 19 18:49:18 2001 +++ linux/arch/i386/mm/ioremap.c Mon Mar 19 21:25:16 2001 @@ -62,6 +62,7 @@ static int remap_area_pages(unsigned long address, unsigned long phys_addr, unsigned long size, unsigned long flags) { + int error; pgd_t * dir; unsigned long end = address + size; @@ -70,17 +71,21 @@ flush_cache_all(); if (address >= end) BUG(); + spin_lock(&init_mm.page_table_lock); do { pmd_t *pmd; pmd = pmd_alloc(&init_mm, dir, address); + error = -ENOMEM; if (!pmd) - return -ENOMEM; + break; if (remap_area_pmd(pmd, address, end - address, phys_addr + address, flags)) - return -ENOMEM; + break; + error = 0; address = (address + PGDIR_SIZE) & PGDIR_MASK; dir++; } while (address && (address < end)); + spin_unlock(&init_mm.page_table_lock); flush_tlb_all(); return 0; } diff -u --recursive --new-file pre5/linux/mm/memory.c linux/mm/memory.c --- pre5/linux/mm/memory.c Mon Mar 19 18:49:20 2001 +++ linux/mm/memory.c Mon Mar 19 22:49:39 2001 @@ -160,6 +160,7 @@ src_pgd = pgd_offset(src, address)-1; dst_pgd = 
pgd_offset(dst, address)-1; + spin_lock(&dst->page_table_lock); for (;;) { pmd_t * src_pmd, * dst_pmd; @@ -178,7 +179,6 @@ continue; } - spin_lock(&dst->page_table_lock); src_pmd = pmd_offset(src_pgd, address); dst_pmd = pmd_alloc(dst, dst_pgd, address); if (!dst_pmd) @@ -247,13 +247,10 @@ cont_copy_pmd_range: src_pmd++; dst_pmd++; } while ((unsigned long)src_pmd & PMD_TABLE_MASK); - spin_unlock(&dst->page_table_lock); } -out: - return 0; - out_unlock: spin_unlock(&src->page_table_lock); +out: spin_unlock(&dst->page_table_lock); return 0; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 7:03 ` Linus Torvalds @ 2001-03-20 8:19 ` Eric W. Biederman 0 siblings, 0 replies; 24+ messages in thread From: Eric W. Biederman @ 2001-03-20 8:19 UTC (permalink / raw) To: Linus Torvalds Cc: Marcelo Tosatti, Rik van Riel, Mike Galbraith, linux-mm, Manfred Spraul, MOLNAR Ingo

Linus Torvalds <torvalds@transmeta.com> writes:

> On Mon, 19 Mar 2001, Linus Torvalds wrote:
> >
> > Although I'd prefer to see somebody check out the other architectures,
> > to do the (pretty trivial) changes to make them support properly
> > threaded page faults. I'd hate to have two pre-patches without any
> > input from other architectures..
>
> These are the trivial fixes to make -pre5 be spinlock-debugging-clean and
> fix the missing unlock in copy_page_range(). I'd really like to hear from
> architecture maintainers if possible.
>
> Linus

Hmm. It looks like remap_area_pages doesn't return an error...

	- return 0;
	+ return error;

Eric
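The bug Eric spots is easy to reintroduce with the "save an error code and break out to the unlock" structure: once the early returns become breaks, the function has to return the saved error rather than a literal 0. A minimal stand-alone sketch of the corrected control flow - alloc_step() here is a made-up stand-in for pmd_alloc(), not any kernel function:

```c
#include <errno.h>
#include <stddef.h>

static int budget;              /* how many allocations may succeed */

/* Made-up stand-in for pmd_alloc(): fails once the budget runs out. */
static void *alloc_step(void)
{
    return budget-- > 0 ? (void *)1 : NULL;
}

/* The corrected control flow: set -ENOMEM before each step, break out
 * to a single unlock point on failure, and return the saved error
 * instead of a literal 0. */
int remap_sketch(int steps, int allow)
{
    int error = 0;
    int i;

    budget = allow;
    /* spin_lock(&init_mm.page_table_lock) would go here */
    for (i = 0; i < steps; i++) {
        error = -ENOMEM;
        if (alloc_step() == NULL)
            break;              /* fall through to the unlock */
        error = 0;
    }
    /* spin_unlock(&init_mm.page_table_lock) would go here */
    return error;               /* the -pre5 code returned 0 here, losing the code */
}
```

With a hard-coded `return 0;` at the end, a failed allocation would silently look like success to the caller, which is exactly what the one-line `return error;` fix prevents.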
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 2:46 ` Linus Torvalds 2001-03-20 4:15 ` Marcelo Tosatti @ 2001-03-20 15:11 ` Andrew Morton 2001-03-20 15:15 ` Jeff Garzik ` (2 more replies) 2 siblings, 3 replies; 24+ messages in thread From: Andrew Morton @ 2001-03-20 15:11 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-mm

Linus Torvalds wrote:
>
> There is a 2.4.3-pre5 in the test-directory on ftp.kernel.org.

I stared long and hard at expand_stack(). Its first access to vma->vm_start appears to be safe wrt other threads which can alter this, but perhaps the page_table_lock should be acquired earlier here?

We now have:

	free_pgd_slow();
	pmd_free_slow();
	pte_free_slow();

Could we please have consistent naming back?

in do_wp_page():

	spin_unlock(&mm->page_table_lock);
	new_page = alloc_page(GFP_HIGHUSER);
	if (!new_page)
		return -1;
	spin_lock(&mm->page_table_lock);

Should retake the spinlock before returning.

General comment: an expensive part of a pagefault is zeroing the new page. It'd be nice if we could drop the page_table_lock while doing the clear_user_page() and, if possible, copy_user_page() functions. Very nice.

read_zero_pagealigned()->zap_page_range()

The handling of mm->rss is racy. But I think it always has been?

This comment in mprotect.c:

	+ /* XXX: maybe this could be down_read ??? - Rik */

I don't think so. The decisions about where in the vma tree to place the new vma would be unprotected and racy.

Apart from that - I looked at it (x86-only) very closely and it seems solid.
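The do_wp_page() item above is an instance of a general rule: a function that is called with page_table_lock held and drops the lock to allocate must re-take it on every exit path, including the allocation-failure path. A user-space sketch of the invariant, with a pthread mutex standing in for the spinlock (all names here are illustrative):

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;

/* Called with `ptl` held; must return with `ptl` held whether or not
 * the allocation succeeded -- the failure path is the one the -pre5
 * do_wp_page() got wrong. */
static int wp_fault_sketch(int simulate_oom)
{
    void *new_page;

    pthread_mutex_unlock(&ptl);         /* the allocator may sleep */
    new_page = simulate_oom ? NULL : malloc(4096);
    pthread_mutex_lock(&ptl);           /* re-take on BOTH paths */

    if (new_page == NULL)
        return -1;                      /* caller still owns the lock */
    free(new_page);                     /* the copied page would be installed here */
    return 0;
}

/* Returns 1 if the lock is still held by us after the call.  A normal
 * (non-recursive) mutex makes this checkable: trylock fails with EBUSY
 * while the mutex is locked, so it only succeeds if the lock leaked. */
int lock_held_after(int simulate_oom)
{
    int stray;

    pthread_mutex_lock(&ptl);
    (void)wp_fault_sketch(simulate_oom);
    stray = (pthread_mutex_trylock(&ptl) == 0);
    pthread_mutex_unlock(&ptl);
    return !stray;
}
```

Dropping the `pthread_mutex_lock()` before the failure `return` reproduces the bug Andrew found: the caller would unlock a lock it no longer holds.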
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 15:11 ` Andrew Morton @ 2001-03-20 15:15 ` Jeff Garzik 2001-03-20 15:16 ` Jeff Garzik 2001-03-20 16:08 ` Linus Torvalds 2 siblings, 0 replies; 24+ messages in thread From: Jeff Garzik @ 2001-03-20 15:15 UTC (permalink / raw) To: Andrew Morton; +Cc: Linus Torvalds, linux-mm

--
Jeff Garzik   | May you have warm words on a cold evening,
Building 1024 | a full moon on a dark night,
MandrakeSoft  | and a smooth road all the way to your door.
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 15:11 ` Andrew Morton 2001-03-20 15:15 ` Jeff Garzik @ 2001-03-20 15:16 ` Jeff Garzik 2001-03-20 15:31 ` Andrew Morton 2001-03-20 16:08 ` Linus Torvalds 2 siblings, 1 reply; 24+ messages in thread From: Jeff Garzik @ 2001-03-20 15:16 UTC (permalink / raw) To: Andrew Morton; +Cc: Linus Torvalds, linux-mm

Andrew Morton wrote:
> General comment: an expensive part of a pagefault
> is zeroing the new page. It'd be nice if we could
> drop the page_table_lock while doing the clear_user_page()
> and, if possible, copy_user_page() functions. Very nice.

People have talked before about creating zero pages in the background, or creating them as a side effect of another operation (don't recall details), so yeah this is definitely an area where some optimizations could be done. I wouldn't want to do it until 2.5 though...

--
Jeff Garzik   | May you have warm words on a cold evening,
Building 1024 | a full moon on a dark night,
MandrakeSoft  | and a smooth road all the way to your door.
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 15:16 ` Jeff Garzik @ 2001-03-20 15:31 ` Andrew Morton 2001-03-21 1:59 ` Eric W. Biederman 0 siblings, 1 reply; 24+ messages in thread From: Andrew Morton @ 2001-03-20 15:31 UTC (permalink / raw) To: Jeff Garzik; +Cc: Linus Torvalds, linux-mm

Jeff Garzik wrote:
>
> Andrew Morton wrote:
> > General comment: an expensive part of a pagefault
> > is zeroing the new page. It'd be nice if we could
> > drop the page_table_lock while doing the clear_user_page()
> > and, if possible, copy_user_page() functions. Very nice.
>
> People have talked before about creating zero pages in the background,
> or creating them as a side effect of another operation (don't recall
> details), so yeah this is definitely an area where some optimizations
> could be done. I wouldn't want to do it until 2.5 though...

Actually, I did this for x86 last weekend :) Initial results are disappointing.

It creates a special uncachable mapping and sits there zeroing pages in a low-priority thread (also tried doing it in the idle task). It was made uncachable because a lot of the cost of clearing a page at fault time will be in the eviction of live, useful data. But clearing an uncachable page takes about eight times as long as clearing a cachable, but uncached one.

Now, if there was a hardware peripheral which could zero pages quickly, that'd be good.

I dunno. I need to test it on more workloads. I was using kernel compiles and these have a very low sleeping-on-IO to faulting-zeropages-in ratio. The walltime for kernel builds was unaltered. Certainly one can write silly applications which speed up by a factor of ten with this change.

I'll finish this work off sometime in the next week, stick it on the web.

But that's all orthogonal to my comment. We'd get significantly better threaded use of a single mm if we didn't block it while clearing and copying pages.
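The experiment Andrew describes can be sketched in user space: a background thread keeps a small pool of pre-zeroed pages, and the fault path takes one from the pool or falls back to clearing a page inline. This is only an illustration of the idea - the pool size, names and pthread plumbing are all made up, and it is not the patch being discussed:

```c
#include <pthread.h>
#include <stdlib.h>

#define POOL_PAGES 4
#define PAGE_BYTES 4096

static void *pool[POOL_PAGES];
static int avail;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t pool_full = PTHREAD_COND_INITIALIZER;

/* The background zeroer: clears pages outside the lock and parks them
 * in the pool until it is full. */
static void *zeroer(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&pool_lock);
    while (avail < POOL_PAGES) {
        void *page;

        pthread_mutex_unlock(&pool_lock);
        page = calloc(1, PAGE_BYTES);   /* the expensive clear */
        pthread_mutex_lock(&pool_lock);
        pool[avail++] = page;
    }
    pthread_cond_signal(&pool_full);
    pthread_mutex_unlock(&pool_lock);
    return NULL;
}

/* Fault path: grab a pre-zeroed page if one is parked, otherwise
 * clear one inline like the ordinary fault path does. */
static void *get_zeroed_page(void)
{
    void *page = NULL;

    pthread_mutex_lock(&pool_lock);
    if (avail > 0)
        page = pool[--avail];
    pthread_mutex_unlock(&pool_lock);
    return page ? page : calloc(1, PAGE_BYTES);
}

/* Fill the pool, take one page, and check that it really is zeroed. */
int pool_demo(void)
{
    pthread_t t;
    unsigned char *p;
    int i, ok = 1;

    pthread_create(&t, NULL, zeroer, NULL);
    pthread_mutex_lock(&pool_lock);
    while (avail < POOL_PAGES)
        pthread_cond_wait(&pool_full, &pool_lock);
    pthread_mutex_unlock(&pool_lock);
    pthread_join(t, NULL);

    p = get_zeroed_page();
    for (i = 0; i < PAGE_BYTES; i++)
        ok &= (p[i] == 0);
    free(p);
    while (avail > 0)                   /* tidy up the rest of the pool */
        free(pool[--avail]);
    return ok;
}
```

The cache-behaviour problem Andrew measured is invisible at this level, of course; the sketch only shows the producer/consumer structure of the pool.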
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 15:31 ` Andrew Morton @ 2001-03-21 1:59 ` Eric W. Biederman 0 siblings, 0 replies; 24+ messages in thread From: Eric W. Biederman @ 2001-03-21 1:59 UTC (permalink / raw) To: Andrew Morton; +Cc: Jeff Garzik, Linus Torvalds, linux-mm Andrew Morton <andrewm@uow.edu.au> writes: > Jeff Garzik wrote: > > > > Andrew Morton wrote: > > > General comment: an expensive part of a pagefault > > > is zeroing the new page. It'd be nice if we could > > > drop the page_table_lock while doing the clear_user_page() > > > and, if possible, copy_user_page() functions. Very nice. > > > > People have talked before about creating zero pages in the background, > > or creating them as a side effect of another operation (don't recall > > details), so yeah this is definitely an area where some optimizations > > could be done. I wouldn't want to do it until 2.5 though... > > Actually, I did this for x86 last weekend :) Initial results are > disappointing. > > It creates a special uncachable mapping and sits there > zeroing pages in a low-priority thread (also tried > doing it in the idle task). Well if you are going to mess with caching make the mapping write-combining on x86.. You get much better performance. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 15:11 ` Andrew Morton 2001-03-20 15:15 ` Jeff Garzik 2001-03-20 15:16 ` Jeff Garzik @ 2001-03-20 16:08 ` Linus Torvalds 2001-03-20 16:33 ` Andi Kleen 2001-03-22 10:24 ` Andrew Morton 2 siblings, 2 replies; 24+ messages in thread From: Linus Torvalds @ 2001-03-20 16:08 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-mm

On Wed, 21 Mar 2001, Andrew Morton wrote:
>
> I stared long and hard at expand_stack(). Its first access
> to vma->vm_start appears to be safe wrt other threads which
> can alter this, but perhaps the page_table_lock should be
> acquired earlier here?

Hmm.. Probably.

> We now have:
>
>	free_pgd_slow();
>	pmd_free_slow();
>	pte_free_slow();
>
> Could we please have consistent naming back?

Yes, I want to rename free_pgd_slow() to match the others.

> in do_wp_page():
>
>	spin_unlock(&mm->page_table_lock);
>	new_page = alloc_page(GFP_HIGHUSER);
>	if (!new_page)
>		return -1;
>	spin_lock(&mm->page_table_lock);
>
> Should retake the spinlock before returning.

Thanks, done.

> General comment: an expensive part of a pagefault
> is zeroing the new page. It'd be nice if we could
> drop the page_table_lock while doing the clear_user_page()
> and, if possible, copy_user_page() functions. Very nice.

I don't think it's worth it. We should have basically zero contention on this lock now, and adding complexity to try to release it sounds like a bad idea when the only way to make contention on it is (a) kswapd (only when paging stuff out) and (b) multiple threads (only when taking concurrent page faults).

So I don't really see the point of bothering.

> read_zero_pagealigned()->zap_page_range()
>
> The handling of mm->rss is racy. But I think
> it always has been?

It always has been. Right now I think we hold the page_table_lock over most of them, so the old patch to fix this might end up being needed in just that one place. Somebody interested in checking?

> This comment in mprotect.c:
>
>	+ /* XXX: maybe this could be down_read ??? - Rik */
>
> I don't think so. The decisions about where in the
> vma tree to place the new vma would be unprotected and racy.

I think we could potentially find it useful to have a

	down_write_start();
	down_write_commit();

in a few places, where "down_write_start()" only guarantees exclusion of other writers (and write-starters), while down_write_commit() waits for all the readers to go away.

Linus
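The down_write_start()/down_write_commit() idea Linus floats can be sketched as a two-phase writer lock: start() excludes other writers (and other starters) while readers keep running, and commit() then waits for the reader count to drain. The sketch below is deliberately simplified and hypothetical - nothing like this existed in 2.4, only the names come from the message above, and a real version would block in start() rather than fail, and stall new readers once a commit is pending:

```c
#include <pthread.h>

struct phased_rwsem {
    pthread_mutex_t lock;
    pthread_cond_t drained;
    int readers;
    int writer;                 /* some writer has "started" */
};

#define PHASED_RWSEM_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

void down_read_p(struct phased_rwsem *s)
{
    pthread_mutex_lock(&s->lock);
    s->readers++;               /* still admitted after a start() */
    pthread_mutex_unlock(&s->lock);
}

void up_read_p(struct phased_rwsem *s)
{
    pthread_mutex_lock(&s->lock);
    if (--s->readers == 0)
        pthread_cond_signal(&s->drained);
    pthread_mutex_unlock(&s->lock);
}

/* Phase 1: exclude other writers and write-starters only.  Returns 0
 * instead of blocking when a writer has already started, to keep the
 * sketch testable from a single thread. */
int down_write_start(struct phased_rwsem *s)
{
    int won;

    pthread_mutex_lock(&s->lock);
    won = !s->writer;
    if (won)
        s->writer = 1;
    pthread_mutex_unlock(&s->lock);
    return won;
}

/* Phase 2: wait for the readers to go away; afterwards the holder is
 * fully exclusive. */
void down_write_commit(struct phased_rwsem *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->readers > 0)
        pthread_cond_wait(&s->drained, &s->lock);
    pthread_mutex_unlock(&s->lock);
}

void up_write_p(struct phased_rwsem *s)
{
    pthread_mutex_lock(&s->lock);
    s->writer = 0;
    pthread_mutex_unlock(&s->lock);
}

/* Single-threaded walk through the intended semantics. */
int phased_demo(void)
{
    static struct phased_rwsem s = PHASED_RWSEM_INIT;

    down_read_p(&s);
    if (down_write_start(&s) != 1)      /* a writer may start under a reader */
        return 0;
    if (down_write_start(&s) != 0)      /* ...but excludes other writers */
        return 0;
    up_read_p(&s);
    down_write_commit(&s);              /* no readers left: returns at once */
    up_write_p(&s);
    return 1;
}
```

The appeal for something like mprotect() is that the expensive decisions could be made during the start/commit window while faults are still being serviced, with exclusivity taken only for the final tree update.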
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 16:08 ` Linus Torvalds @ 2001-03-20 16:33 ` Andi Kleen 2001-03-20 17:13 ` Linus Torvalds 2001-03-20 19:33 ` Rik van Riel 2001-03-22 10:24 ` Andrew Morton 1 sibling, 2 replies; 24+ messages in thread From: Andi Kleen @ 2001-03-20 16:33 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, linux-mm On Tue, Mar 20, 2001 at 05:08:36PM +0100, Linus Torvalds wrote: > > General comment: an expensive part of a pagefault > > is zeroing the new page. It'd be nice if we could > > drop the page_table_lock while doing the clear_user_page() > > and, if possible, copy_user_page() functions. Very nice. > > I don't think it's worth it. We should have basically zero contention on > this lock now, and adding complexity to try to release it sounds like a > bad idea when the only way to make contention on it is (a) kswapd (only > when paging stuff out) and (b) multiple threads (only when taking > concurrent page faults). Isn't (b) a rather common case in multi threaded applications ? -Andi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 16:33 ` Andi Kleen @ 2001-03-20 17:13 ` Linus Torvalds 0 siblings, 0 replies; 24+ messages in thread From: Linus Torvalds @ 2001-03-20 17:13 UTC (permalink / raw) To: Andi Kleen; +Cc: Andrew Morton, linux-mm

On Tue, 20 Mar 2001, Andi Kleen wrote:
> >
> > I don't think it's worth it. We should have basically zero contention on
> > this lock now, and adding complexity to try to release it sounds like a
> > bad idea when the only way to make contention on it is (a) kswapd (only
> > when paging stuff out) and (b) multiple threads (only when taking
> > concurrent page faults).
>
> Isn't (b) a rather common case in multi threaded applications ?

Not if you're performance-sensitive, I bet. If you take so many pagefaults that the page_table_lock ends up being a problem, you have _more_ problems than that.

We'll see. I will certainly re-consider if it ends up being shown to be a real problem. Spinlock contention tends to be very easy to see on kernel profiles, especially the way they're done on Linux/x86 (inline - so you see every contention place separately).

Linus
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 16:33 ` Andi Kleen 2001-03-20 17:13 ` Linus Torvalds @ 2001-03-20 19:33 ` Rik van Riel 2001-03-20 22:51 ` Andrew Morton 1 sibling, 1 reply; 24+ messages in thread From: Rik van Riel @ 2001-03-20 19:33 UTC (permalink / raw) To: Andi Kleen; +Cc: Linus Torvalds, Andrew Morton, linux-mm On Tue, 20 Mar 2001, Andi Kleen wrote: > On Tue, Mar 20, 2001 at 05:08:36PM +0100, Linus Torvalds wrote: > > > General comment: an expensive part of a pagefault > > > is zeroing the new page. It'd be nice if we could > > > drop the page_table_lock while doing the clear_user_page() > > > and, if possible, copy_user_page() functions. Very nice. > > > > I don't think it's worth it. We should have basically zero contention on > > this lock now, and adding complexity to try to release it sounds like a > > bad idea when the only way to make contention on it is (a) kswapd (only > > when paging stuff out) and (b) multiple threads (only when taking > > concurrent page faults). > > Isn't (b) a rather common case in multi threaded applications ? Multiple threads pagefaulting on the SAME page of anonymous memory at the same time ? I can imagine multiple threads pagefaulting on the same page of some mmaped database, but on the same page of anonymous memory ?? regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 19:33 ` Rik van Riel @ 2001-03-20 22:51 ` Andrew Morton 0 siblings, 0 replies; 24+ messages in thread From: Andrew Morton @ 2001-03-20 22:51 UTC (permalink / raw) To: Rik van Riel; +Cc: Andi Kleen, Linus Torvalds, linux-mm Rik van Riel wrote: > > On Tue, 20 Mar 2001, Andi Kleen wrote: > > On Tue, Mar 20, 2001 at 05:08:36PM +0100, Linus Torvalds wrote: > > > > General comment: an expensive part of a pagefault > > > > is zeroing the new page. It'd be nice if we could > > > > drop the page_table_lock while doing the clear_user_page() > > > > and, if possible, copy_user_page() functions. Very nice. > > > > > > I don't think it's worth it. We should have basically zero contention on > > > this lock now, and adding complexity to try to release it sounds like a > > > bad idea when the only way to make contention on it is (a) kswapd (only > > > when paging stuff out) and (b) multiple threads (only when taking > > > concurrent page faults). > > > > Isn't (b) a rather common case in multi threaded applications ? > > Multiple threads pagefaulting on the SAME page of anonymous > memory at the same time ? > > I can imagine multiple threads pagefaulting on the same page > of some mmaped database, but on the same page of anonymous > memory ?? err... If we hold mm->page_table_lock for a long time, that's going to block all faulting threads which use this mm, regardless of which page (or vma) they're faulting on, no? I guess I've kind of lost the plot on why this patch exists in the first place. Was it simply to prevent vmstat from getting stuck, or was it because we were seeing significant throughput degradation for some workload? If the latter, what workload was it? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 3rd version of R/W mmap_sem patch available 2001-03-20 16:08 ` Linus Torvalds 2001-03-20 16:33 ` Andi Kleen @ 2001-03-22 10:24 ` Andrew Morton 1 sibling, 0 replies; 24+ messages in thread From: Andrew Morton @ 2001-03-22 10:24 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-mm Linus Torvalds wrote: > > On Wed, 21 Mar 2001, Andrew Morton wrote: > > > > The handling of mm->rss is racy. But I think > > it always has been? > > It always has been. Right now I think we hold the page_table_lock over > most of them, that the old patch to fix this might end up being just that > one place. Somebody interested in checking? > There were two places which needed fixing, and it was pretty trivial. With this patch, mm_struct.rss handling is racefree on x86. Some other archs (notably ia64/ia32) are still a little racy on the exec() path. I was sorely tempted to make put_dirty_page() require that tsk->mm->page_table_lock be held by the caller, which would save a bunch of locking. But put_dirty_page() is used by architectures which I don't understand. The patch also includes a feeble attempt to document some locking rules. --- linux-2.4.3-pre6/include/linux/sched.h Thu Mar 22 18:52:52 2001 +++ lk/include/linux/sched.h Thu Mar 22 19:41:06 2001 @@ -209,9 +209,12 @@ atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ int map_count; /* number of VMAs */ struct rw_semaphore mmap_sem; - spinlock_t page_table_lock; + spinlock_t page_table_lock; /* Protects task page tables and mm->rss */ - struct list_head mmlist; /* List of all active mm's */ + struct list_head mmlist; /* List of all active mm's. 
These are globally strung + * together off init_mm.mmlist, and are protected + * by mmlist_lock + */ unsigned long start_code, end_code, start_data, end_data; unsigned long start_brk, brk, start_stack; --- linux-2.4.3-pre6/fs/exec.c Thu Mar 22 18:52:52 2001 +++ lk/fs/exec.c Thu Mar 22 19:51:35 2001 @@ -252,6 +252,8 @@ /* * This routine is used to map in a page into an address space: needed by * execve() for the initial stack and environment pages. + * + * tsk->mmap_sem is held for writing. */ void put_dirty_page(struct task_struct * tsk, struct page *page, unsigned long address) { @@ -291,6 +293,7 @@ unsigned long stack_base; struct vm_area_struct *mpnt; int i; + unsigned long rss_increment = 0; stack_base = STACK_TOP - MAX_ARG_PAGES*PAGE_SIZE; @@ -322,11 +325,14 @@ struct page *page = bprm->page[i]; if (page) { bprm->page[i] = NULL; - current->mm->rss++; + rss_increment++; put_dirty_page(current,page,stack_base); } stack_base += PAGE_SIZE; } + spin_lock(¤t->mm->page_table_lock); + current->mm->rss += rss_increment; + spin_unlock(¤t->mm->page_table_lock); up_write(¤t->mm->mmap_sem); return 0; --- linux-2.4.3-pre6/mm/memory.c Thu Mar 22 18:52:52 2001 +++ lk/mm/memory.c Thu Mar 22 21:13:29 2001 @@ -374,7 +374,6 @@ address = (address + PGDIR_SIZE) & PGDIR_MASK; dir++; } while (address && (address < end)); - spin_unlock(&mm->page_table_lock); /* * Update rss for the mm_struct (not necessarily current->mm) * Notice that rss is an unsigned long. 
@@ -383,6 +382,7 @@ mm->rss -= freed; else mm->rss = 0; + spin_unlock(&mm->page_table_lock); } @@ -792,6 +792,8 @@ * - flush the old one * - update the page tables * - inform the TLB about the new one + * + * We hold the mm semaphore for reading and vma->vm_mm->page_table_lock */ static inline void establish_pte(struct vm_area_struct * vma, unsigned long address, pte_t *page_table, pte_t entry) { @@ -800,6 +802,9 @@ update_mmu_cache(vma, address, entry); } +/* + * We hold the mm semaphore for reading and vma->vm_mm->page_table_lock + */ static inline void break_cow(struct vm_area_struct * vma, struct page * old_page, struct page * new_page, unsigned long address, pte_t *page_table) { @@ -1024,8 +1029,7 @@ } /* - * We hold the mm semaphore and the page_table_lock on entry - * and exit. + * We hold the mm semaphore and the page_table_lock on entry and exit. */ static int do_swap_page(struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, --- linux-2.4.3-pre6/mm/mmap.c Thu Mar 22 18:52:52 2001 +++ lk/mm/mmap.c Thu Mar 22 19:19:08 2001 @@ -889,8 +889,8 @@ spin_lock(&mm->page_table_lock); mpnt = mm->mmap; mm->mmap = mm->mmap_avl = mm->mmap_cache = NULL; - spin_unlock(&mm->page_table_lock); mm->rss = 0; + spin_unlock(&mm->page_table_lock); mm->total_vm = 0; mm->locked_vm = 0; --- linux-2.4.3-pre6/mm/vmscan.c Tue Jan 16 07:36:49 2001 +++ lk/mm/vmscan.c Thu Mar 22 19:32:11 2001 @@ -25,16 +25,15 @@ #include <asm/pgalloc.h> /* - * The swap-out functions return 1 if they successfully - * threw something out, and we got a free page. It returns - * zero if it couldn't do anything, and any other value - * indicates it decreased rss, but the page was shared. + * The swap-out function returns 1 if it successfully + * scanned all the pages it was asked to (`count'). + * It returns zero if it couldn't do anything, * - * NOTE! If it sleeps, it *must* return 1 to make sure we - * don't continue with the swap-out. 
Otherwise we may be - * using a process that no longer actually exists (it might - * have died while we slept). + * rss may decrease because pages are shared, but this + * doesn't count as having freed a page. */ + +/* mm->page_table_lock is held. mmap_sem is not held */ static void try_to_swap_out(struct mm_struct * mm, struct vm_area_struct* vma, unsigned long address, pte_t * page_table, struct page *page) { pte_t pte; @@ -129,6 +128,7 @@ return; } +/* mm->page_table_lock is held. mmap_sem is not held */ static int swap_out_pmd(struct mm_struct * mm, struct vm_area_struct * vma, pmd_t *dir, unsigned long address, unsigned long end, int count) { pte_t * pte; @@ -165,6 +165,7 @@ return count; } +/* mm->page_table_lock is held. mmap_sem is not held */ static inline int swap_out_pgd(struct mm_struct * mm, struct vm_area_struct * vma, pgd_t *dir, unsigned long address, unsigned long end, int count) { pmd_t * pmd; @@ -194,6 +195,7 @@ return count; } +/* mm->page_table_lock is held. mmap_sem is not held */ static int swap_out_vma(struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, int count) { pgd_t *pgdir; @@ -218,6 +220,9 @@ return count; } +/* + * Returns non-zero if we scanned all `count' pages + */ static int swap_out_mm(struct mm_struct * mm, int count) { unsigned long address; --- linux-2.4.3-pre6/mm/swapfile.c Sun Feb 25 17:37:14 2001 +++ lk/mm/swapfile.c Thu Mar 22 19:56:36 2001 @@ -209,6 +209,7 @@ * share this swap entry, so be cautious and let do_wp_page work out * what to do if a write is requested later. 
*/ +/* tasklist_lock and vma->vm_mm->page_table_lock are held */ static inline void unuse_pte(struct vm_area_struct * vma, unsigned long address, pte_t *dir, swp_entry_t entry, struct page* page) { @@ -234,6 +235,7 @@ ++vma->vm_mm->rss; } +/* tasklist_lock and vma->vm_mm->page_table_lock are held */ static inline void unuse_pmd(struct vm_area_struct * vma, pmd_t *dir, unsigned long address, unsigned long size, unsigned long offset, swp_entry_t entry, struct page* page) @@ -261,6 +263,7 @@ } while (address && (address < end)); } +/* tasklist_lock and vma->vm_mm->page_table_lock are held */ static inline void unuse_pgd(struct vm_area_struct * vma, pgd_t *dir, unsigned long address, unsigned long size, swp_entry_t entry, struct page* page) @@ -291,6 +294,7 @@ } while (address && (address < end)); } +/* tasklist_lock and vma->vm_mm->page_table_lock are held */ static void unuse_vma(struct vm_area_struct * vma, pgd_t *pgdir, swp_entry_t entry, struct page* page) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch] pae-2.4.3-A4 2001-03-20 2:46 ` Linus Torvalds 2001-03-20 4:15 ` Marcelo Tosatti 2001-03-20 15:11 ` Andrew Morton @ 2001-03-25 14:53 ` Ingo Molnar 2001-03-25 16:33 ` Russell King 2001-03-25 18:07 ` Linus Torvalds 2 siblings, 2 replies; 24+ messages in thread From: Ingo Molnar @ 2001-03-25 14:53 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan Cox, linux-mm, Linux Kernel List [-- Attachment #1: Type: TEXT/PLAIN, Size: 1360 bytes --] On Mon, 19 Mar 2001, Linus Torvalds wrote: > The complete changelog is appended, but the biggest recent change is > the mmap_sem change, which I updated with new locking rules for > pte/pmd_alloc to avoid the race on the actual page table build. > > This has only been tested on i386 without PAE, and is known to break > other architectures. Ingo, mind checking what PAE needs? [...] one nontrivial issue was that on PAE the pgd has to be installed with 'present' pgd entries, due to a CPU erratum. This means that the pgd_present() code in mm/memory.c, while correct theoretically, doesnt work with PAE. An equivalent solution is to use !pgd_none(), which also works with the PAE workaround. PAE mode could re-define pgd_present() to filter out the workaround - do you prefer this to the !pgd_none() solution? the rest was pretty straightforward. in any case, with the attached pae-2.4.3-A4 patch (against 2.4.3-pre7, applies to 2.4.2-ac24 cleanly as well) applied, 2.4.3-pre7 boots & works just fine on PAE 64GB-HIGHMEM and non-PAE kernels. - the patch also does another cleanup: removes various bad_pagetable code snippets all around the x86 tree, it's not needed anymore. This saves 8 KB RAM on x86 systems. - removed the last remaining *_kernel() macro. - fixed a minor clear_page() bug in pgalloc.h, gfp() could fail in the future. 
Ingo [-- Attachment #2: Type: TEXT/PLAIN, Size: 5875 bytes --] --- linux/mm/memory.c.orig Sun Mar 25 18:55:05 2001 +++ linux/mm/memory.c Sun Mar 25 18:55:07 2001 @@ -1295,7 +1295,7 @@ * Because we dropped the lock, we should re-check the * entry, as somebody else could have populated it.. */ - if (pgd_present(*pgd)) { + if (!pgd_none(*pgd)) { pmd_free(new); goto out; } @@ -1313,7 +1313,7 @@ */ pte_t *pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address) { - if (!pmd_present(*pmd)) { + if (pmd_none(*pmd)) { pte_t *new; /* "fast" allocation can happen without dropping the lock.. */ @@ -1329,7 +1329,7 @@ * Because we dropped the lock, we should re-check the * entry, as somebody else could have populated it.. */ - if (pmd_present(*pmd)) { + if (!pmd_none(*pmd)) { pte_free(new); goto out; } --- linux/include/asm-i386/pgalloc-3level.h.orig Sun Mar 25 18:55:05 2001 +++ linux/include/asm-i386/pgalloc-3level.h Sun Mar 25 19:23:02 2001 @@ -21,12 +21,12 @@ { unsigned long *ret; - if ((ret = pmd_quicklist) != NULL) { + ret = pmd_quicklist; + if (ret != NULL) { pmd_quicklist = (unsigned long *)(*ret); ret[0] = 0; pgtable_cache_size--; - } else - ret = (unsigned long *)get_pmd_slow(); + } return (pmd_t *)ret; } @@ -41,5 +41,10 @@ { free_page((unsigned long)pmd); } + +#define pmd_free(pmd) pmd_free_fast(pmd) + +#define pgd_populate(pgd, pmd) \ + do { set_pgd(pgd, __pgd(1 + __pa(pmd))); __flush_tlb(); } while(0) #endif /* _I386_PGALLOC_3LEVEL_H */ --- linux/include/asm-i386/pgalloc.h.orig Sun Mar 25 18:56:25 2001 +++ linux/include/asm-i386/pgalloc.h Sun Mar 25 19:35:59 2001 @@ -11,7 +11,8 @@ #define pte_quicklist (current_cpu_data.pte_quick) #define pgtable_cache_size (current_cpu_data.pgtable_cache_sz) -#define pmd_populate(pmd, pte) set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte))) +#define pmd_populate(pmd, pte) \ + set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte))) #if CONFIG_X86_PAE # include <asm/pgalloc-3level.h> @@ -72,7 +73,8 @@ pte_t *pte; pte = (pte_t *) 
__get_free_page(GFP_KERNEL); - clear_page(pte); + if (pte) + clear_page(pte); return pte; } @@ -100,7 +102,6 @@ free_page((unsigned long)pte); } -#define pte_free_kernel(pte) pte_free_slow(pte) #define pte_free(pte) pte_free_slow(pte) #define pgd_free(pgd) free_pgd_slow(pgd) #define pgd_alloc() get_pgd_fast() --- linux/include/asm-i386/pgtable.h.orig Sun Mar 25 19:03:58 2001 +++ linux/include/asm-i386/pgtable.h Sun Mar 25 19:04:06 2001 @@ -243,12 +243,6 @@ /* page table for 0-4MB for everybody */ extern unsigned long pg0[1024]; -/* - * Handling allocation failures during page table setup. - */ -extern void __handle_bad_pmd(pmd_t * pmd); -extern void __handle_bad_pmd_kernel(pmd_t * pmd); - #define pte_present(x) ((x).pte_low & (_PAGE_PRESENT | _PAGE_PROTNONE)) #define pte_clear(xp) do { set_pte(xp, __pte(0)); } while (0) --- linux/arch/i386/mm/init.c.orig Sun Mar 25 19:02:05 2001 +++ linux/arch/i386/mm/init.c Sun Mar 25 19:02:42 2001 @@ -40,77 +40,6 @@ static unsigned long totalram_pages; static unsigned long totalhigh_pages; -/* - * BAD_PAGE is the page that is used for page faults when linux - * is out-of-memory. Older versions of linux just did a - * do_exit(), but using this instead means there is less risk - * for a process dying in kernel mode, possibly leaving an inode - * unused etc.. - * - * BAD_PAGETABLE is the accompanying page-table: it is initialized - * to point to BAD_PAGE entries. - * - * ZERO_PAGE is a special page that is used for zero-initialized - * data and COW. - */ - -/* - * These are allocated in head.S so that we get proper page alignment. - * If you change the size of these then change head.S as well. - */ -extern char empty_bad_page[PAGE_SIZE]; -#if CONFIG_X86_PAE -extern pmd_t empty_bad_pmd_table[PTRS_PER_PMD]; -#endif -extern pte_t empty_bad_pte_table[PTRS_PER_PTE]; - -/* - * We init them before every return and make them writable-shared. - * This guarantees we get out of the kernel in some more or less sane - * way. 
- */ -#if CONFIG_X86_PAE -static pmd_t * get_bad_pmd_table(void) -{ - pmd_t v; - int i; - - set_pmd(&v, __pmd(_PAGE_TABLE + __pa(empty_bad_pte_table))); - - for (i = 0; i < PAGE_SIZE/sizeof(pmd_t); i++) - empty_bad_pmd_table[i] = v; - - return empty_bad_pmd_table; -} -#endif - -static pte_t * get_bad_pte_table(void) -{ - pte_t v; - int i; - - v = pte_mkdirty(mk_pte_phys(__pa(empty_bad_page), PAGE_SHARED)); - - for (i = 0; i < PAGE_SIZE/sizeof(pte_t); i++) - empty_bad_pte_table[i] = v; - - return empty_bad_pte_table; -} - - - -void __handle_bad_pmd(pmd_t *pmd) -{ - pmd_ERROR(*pmd); - set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(get_bad_pte_table()))); -} - -void __handle_bad_pmd_kernel(pmd_t *pmd) -{ - pmd_ERROR(*pmd); - set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(get_bad_pte_table()))); -} - int do_check_pgt_cache(int low, int high) { int freed = 0; --- linux/arch/i386/kernel/head.S.orig Sun Mar 25 19:04:23 2001 +++ linux/arch/i386/kernel/head.S Sun Mar 25 19:04:57 2001 @@ -415,23 +415,6 @@ ENTRY(empty_zero_page) .org 0x5000 -ENTRY(empty_bad_page) - -.org 0x6000 -ENTRY(empty_bad_pte_table) - -#if CONFIG_X86_PAE - - .org 0x7000 - ENTRY(empty_bad_pmd_table) - - .org 0x8000 - -#else - - .org 0x7000 - -#endif /* * This starts the data section. Note that the above is all --- linux/MAINTAINERS.orig Sun Mar 25 19:23:13 2001 +++ linux/MAINTAINERS Sun Mar 25 19:24:46 2001 @@ -1454,6 +1454,11 @@ L: linux-x25@vger.kernel.org S: Maintained +X86 3-LEVEL PAGING (PAE) SUPPORT +P: Ingo Molnar +M: mingo@redhat.com +S: Maintained + Z85230 SYNCHRONOUS DRIVER P: Alan Cox M: alan@redhat.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch] pae-2.4.3-A4 2001-03-25 14:53 ` [patch] pae-2.4.3-A4 Ingo Molnar @ 2001-03-25 16:33 ` Russell King 2001-03-25 18:07 ` Linus Torvalds 1 sibling, 0 replies; 24+ messages in thread From: Russell King @ 2001-03-25 16:33 UTC (permalink / raw) To: Ingo Molnar; +Cc: Linus Torvalds, Alan Cox, linux-mm, Linux Kernel List On Sun, Mar 25, 2001 at 04:53:37PM +0200, Ingo Molnar wrote: > one nontrivial issue was that on PAE the pgd has to be installed with > 'present' pgd entries, due to a CPU erratum. This means that the > pgd_present() code in mm/memory.c, while correct theoretically, doesn't > work with PAE. An equivalent solution is to use !pgd_none(), which also > works with the PAE workaround. Certainly that's the way the original *_alloc routines used to work. In fact, ARM never had any need to implement the pmd_present() macros, since they were never referenced - only the pmd_none() macros were. However, I'm currently struggling with this change on ARM - so far after a number of hours trying to kick something into shape, I've not managed to even get to the stage where I get a kernel image to link, let alone the compilation to finish. One of my many dilemmas at the moment is how to allocate the page 0 PMD in pgd_alloc(), where we don't have a mm_struct to do the locking against. -- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch] pae-2.4.3-A4 2001-03-25 14:53 ` [patch] pae-2.4.3-A4 Ingo Molnar 2001-03-25 16:33 ` Russell King @ 2001-03-25 18:07 ` Linus Torvalds 2001-03-25 18:51 ` Ingo Molnar 1 sibling, 1 reply; 24+ messages in thread From: Linus Torvalds @ 2001-03-25 18:07 UTC (permalink / raw) To: Ingo Molnar; +Cc: Alan Cox, linux-mm, Linux Kernel List On Sun, 25 Mar 2001, Ingo Molnar wrote: > > one nontrivial issue was that on PAE the pgd has to be installed with > 'present' pgd entries, due to a CPU erratum. This means that the > pgd_present() code in mm/memory.c, while correct theoretically, doesn't > work with PAE. An equivalent solution is to use !pgd_none(), which also > works with the PAE workaround. Note that due to the very same erratum, we really should populate the PGD from the very beginning. See the other thread about how we currently fail to properly invalidate the TLB on other CPU's when we add a new PGD entry, exactly because the other CPU's are caching the "nonexistent" PGD entry that we just replaced. So my suggestion for PAE is: - populate in pgd_alloc() (ie just do the three "alloc_page()" calls to allocate the PMD's immediately) NOTE: This makes the race go away, and will actually speed things up as we will pretty much in practice always populate the PGD _anyway_, the way the VM areas are laid out. - make "pgd_present()" always return 1. NOTE: This will speed up the page table walkers anyway. It will also avoid the problem above. - make "free_pmd()" a no-op. All of the above will (a) simplify things (b) remove special cases and (c) remove actual and existing bugs. In fact, the reason why the missing TLB invalidate on PGD populate probably never happens in practice is exactly the fact that the PGD is always populated so fast that it's hard to make a test-case that shows this.
But it's still a bug - probably fairly easily triggered by a threaded program that is statically linked (so that the code loads at 0x40000000 and doesn't have the loader linked low - so the lowest PGD entry is not allocated until later). Does anybody see any problems with the above? It looks like the obvious fix. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch] pae-2.4.3-A4 2001-03-25 18:07 ` Linus Torvalds @ 2001-03-25 18:51 ` Ingo Molnar 0 siblings, 0 replies; 24+ messages in thread From: Ingo Molnar @ 2001-03-25 18:51 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan Cox, linux-mm, Linux Kernel List [-- Attachment #1: Type: TEXT/PLAIN, Size: 1782 bytes --] On Sun, 25 Mar 2001, Linus Torvalds wrote: > So my suggestion for PAE is: > > - populate in pgd_alloc() (ie just do the three "alloc_page()" calls to > allocate the PMD's immediately) > > NOTE: This makes the race go away, and will actually speed things up as > we will pretty much in practice always populate the PGD _anyway_, the > way the VM areas are laid out. > > - make "pgd_present()" always return 1. > > NOTE: This will speed up the page table walkers anyway. It will also > avoid the problem above. > > - make "free_pmd()" a no-op. > > All of the above will (a) simplify things (b) remove special cases and > (c) remove actual and existing bugs. yep, this truly makes things so much easier and cleaner! There was only one thing missing to make it work: - make "pgd_clear()" a no-op. [the reason for the slightly more complex pgd_alloc_slow() code is to support non-default virtual memory splits as well, where the number of user pgds is not necessarily 3.] Plus I took the opportunity to reduce the allocation size of PAE pgds. Their size is only 32 bytes, and we allocated a full page. Now the code kmalloc()s a 32-byte cacheline for the pgd. (there is a hardware constraint on alignment: this cacheline must be at least 16-byte aligned, which is true for the current kmalloc() code.) So the per-process cost is reduced by almost 4 KB. And I got rid of pgalloc-[2|3]level.h - with the pmds merged into the pgd logic the algorithmic difference between 2-level and this pseudo-3-level PAE paging is not that big anymore. The pgtable-[2|3]level.h files are still separate. 
the attached pae-2.4.3-B2 patch (against 2.4.3-pre7) compiles & boots just fine both in PAE and non-PAE mode. The patch removes 217 lines, and adds only 78 lines. Ingo [-- Attachment #2: Type: TEXT/PLAIN, Size: 11365 bytes --] --- linux/include/asm-i386/pgalloc-3level.h.orig Sun Mar 25 18:55:05 2001 +++ linux/include/asm-i386/pgalloc-3level.h Sun Mar 25 22:38:41 2001 @@ -1,45 +0,0 @@ -#ifndef _I386_PGALLOC_3LEVEL_H -#define _I386_PGALLOC_3LEVEL_H - -/* - * Intel Physical Address Extension (PAE) Mode - three-level page - * tables on PPro+ CPUs. Page-table allocation routines. - * - * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com> - */ - -extern __inline__ pmd_t *pmd_alloc_one(void) -{ - pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL); - - if (ret) - memset(ret, 0, PAGE_SIZE); - return ret; -} - -extern __inline__ pmd_t *pmd_alloc_one_fast(void) -{ - unsigned long *ret; - - if ((ret = pmd_quicklist) != NULL) { - pmd_quicklist = (unsigned long *)(*ret); - ret[0] = 0; - pgtable_cache_size--; - } else - ret = (unsigned long *)get_pmd_slow(); - return (pmd_t *)ret; -} - -extern __inline__ void pmd_free_fast(pmd_t *pmd) -{ - *(unsigned long *)pmd = (unsigned long) pmd_quicklist; - pmd_quicklist = (unsigned long *) pmd; - pgtable_cache_size++; -} - -extern __inline__ void pmd_free_slow(pmd_t *pmd) -{ - free_page((unsigned long)pmd); -} - -#endif /* _I386_PGALLOC_3LEVEL_H */ --- linux/include/asm-i386/pgalloc.h.orig Sun Mar 25 18:56:25 2001 +++ linux/include/asm-i386/pgalloc.h Sun Mar 25 23:11:23 2001 @@ -11,37 +11,56 @@ #define pte_quicklist (current_cpu_data.pte_quick) #define pgtable_cache_size (current_cpu_data.pgtable_cache_sz) -#define pmd_populate(pmd, pte) set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte))) - -#if CONFIG_X86_PAE -# include <asm/pgalloc-3level.h> -#else -# include <asm/pgalloc-2level.h> -#endif +#define pmd_populate(pmd, pte) \ + set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte))) /* - * Allocate and free page tables. 
The xxx_kernel() versions are - * used to allocate a kernel page table - this turns on ASN bits - * if any. + * Allocate and free page tables. */ +#if CONFIG_X86_PAE + +extern void *kmalloc(size_t, int); +extern void kfree(const void *); + extern __inline__ pgd_t *get_pgd_slow(void) { - pgd_t *ret = (pgd_t *)__get_free_page(GFP_KERNEL); + int i; + pgd_t *pgd = kmalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL); + + if (pgd) { + for (i = 0; i < USER_PTRS_PER_PGD; i++) { + unsigned long pmd = __get_free_page(GFP_KERNEL); + if (!pmd) + goto out_oom; + clear_page(pmd); + set_pgd(pgd + i, __pgd(1 + __pa(pmd))); + } + memcpy(pgd + USER_PTRS_PER_PGD, swapper_pg_dir + USER_PTRS_PER_PGD, (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t)); + } + return pgd; +out_oom: + for (i--; i >= 0; i--) + free_page((unsigned long)__va(pgd_val(pgd[i])-1)); + kfree(pgd); + return NULL; +} - if (ret) { -#if CONFIG_X86_PAE - int i; - for (i = 0; i < USER_PTRS_PER_PGD; i++) - __pgd_clear(ret + i); #else - memset(ret, 0, USER_PTRS_PER_PGD * sizeof(pgd_t)); -#endif - memcpy(ret + USER_PTRS_PER_PGD, swapper_pg_dir + USER_PTRS_PER_PGD, (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t)); + +extern __inline__ pgd_t *get_pgd_slow(void) +{ + pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL); + + if (pgd) { + memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t)); + memcpy(pgd + USER_PTRS_PER_PGD, swapper_pg_dir + USER_PTRS_PER_PGD, (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t)); } - return ret; + return pgd; } +#endif + extern __inline__ pgd_t *get_pgd_fast(void) { unsigned long *ret; @@ -64,7 +83,15 @@ extern __inline__ void free_pgd_slow(pgd_t *pgd) { +#if CONFIG_X86_PAE + int i; + + for (i = 0; i < USER_PTRS_PER_PGD; i++) + free_page((unsigned long)__va(pgd_val(pgd[i])-1)); + kfree(pgd); +#else free_page((unsigned long)pgd); +#endif } static inline pte_t *pte_alloc_one(void) @@ -72,7 +99,8 @@ pte_t *pte; pte = (pte_t *) __get_free_page(GFP_KERNEL); - clear_page(pte); + if (pte) + 
clear_page(pte); return pte; } @@ -100,7 +128,6 @@ free_page((unsigned long)pte); } -#define pte_free_kernel(pte) pte_free_slow(pte) #define pte_free(pte) pte_free_slow(pte) #define pgd_free(pgd) free_pgd_slow(pgd) #define pgd_alloc() get_pgd_fast() @@ -108,12 +135,15 @@ /* * allocating and freeing a pmd is trivial: the 1-entry pmd is * inside the pgd, so has no extra memory associated with it. - * (In the PAE case we free the page.) + * (In the PAE case we free the pmds as part of the pgd.) */ -#define pmd_free_one(pmd) free_pmd_slow(pmd) -#define pmd_free_kernel pmd_free -#define pmd_alloc_kernel pmd_alloc +#define pmd_alloc_one_fast() ({ BUG(); ((pmd_t *)1); }) +#define pmd_alloc_one() ({ BUG(); ((pmd_t *)2); }) +#define pmd_free_slow(x) do { } while (0) +#define pmd_free_fast(x) do { } while (0) +#define pmd_free(x) do { } while (0) +#define pgd_populate(pmd, pte) BUG() extern int do_check_pgt_cache(int, int); --- linux/include/asm-i386/pgtable.h.orig Sun Mar 25 19:03:58 2001 +++ linux/include/asm-i386/pgtable.h Sun Mar 25 23:34:02 2001 @@ -243,12 +243,6 @@ /* page table for 0-4MB for everybody */ extern unsigned long pg0[1024]; -/* - * Handling allocation failures during page table setup. - */ -extern void __handle_bad_pmd(pmd_t * pmd); -extern void __handle_bad_pmd_kernel(pmd_t * pmd); - #define pte_present(x) ((x).pte_low & (_PAGE_PRESENT | _PAGE_PROTNONE)) #define pte_clear(xp) do { set_pte(xp, __pte(0)); } while (0) --- linux/include/asm-i386/pgalloc-2level.h.orig Sun Mar 25 22:36:30 2001 +++ linux/include/asm-i386/pgalloc-2level.h Sun Mar 25 22:38:47 2001 @@ -1,16 +0,0 @@ -#ifndef _I386_PGALLOC_2LEVEL_H -#define _I386_PGALLOC_2LEVEL_H - -/* - * traditional i386 two-level paging, page table allocation routines: - * We don't have any real pmd's, and this code never triggers because - * the pgd will always be present.. 
- */ -#define pmd_alloc_one_fast() ({ BUG(); ((pmd_t *)1); }) -#define pmd_alloc_one() ({ BUG(); ((pmd_t *)2); }) -#define pmd_free_slow(x) do { } while (0) -#define pmd_free_fast(x) do { } while (0) -#define pmd_free(x) do { } while (0) -#define pgd_populate(pmd, pte) BUG() - -#endif /* _I386_PGALLOC_2LEVEL_H */ --- linux/include/asm-i386/pgtable-3level.h.orig Sun Mar 25 23:22:35 2001 +++ linux/include/asm-i386/pgtable-3level.h Sun Mar 25 23:28:41 2001 @@ -33,17 +33,9 @@ #define pgd_ERROR(e) \ printk("%s:%d: bad pgd %p(%016Lx).\n", __FILE__, __LINE__, &(e), pgd_val(e)) -/* - * Subtle, in PAE mode we cannot have zeroes in the top level - * page directory, the CPU enforces this. (ie. the PGD entry - * always has to have the present bit set.) The CPU caches - * the 4 pgd entries internally, so there is no extra memory - * load on TLB miss, despite one more level of indirection. - */ -#define EMPTY_PGD (__pa(empty_zero_page) + 1) -#define pgd_none(x) (pgd_val(x) == EMPTY_PGD) +extern inline int pgd_none(pgd_t pgd) { return 0; } extern inline int pgd_bad(pgd_t pgd) { return 0; } -extern inline int pgd_present(pgd_t pgd) { return !pgd_none(pgd); } +extern inline int pgd_present(pgd_t pgd) { return 1; } /* Rules for using set_pte: the pte being assigned *must* be * either not present or in a state where the hardware will @@ -63,21 +55,12 @@ set_64bit((unsigned long long *)(pgdptr),pgd_val(pgdval)) /* - * Pentium-II errata A13: in PAE mode we explicitly have to flush - * the TLB via cr3 if the top-level pgd is changed... This was one tough - * thing to find out - guess i should first read all the documentation - * next time around ;) + * Pentium-II erratum A13: in PAE mode we explicitly have to flush + * the TLB via cr3 if the top-level pgd is changed... + * We do not let the generic code free and clear pgd entries due to + * this erratum. 
*/ -extern inline void __pgd_clear (pgd_t * pgd) -{ - set_pgd(pgd, __pgd(EMPTY_PGD)); -} - -extern inline void pgd_clear (pgd_t * pgd) -{ - __pgd_clear(pgd); - __flush_tlb(); -} +extern inline void pgd_clear (pgd_t * pgd) { } #define pgd_page(pgd) \ ((unsigned long) __va(pgd_val(pgd) & PAGE_MASK)) --- linux/arch/i386/mm/init.c.orig Sun Mar 25 19:02:05 2001 +++ linux/arch/i386/mm/init.c Sun Mar 25 23:33:37 2001 @@ -40,77 +40,6 @@ static unsigned long totalram_pages; static unsigned long totalhigh_pages; -/* - * BAD_PAGE is the page that is used for page faults when linux - * is out-of-memory. Older versions of linux just did a - * do_exit(), but using this instead means there is less risk - * for a process dying in kernel mode, possibly leaving an inode - * unused etc.. - * - * BAD_PAGETABLE is the accompanying page-table: it is initialized - * to point to BAD_PAGE entries. - * - * ZERO_PAGE is a special page that is used for zero-initialized - * data and COW. - */ - -/* - * These are allocated in head.S so that we get proper page alignment. - * If you change the size of these then change head.S as well. - */ -extern char empty_bad_page[PAGE_SIZE]; -#if CONFIG_X86_PAE -extern pmd_t empty_bad_pmd_table[PTRS_PER_PMD]; -#endif -extern pte_t empty_bad_pte_table[PTRS_PER_PTE]; - -/* - * We init them before every return and make them writable-shared. - * This guarantees we get out of the kernel in some more or less sane - * way. 
- */ -#if CONFIG_X86_PAE -static pmd_t * get_bad_pmd_table(void) -{ - pmd_t v; - int i; - - set_pmd(&v, __pmd(_PAGE_TABLE + __pa(empty_bad_pte_table))); - - for (i = 0; i < PAGE_SIZE/sizeof(pmd_t); i++) - empty_bad_pmd_table[i] = v; - - return empty_bad_pmd_table; -} -#endif - -static pte_t * get_bad_pte_table(void) -{ - pte_t v; - int i; - - v = pte_mkdirty(mk_pte_phys(__pa(empty_bad_page), PAGE_SHARED)); - - for (i = 0; i < PAGE_SIZE/sizeof(pte_t); i++) - empty_bad_pte_table[i] = v; - - return empty_bad_pte_table; -} - - - -void __handle_bad_pmd(pmd_t *pmd) -{ - pmd_ERROR(*pmd); - set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(get_bad_pte_table()))); -} - -void __handle_bad_pmd_kernel(pmd_t *pmd) -{ - pmd_ERROR(*pmd); - set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(get_bad_pte_table()))); -} - int do_check_pgt_cache(int low, int high) { int freed = 0; @@ -289,10 +218,8 @@ pgd_base = swapper_pg_dir; #if CONFIG_X86_PAE - for (i = 0; i < PTRS_PER_PGD; i++) { - pgd = pgd_base + i; - __pgd_clear(pgd); - } + for (i = 0; i < PTRS_PER_PGD; i++) + set_pgd(pgd_base + i, __pgd(1 + __pa(empty_zero_page))); #endif i = __pgd_offset(PAGE_OFFSET); pgd = pgd_base + i; --- linux/arch/i386/kernel/head.S.orig Sun Mar 25 19:04:23 2001 +++ linux/arch/i386/kernel/head.S Sun Mar 25 19:04:57 2001 @@ -415,23 +415,6 @@ ENTRY(empty_zero_page) .org 0x5000 -ENTRY(empty_bad_page) - -.org 0x6000 -ENTRY(empty_bad_pte_table) - -#if CONFIG_X86_PAE - - .org 0x7000 - ENTRY(empty_bad_pmd_table) - - .org 0x8000 - -#else - - .org 0x7000 - -#endif /* * This starts the data section. Note that the above is all --- linux/MAINTAINERS.orig Sun Mar 25 19:23:13 2001 +++ linux/MAINTAINERS Sun Mar 25 19:24:46 2001 @@ -1454,6 +1454,11 @@ L: linux-x25@vger.kernel.org S: Maintained +X86 3-LEVEL PAGING (PAE) SUPPORT +P: Ingo Molnar +M: mingo@redhat.com +S: Maintained + Z85230 SYNCHRONOUS DRIVER P: Alan Cox M: alan@redhat.com ^ permalink raw reply [flat|nested] 24+ messages in thread
Thread overview: 24+ messages
[not found] <Pine.LNX.4.33.0103191802330.2076-100000@mikeg.weiden.de>
2001-03-20 1:56 ` 3rd version of R/W mmap_sem patch available Rik van Riel
2001-03-19 22:46 ` Linus Torvalds
2001-03-20 2:46 ` Linus Torvalds
2001-03-20 4:15 ` Marcelo Tosatti
2001-03-20 6:07 ` Linus Torvalds
2001-03-20 4:29 ` Marcelo Tosatti
2001-03-20 6:36 ` Linus Torvalds
2001-03-20 7:03 ` Linus Torvalds
2001-03-20 8:19 ` Eric W. Biederman
2001-03-20 15:11 ` Andrew Morton
2001-03-20 15:15 ` Jeff Garzik
2001-03-20 15:16 ` Jeff Garzik
2001-03-20 15:31 ` Andrew Morton
2001-03-21 1:59 ` Eric W. Biederman
2001-03-20 16:08 ` Linus Torvalds
2001-03-20 16:33 ` Andi Kleen
2001-03-20 17:13 ` Linus Torvalds
2001-03-20 19:33 ` Rik van Riel
2001-03-20 22:51 ` Andrew Morton
2001-03-22 10:24 ` Andrew Morton
2001-03-25 14:53 ` [patch] pae-2.4.3-A4 Ingo Molnar
2001-03-25 16:33 ` Russell King
2001-03-25 18:07 ` Linus Torvalds
2001-03-25 18:51 ` Ingo Molnar