PATCH [2.4.0test10]: Kiobuf#02, fault-in fix

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
@ 2000-11-02 13:40 Stephen C. Tweedie
  2000-11-02 14:30 ` Jeff Garzik
  2000-11-03 22:27 ` Andrea Arcangeli
  0 siblings, 2 replies; 16+ messages in thread
From: Stephen C. Tweedie @ 2000-11-02 13:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, Ingo Molnar, Stephen Tweedie, linux-mm

[-- Attachment #1: Type: text/plain, Size: 293 bytes --]

Hi,

Next part of the kiobuf diffs: fix the fact that handle_mm_fault
doesn't guarantee to complete the operation in all cases; doesn't
guarantee that the resulting pte is writable if write access was
requested; and doesn't pin the page against immediately being swapped
back out.

--Stephen


[-- Attachment #2: 02-faultfix.diff --]
[-- Type: text/plain, Size: 1586 bytes --]

Only in linux-2.4.0-test10.kio.01/drivers/char: raw.c.~1~
Only in linux-2.4.0-test10.kio.01/fs: buffer.c.~1~
Only in linux-2.4.0-test10.kio.01/fs: iobuf.c.~1~
Only in linux-2.4.0-test10.kio.01/include/linux: iobuf.h.~1~
diff -ru linux-2.4.0-test10.kio.01/mm/memory.c linux-2.4.0-test10.kio.02/mm/memory.c
--- linux-2.4.0-test10.kio.01/mm/memory.c	Thu Nov  2 11:59:11 2000
+++ linux-2.4.0-test10.kio.02/mm/memory.c	Thu Nov  2 12:39:16 2000
@@ -384,7 +384,7 @@
 /*
  * Do a quick page-table lookup for a single page. 
  */
-static struct page * follow_page(unsigned long address) 
+static struct page * follow_page(unsigned long address, int write) 
 {
 	pgd_t *pgd;
 	pmd_t *pmd;
@@ -394,7 +394,8 @@
 	if (pmd) {
 		pte_t * pte = pte_offset(pmd, address);
 		if (pte && pte_present(*pte))
-			return pte_page(*pte);
+			if (!write || pte_write(*pte))
+				return pte_page(*pte);
 	}
 	
 	return NULL;
@@ -474,11 +475,14 @@
 		if (handle_mm_fault(current->mm, vma, ptr, datain) <= 0) 
 			goto out_unlock;
 		spin_lock(&mm->page_table_lock);
-		map = follow_page(ptr);
+		map = follow_page(ptr, datain);
 		if (!map) {
+			/* If handle_mm_fault did not complete the
+                           operation, or if we hit the swapout race
+                           before taking the page_table_lock, just try
+                           again on this page. */
 			spin_unlock(&mm->page_table_lock);
-			dprintk (KERN_ERR "Missing page in map_user_kiobuf\n");
-			goto out_unlock;
+			continue;
 		}
 		map = get_page_map(map);
 		if (map)
Only in linux-2.4.0-test10.kio.01/mm: memory.c.~1~

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
  2000-11-02 13:40 PATCH [2.4.0test10]: Kiobuf#02, fault-in fix Stephen C. Tweedie
@ 2000-11-02 14:30 ` Jeff Garzik
  2000-11-02 15:58   ` Stephen C. Tweedie
  2000-11-03 22:27 ` Andrea Arcangeli
  1 sibling, 1 reply; 16+ messages in thread
From: Jeff Garzik @ 2000-11-02 14:30 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Linus Torvalds, Rik van Riel, Ingo Molnar, linux-mm

"Stephen C. Tweedie" wrote:
> Next part of the kiobuf diffs: fix the fact that handle_mm_fault
> doesn't guarantee to complete the operation in all cases; doesn't
> guarantee that the resulting pte is writable if write access was
> requested; and doesn't pin the page against immediately being swapped
> back out.

Dumb question time, if you don't mind.  :)  All code examples are from
mm/memory.c.

This seems to imply datain means 'read access':
	int datain = (rw == READ);

This seems to further imply datain means 'read access':
	if (((datain) && (!(vma->vm_flags & VM_WRITE))) ||

And then we pass 'datain' as the 'write_access' arg of handle_mm_fault:
	if (handle_mm_fault(current->mm, vma, ptr, datain) <= 0)

Further down in make_pages_present, there seems to be the opposite:
	write = (vma->vm_flags & VM_WRITE) != 0;
	[...]
	if (handle_mm_fault(mm, vma, addr, write) < 0)


So, why do we pass 'datain' as the 'write_access' arg of
handle_mm_fault? 
(and now in your 02-faultfix.diff patch, as the 'write' arg of
follow_page)

Regards,

	Jeff


-- 
Jeff Garzik             | Dinner is ready when
Building 1024           | the smoke alarm goes off.
MandrakeSoft            |	-/usr/games/fortune
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
  2000-11-02 14:30 ` Jeff Garzik
@ 2000-11-02 15:58   ` Stephen C. Tweedie
  2000-11-04  1:28     ` Eric Lowe
  0 siblings, 1 reply; 16+ messages in thread
From: Stephen C. Tweedie @ 2000-11-02 15:58 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Stephen C. Tweedie, Linus Torvalds, Rik van Riel, Ingo Molnar, linux-mm

Hi,

On Thu, Nov 02, 2000 at 09:30:10AM -0500, Jeff Garzik wrote:
> 
> Dumb question time, if you don't mind.  :)  All code examples are from
> mm/memory.c.
> 
> This seems to imply datain means 'read access':
> 	int datain = (rw == READ);
> 
> And then we pass 'datain' as the 'write_access' arg of handle_mm_fault:
> 	if (handle_mm_fault(current->mm, vma, ptr, datain) <= 0)

Yes.  The kernel often has to make these checks the non-intuitive way
round, because a disk or network read IO actually involves write to
memory, but a write IO only has to read from memory.  The convention
is that read/write flags which affect IO paths indicate whether we are
writing from backing store, so we have to invert the sense to decide
whether it's a write to memory.

> This seems to further imply datain means 'read access':
> 	if (((datain) && (!(vma->vm_flags & VM_WRITE))) ||

No, because the next line is
				err = -EACCES;
so (rw==READ) and !VM_WRITE is an error --- datain does imply write
access to memory.

Cheers,
 Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
  2000-11-02 13:40 PATCH [2.4.0test10]: Kiobuf#02, fault-in fix Stephen C. Tweedie
  2000-11-02 14:30 ` Jeff Garzik
@ 2000-11-03 22:27 ` Andrea Arcangeli
  2000-11-04  1:36   ` Eric Lowe
  2000-11-06 15:05   ` Stephen C. Tweedie
  1 sibling, 2 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2000-11-03 22:27 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Linus Torvalds, Rik van Riel, Ingo Molnar, linux-mm

On Thu, Nov 02, 2000 at 01:40:21PM +0000, Stephen C. Tweedie wrote:
> +			if (!write || pte_write(*pte))

You should check pte is dirty, not only writeable.

>  		if (handle_mm_fault(current->mm, vma, ptr, datain) <= 0) 
>  			goto out_unlock;
>  		spin_lock(&mm->page_table_lock);
> -		map = follow_page(ptr);
> +		map = follow_page(ptr, datain);

Here you should _first_ follow_page and do handle_mm_fault _only_ if the pte is
not ok. This way only during first pagein we'll walk the pagetables two times,
all the other times we'll walk pagetables only once just to check that the
mapping is still there.

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
  2000-11-02 15:58   ` Stephen C. Tweedie
@ 2000-11-04  1:28     ` Eric Lowe
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Lowe @ 2000-11-04  1:28 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Jeff Garzik, Linus Torvalds, Rik van Riel, Ingo Molnar, linux-mm

Hello,

> Yes.  The kernel often has to make these checks the non-intuitive way
> round, because a disk or network read IO actually involves write to
> memory, but a write IO only has to read from memory.  The convention
> is that read/write flags which affect IO paths indicate whether we are
> writing from backing store, so we have to invert the sense to decide
> whether it's a write to memory.
> 
> > This seems to further imply datain means 'read access':
> > 	if (((datain) && (!(vma->vm_flags & VM_WRITE))) ||
> 
> No, because the next line is
> 				err = -EACCES;
> so (rw==READ) and !VM_WRITE is an error --- datain does imply write
> access to memory.

That's why I call it write_access in my patches instead:
there's no ambiguity about what we mean. :)  But in any 
case, it's much better than using (rw==READ) everywhere like
it used to be ...

--
Eric Lowe
Software Engineer, Systran Corporation
elowe@systran.com


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
  2000-11-03 22:27 ` Andrea Arcangeli
@ 2000-11-04  1:36   ` Eric Lowe
  2000-11-04  2:07     ` Andrea Arcangeli
  2000-11-06 15:05   ` Stephen C. Tweedie
  1 sibling, 1 reply; 16+ messages in thread
From: Eric Lowe @ 2000-11-04  1:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Linus Torvalds, Rik van Riel, Ingo Molnar, linux-mm

Hello,

> On Thu, Nov 02, 2000 at 01:40:21PM +0000, Stephen C. Tweedie wrote:
> > +			if (!write || pte_write(*pte))
>s
> You should check pte is dirty, not only writeable.
> 
> >  		if (handle_mm_fault(current->mm, vma, ptr, datain) <= 0) 
> >  			goto out_unlock;
> >  		spin_lock(&mm->page_table_lock);
> > -		map = follow_page(ptr);
> > +		map = follow_page(ptr, datain);
> 
> Here you should _first_ follow_page and do handle_mm_fault _only_ if the pte is
> not ok. This way only during first pagein we'll walk the pagetables two times,
> all the other times we'll walk pagetables only once just to check that the
> mapping is still there.
> 

Andrea,
I agree with you on this one, and in fact, my 2.2 patches already
do both these things.

Keep in mind this is for 2.2, not 2.4, need to merge my fixes
with what we have so far.  Do you agree this algorithm is correct?
(or at least closer):
+static inline int fault_page_in(struct vm_area_struct * vma,
+	unsigned long address, int write_access, pte_t *pte)
+{
+	int ret;
+
+	if (pte_present(*pte)) {
+		if (write_access && pte_write(*pte))
+			return 1;
+	}
+
+	ret = handle_pte_fault(current, vma, address, write_access, pte);
+	if (ret > 0)
+		update_mmu_cache(vma, address, *pte);
+	return (ret > 0);
+}
+

[...]

+	/*
+	 * First of all, try to fault in all of the necessary pages
+	 */
+	while (ptr < end) {
+		if (!vma || ptr >= vma->vm_end) {
+			vma = find_vma(current->mm, ptr);
+			if (!vma)
+				goto out_unlock;
+		}
+		pte = get_pte(vma, ptr);
+		if (!pte)
+			goto out_unlock;
+
+		if (!fault_page_in(vma, ptr, write_access, pte))
+			goto out_unlock;
+
+		page = pte_page(*pte);
+		if (!page) {
+			err = -EAGAIN;
+			goto out_unlock;
+		}
+		map = get_page_map(page);
+		if (map) {
+			if (TryLockPage(map))
+				goto retry;
+			atomic_inc(&map->count);
+			set_bit(PG_dirty, &map->flags);
+		}
+
+		if (!pte_present(*pte) || (write_access && !pte_write(*pte))) {
+			err = -EAGAIN;
+			goto out_unlock;
+		}
+		if (write_access && !pte_dirty(*pte))
+			panic("map_user_kiobuf: writable page w/o dirty pte\n");
+
+		dprintk ("Installing page %p %p: %d\n", (void *)page, map, i);
+		iobuf->pagelist[i] = page;
+		iobuf->maplist[i] = map;
+		iobuf->nr_pages = ++i;
+
+		ptr += PAGE_SIZE;
+	}
+

In my tests, the if (write_access && !pte_dirty(*pte)) never tripped,
but I don't know enough to know if this is valid or if we need
to do something else to guarantee the pte is dirty if it's writable?

--
Eric Lowe
Software Engineer, Systran Corporation
elowe@systran.com


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
  2000-11-04  1:36   ` Eric Lowe
@ 2000-11-04  2:07     ` Andrea Arcangeli
  0 siblings, 0 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2000-11-04  2:07 UTC (permalink / raw)
  To: Eric Lowe
  Cc: Stephen C. Tweedie, Linus Torvalds, Rik van Riel, Ingo Molnar, linux-mm

On Fri, Nov 03, 2000 at 08:36:08PM -0500, Eric Lowe wrote:
> I agree with you on this one, and in fact, my 2.2 patches already
> do both these things.

My one against 2.2.x is here:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.2/2.2.18pre17aa1/13_bigmem-rawio-2.2.18pre17aa1-5.bz2

It fix several bugs (not only the ones that you attempted to fix).  Since your
patch seems to have several problems I suggest you to base on my one (you
may need to fix some reject to apply to clean 2.2.x though).

> +	while (ptr < end) {
> +		if (!vma || ptr >= vma->vm_end) {
> +			vma = find_vma(current->mm, ptr);
> +			if (!vma)
> +				goto out_unlock;
> +		}

Here you miss the check for the vm_start and vma flags.

> +		pte = get_pte(vma, ptr);
> +		if (!pte)
> +			goto out_unlock;
> +
> +		if (!fault_page_in(vma, ptr, write_access, pte))
> +			goto out_unlock;

So your fault_page_in should also check that the pte is dirty if
writing to memory.

> +		if (map) {
> +			if (TryLockPage(map))
> +				goto retry;
> +			atomic_inc(&map->count);
> +			set_bit(PG_dirty, &map->flags);
> +		}
> +

This doesn't fix the MM corruption (obviously since PG_dirty doesn't mean
_anything_ in 2.2.x, and even if it would mean something like in 2.4.x you
would need to rework core parts of the memory balancing to solve the MM
corruption that way).  Note also that so far it was legal to do rawio on
MAP_SHARED so I must preserve that semantics at least in the 2.2.x short term
in case somebody was depending on it in previous aa kernels. So right now I'm
locking down the pages during writes to memory. The _only_ real world downside
is that you can't write to the same page from two tasks at the same time but
_nobody_ really cares about that so unlikely corner case in real life.

> +		if (!pte_present(*pte) || (write_access && !pte_write(*pte))) {
> +			err = -EAGAIN;
> +			goto out_unlock;
> +		}

This isn't necessary. If it would be necessary it would be wrong to do it here
after you just grabbed the page reference.

> +		if (write_access && !pte_dirty(*pte))
> +			panic("map_user_kiobuf: writable page w/o dirty pte\n");

As said above this case should be handled by follow_page.  You shouldn't panic
but re-enter the page fault handler.

All those problems should be just fixed properly in the rawio patch in the aa
patchkit (and my stress stess is now happy on all kind of vmas). Please use
it and let me know if you have any problem or you see any bug. thanks!

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
  2000-11-03 22:27 ` Andrea Arcangeli
  2000-11-04  1:36   ` Eric Lowe
@ 2000-11-06 15:05   ` Stephen C. Tweedie
  2000-11-06 16:12     ` Andrea Arcangeli
  2000-11-06 17:23     ` Linus Torvalds
  1 sibling, 2 replies; 16+ messages in thread
From: Stephen C. Tweedie @ 2000-11-06 15:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Linus Torvalds, Rik van Riel, Ingo Molnar, linux-mm

Hi,

On Fri, Nov 03, 2000 at 11:27:21PM +0100, Andrea Arcangeli wrote:
> On Thu, Nov 02, 2000 at 01:40:21PM +0000, Stephen C. Tweedie wrote:
> > +			if (!write || pte_write(*pte))
> 
> You should check pte is dirty, not only writeable.

Why?

> > -		map = follow_page(ptr);
> > +		map = follow_page(ptr, datain);
> 
> Here you should _first_ follow_page and do handle_mm_fault _only_ if the pte is
> not ok.

Agreed --- I'll push that as a performace diff to Linus once the
essential bug-fixes are in.

Cheers,
 Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
  2000-11-06 15:05   ` Stephen C. Tweedie
@ 2000-11-06 16:12     ` Andrea Arcangeli
  2000-11-06 16:54       ` Stephen C. Tweedie
  2000-11-06 17:23     ` Linus Torvalds
  1 sibling, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2000-11-06 16:12 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Linus Torvalds, Rik van Riel, Ingo Molnar, linux-mm

On Mon, Nov 06, 2000 at 03:05:39PM +0000, Stephen C. Tweedie wrote:
> Why?

I think to avoid losing a write.


	handle_mm_fault()
	pte is dirty
					pager write it out and make it clean
					since it's not pinned on the
					physical side yet so it's allowed
	grab pagetable lock
	follow_page()
	pte is writeable but not dirty
	pin the page on the physical side to inibith the swapper
	unlock the pagetable lock

	read from disk and write to memory

	now the pte is clean and the page won't be synced back while
	closing the file or during msync

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
  2000-11-06 16:12     ` Andrea Arcangeli
@ 2000-11-06 16:54       ` Stephen C. Tweedie
  2000-11-06 22:34         ` Andrea Arcangeli
  0 siblings, 1 reply; 16+ messages in thread
From: Stephen C. Tweedie @ 2000-11-06 16:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Linus Torvalds, Rik van Riel, Ingo Molnar, linux-mm

Hi,

On Mon, Nov 06, 2000 at 05:12:04PM +0100, Andrea Arcangeli wrote:
> On Mon, Nov 06, 2000 at 03:05:39PM +0000, Stephen C. Tweedie wrote:
> > Why?
> 
> 	handle_mm_fault()
> 	pte is dirty
> 					pager write it out and make it clean
> 					since it's not pinned on the
> 					physical side yet so it's allowed
> 	grab pagetable lock
> 	follow_page()
> 	pte is writeable but not dirty
> 	pin the page on the physical side to inibith the swapper
> 	unlock the pagetable lock
> 
> 	read from disk and write to memory
> 
> 	now the pte is clean and the page won't be synced back while
> 	closing the file or during msync

No.  Even if the page were dirty before we started the IO, it could be
cleaned during the IO.  The whole problem with the interaction between
the VM and the pages concerned has been that we need to mark the
physical pages dirty at the *end* of the IO, not at the beginning ---
and that we don't necessarily have the same mapping information once
the IO has complete (another thread may have unmapped the vma
entirely).

The patches I sent to Linus dirty the page physically once the write
to memory has completed, completely independently of the ptes.  The
one piece of that missing is the handling of PageDirty() on anonymous
pages --- Rik was going to deal with that.

Checking for page dirty when we create the mapping in the first place
is neither necessary nor sufficient.

Cheers,
 Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
  2000-11-06 15:05   ` Stephen C. Tweedie
  2000-11-06 16:12     ` Andrea Arcangeli
@ 2000-11-06 17:23     ` Linus Torvalds
  2000-11-07 11:57       ` Stephen C. Tweedie
  2000-11-08 12:31       ` Stephen C. Tweedie
  1 sibling, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2000-11-06 17:23 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Andrea Arcangeli, Rik van Riel, Ingo Molnar, linux-mm

On Mon, 6 Nov 2000, Stephen C. Tweedie wrote:
> 
> > > -		map = follow_page(ptr);
> > > +		map = follow_page(ptr, datain);
> > 
> > Here you should _first_ follow_page and do handle_mm_fault _only_ if the pte is
> > not ok.
> 
> Agreed --- I'll push that as a performace diff to Linus once the
> essential bug-fixes are in.

I would _really_ want to see follow_page() just cleaned up altogether.

We should NOT have code that messes with combinations of
"handle_mm_fault()" and "follow_page()" at all.

We should just change the page followers (do_no_page() and friends) to
return the "struct page" directly, instead of returning an "int". Then
we'd have something on the order of

struct page * follow_page(struct mm_struct *mm, struct vm_area_struct * vma,
        unsigned long address, int write_access)
{
	pgd_t *pgd;
	pmd_t *pmd;

	pgd = pgd_offset(mm, address);
	pmd = pmd_alloc(pgd, address);

	if (pmd) {
		pte_t * pte = pte_alloc(pmd, address);
		if (pte) {
			struct page * page = handle_pte_fault(mm, vma, address, write_access, pte);
			if (page)
				return page;
		}
	}
	return NULL;
}

and just a simple

int handle_pte_fault(struct mm_struct *mm, ...
{
	struct page * page = follow_page(..);
	if (!IS_ERR(page)) {
		page_cache_release(page);	/* it's in the page tables */
		return 1;
	}
	return PTR_ERR(page);
}

and you' dbe done with it. 

Yes, I realize that we need to do the min_flt/maj_flt stuff too, and that
we'd need to tweak the return codes for sigbus/oom instead of having the
current 0 == SIGBUS, -1 == OOM magic, but that would actually clean things
up and would allow us to return proper errors on page faults (like
indicating whether it was due to ENOMEM or due to EIO or due to some other
reason like EPERM that we couldn't handle the page fault).

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
  2000-11-06 16:54       ` Stephen C. Tweedie
@ 2000-11-06 22:34         ` Andrea Arcangeli
  2000-11-07 11:17           ` Stephen C. Tweedie
  0 siblings, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2000-11-06 22:34 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Linus Torvalds, Rik van Riel, Ingo Molnar, linux-mm

On Mon, Nov 06, 2000 at 04:54:16PM +0000, Stephen C. Tweedie wrote:
> [..] The
> one piece of that missing [..]

Ok, I was just looking the context of your diff.

About the implementation of the missing VM infrastructure for handling page
dirty at the physical pagecache layer, I'd suggest to change ramfs to use a new
PG_protected bitfield with the current semantics of PG_dirty, and to use
PG_dirty for the stuff that we must flush to disk.

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
  2000-11-06 22:34         ` Andrea Arcangeli
@ 2000-11-07 11:17           ` Stephen C. Tweedie
  0 siblings, 0 replies; 16+ messages in thread
From: Stephen C. Tweedie @ 2000-11-07 11:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Linus Torvalds, Rik van Riel, Ingo Molnar, linux-mm

Hi,

On Mon, Nov 06, 2000 at 11:34:57PM +0100, Andrea Arcangeli wrote:
> On Mon, Nov 06, 2000 at 04:54:16PM +0000, Stephen C. Tweedie wrote:
> 
> About the implementation of the missing VM infrastructure for handling page
> dirty at the physical pagecache layer, I'd suggest to change ramfs to use a new
> PG_protected bitfield with the current semantics of PG_dirty, and to use
> PG_dirty for the stuff that we must flush to disk.

PG_dirty works for some cases.  In particular, it works for any
filesystems which can safely ignore the struct file * in the writepage
address_space operation.  However, for things like NFS, we cannot ever
to arbitrary writeback to a file from the page cache --- we need the
user context of the original write in order to establish the
credentials for the server operation.

That's why my current bug-fix patch just does the writepage at the end
of the raw IO: it's a general fix which works for all mmap types.
Once that is in place, we can think about extending it so that
filesystems can provide a separate method for "flush" which honurs
PG_dirty.  For filesystems with such a flush method, marking a kiobuf
dirty would simply involve setting PG_dirty, but for others (such as
NFS) the mark_kiobuf_dirty would still have to do the full early
writepage.

Cheers,
 Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
  2000-11-06 17:23     ` Linus Torvalds
@ 2000-11-07 11:57       ` Stephen C. Tweedie
  2000-11-07 13:37         ` Andrea Arcangeli
  2000-11-08 12:31       ` Stephen C. Tweedie
  1 sibling, 1 reply; 16+ messages in thread
From: Stephen C. Tweedie @ 2000-11-07 11:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Andrea Arcangeli, Rik van Riel, Ingo Molnar,
	linux-mm

Hi,

On Mon, Nov 06, 2000 at 09:23:38AM -0800, Linus Torvalds wrote:
 
> I would _really_ want to see follow_page() just cleaned up altogether.
> 
> We should NOT have code that messes with combinations of
> "handle_mm_fault()" and "follow_page()" at all.

Is this a 2.5 cleanup or do you want things rearranged in the 2.4
bugfix too?

Cheers,
 Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
  2000-11-07 11:57       ` Stephen C. Tweedie
@ 2000-11-07 13:37         ` Andrea Arcangeli
  0 siblings, 0 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2000-11-07 13:37 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Linus Torvalds, Rik van Riel, Ingo Molnar, linux-mm

On Tue, Nov 07, 2000 at 11:57:44AM +0000, Stephen C. Tweedie wrote:
> Is this a 2.5 cleanup or do you want things rearranged in the 2.4
> bugfix too?

I'm sorry but I've not understood exactly the suggestion (the shown pseudocode
will stack overflow btw).

I don't think returning the page gives advantages.  The point here is the
locking. We need to do this atomically (with the spinlock acquired):

	spin_lock(&mm->page_table_lock);
	check the pte is ok
	get_page(page);
	spin_unlock(&mm->page_table_lock);

The above is not necessary for any real page fault. That's needed only by
map_user_kiobuf that must atomically (atomically w.r.t. swap_out) pin the
physical page. IMHO it would be silly to add the locking and a get_page() (plus
a put_page after the page is returned to the page fault arch code) inside the
common page fault handler just to skip a walk of the pagetables for the case
where rawio is accessing a not correctly mapped page.

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PATCH [2.4.0test10]: Kiobuf#02, fault-in fix
  2000-11-06 17:23     ` Linus Torvalds
  2000-11-07 11:57       ` Stephen C. Tweedie
@ 2000-11-08 12:31       ` Stephen C. Tweedie
  1 sibling, 0 replies; 16+ messages in thread
From: Stephen C. Tweedie @ 2000-11-08 12:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Andrea Arcangeli, Rik van Riel, Ingo Molnar,
	linux-mm

Hi,

On Mon, Nov 06, 2000 at 09:23:38AM -0800, Linus Torvalds wrote:

> We should just change the page followers (do_no_page() and friends) to
> return the "struct page" directly, instead of returning an "int".

Even, as Andrea pointed out, if the cost is that we have to do two
extra atomic ops to bump the page count inside do_*_page and drop
it again in the fault handler?

I'll do it, but not if we later decide that the cost isn't worth it
and have to throw the code out.

Cheers,
 Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2000-11-08 12:31 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-11-02 13:40 PATCH [2.4.0test10]: Kiobuf#02, fault-in fix Stephen C. Tweedie
2000-11-02 14:30 ` Jeff Garzik
2000-11-02 15:58   ` Stephen C. Tweedie
2000-11-04  1:28     ` Eric Lowe
2000-11-03 22:27 ` Andrea Arcangeli
2000-11-04  1:36   ` Eric Lowe
2000-11-04  2:07     ` Andrea Arcangeli
2000-11-06 15:05   ` Stephen C. Tweedie
2000-11-06 16:12     ` Andrea Arcangeli
2000-11-06 16:54       ` Stephen C. Tweedie
2000-11-06 22:34         ` Andrea Arcangeli
2000-11-07 11:17           ` Stephen C. Tweedie
2000-11-06 17:23     ` Linus Torvalds
2000-11-07 11:57       ` Stephen C. Tweedie
2000-11-07 13:37         ` Andrea Arcangeli
2000-11-08 12:31       ` Stephen C. Tweedie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox