linux-mm.kvack.org archive mirror
* Re: mapping user space buffer to kernel address space
       [not found]       ` <20001017001349.F17222@athlon.random>
@ 2000-10-17 13:53         ` Stephen Tweedie
  0 siblings, 0 replies; 3+ messages in thread
From: Stephen Tweedie @ 2000-10-17 13:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen Tweedie, Linus Torvalds, Rogier Wolff, linux-kernel, linux-mm

Hi,

On Tue, Oct 17, 2000 at 12:13:49AM +0200, Andrea Arcangeli wrote:
> 
> Correct. But the problem is that the page won't stay in physical memory after
> we finished the I/O because swap cache with page count 1 will be freed by the
> VM.

Rik has been waiting for an excuse to get deferred swapout into the
mainline.  Sounds like we've got the excuse.

> And anyways from a design standpoint it looks much better to really pin the
> page in the pte too (just like kernel reserved pages are pinned after a
> remap_page_range).

Unfortunately, there is one common case where we want to do exactly
that.  "dd < /dev/zero > something_using_raw_io" maps a whole series
of identical readonly ZERO_PAGE pages into the kiobuf.  One of the
reasons I removed the automatic page locking was that otherwise we're
forced to special-case things like ZERO_PAGE in the locking code.

Even ignoring that, users _will_ submit multiple IOs in the same page.
Pinning the physical page with page->count is clean.  Doing the
locking with the page lock makes no sense if you have adjacent IOs or
if you want to maintain the kiobuf mapping for any length of time.
The point of kiobufs was to avoid VM hacks so that IO can be done at
physical page level.  Pinning ptes should not have anything to do with
the IO or we've lost that abstraction.
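
To illustrate the difference between pinning by reference count and taking the page lock, here is a minimal userspace sketch. It is not kernel code: `struct toy_page` and the `tp_*` helpers are hypothetical stand-ins invented for this example, and the "count of 1 means freeable" rule is a simplification of the swap-cache case mentioned earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; not the kernel's definition. */
struct toy_page {
    int count;    /* reference count; 1 means only the owner holds it */
    bool locked;  /* exclusive page-lock bit */
};

/* Pin by reference count: any number of IOs may hold the page at once. */
void tp_pin(struct toy_page *p)
{
    p->count++;
}

void tp_unpin(struct toy_page *p)
{
    assert(p->count > 1);   /* never drop the owner's reference here */
    p->count--;
}

/* Exclusive lock: a second taker fails (real code would sleep instead). */
bool tp_trylock(struct toy_page *p)
{
    if (p->locked)
        return false;
    p->locked = true;
    return true;
}

void tp_unlock(struct toy_page *p)
{
    p->locked = false;
}

/* Simplified reclaim rule: freeable only when no extra references exist. */
bool tp_can_free(const struct toy_page *p)
{
    return p->count == 1;
}
```

Two IOs that touch the same page (adjacent requests, or many kiobuf entries that all point at ZERO_PAGE) can each call `tp_pin()` without conflict, whereas the second `tp_trylock()` fails until the first holder unlocks.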

> Replacing the get_user/put_user with handle_mm_fault _after_ changing
> follow_page to check the dirty bit too in the write case should be ok.

Right.

> > Once I'm back in the UK I'll look at getting map_user_kiobuf() simply
> > to call the existing access_one_page() from ptrace.  You're right,
> 
> access_one_page is missing the pagetable lock too, but that seems the only
> problem. I'm not convinced mixing the internal of access_one_page and
> map_user_kiobuf is a good thing since they need to do very different things
> in the critical section.

Not the whole of access_one_page, but the pagetable-locked
follow-page / handle_mm_fault loop should be common code.  That's
where we're having the problem, so let's avoid having to maintain it
in two places.
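
The shape of that shared loop can be sketched in userspace C as follows. This is only an illustration under stated assumptions: every `toy_*` and `pt_*` name is hypothetical, the page table is reduced to a single entry, and the fault handler always succeeds. The parts taken from the discussion are that the lookup runs under the pagetable lock, that a write lookup also requires the dirty bit, and that the fault is handled with the lock dropped before retrying.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-entry "page table"; not the kernel's data structures. */
struct toy_pte { bool present, writable, dirty; int pins; };
struct toy_mm  { bool pt_locked; int faults; struct toy_pte pte; };

void pt_lock(struct toy_mm *mm)   { assert(!mm->pt_locked); mm->pt_locked = true; }
void pt_unlock(struct toy_mm *mm) { assert(mm->pt_locked);  mm->pt_locked = false; }

/* Lookup under the lock; for writes, require writable AND dirty. */
struct toy_pte *toy_follow_page(struct toy_mm *mm, bool write)
{
    struct toy_pte *pte = &mm->pte;

    assert(mm->pt_locked);
    if (!pte->present)
        return NULL;
    if (write && !(pte->writable && pte->dirty))
        return NULL;
    return pte;
}

/* Stand-in fault handler; must be called without the pagetable lock. */
bool toy_handle_fault(struct toy_mm *mm, bool write)
{
    assert(!mm->pt_locked);
    mm->faults++;
    mm->pte.present = true;
    if (write) {
        mm->pte.writable = true;
        mm->pte.dirty = true;
    }
    return true;
}

/* The common loop: look up, pin while still locked, else fault and retry. */
bool toy_pin_user_page(struct toy_mm *mm, bool write)
{
    for (;;) {
        pt_lock(mm);
        struct toy_pte *pte = toy_follow_page(mm, write);
        if (pte) {
            pte->pins++;        /* take the reference before unlocking */
            pt_unlock(mm);
            return true;
        }
        pt_unlock(mm);
        if (!toy_handle_fault(mm, write))
            return false;
    }
}
```

Taking the reference before dropping the lock is the point of the exercise: nothing can unmap the page between the lookup and the pin. Both a ptrace-style accessor and a kiobuf mapper could call one such helper and differ only in what they do with the pinned page afterwards.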

Cheers, 
 Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


* Re: mapping user space buffer to kernel address space
  2000-10-19 20:06   ` Stephen Tweedie
@ 2000-10-20 17:34     ` Linus Torvalds
  0 siblings, 0 replies; 3+ messages in thread
From: Linus Torvalds @ 2000-10-20 17:34 UTC (permalink / raw)
  To: Stephen Tweedie; +Cc: Andrea Arcangeli, Rogier Wolff, linux-kernel, linux-mm


On Thu, 19 Oct 2000, Stephen Tweedie wrote:
> > 
> > Then, we'd move the "writeout" part into the LRU queue side, and at that
> > point I agree with you 100% that we probably should just delay it until
> > there are no mappings available
> 
> I've just been talking about this with Ben LaHaise and Rik van Riel,
> and Ben brought up a nasty problem --- NFS, which because of its
> credentials requirements needs to have the struct file available in
> its writepage function.  Of course, if we defer the write then we
> don't necessarily have the file available when we come to flush the
> page from cache.

Yes. But that doesn't mean that swapping couldn't do it (swapping
fundamentally doesn't have credentials).

And note that this is not about "NFS is broken" - any remote filesystem
will have some issues like this, and shared mappings will always have to
handle this case.

So basically I agree that shared mappings cannot be converted to this
setup, I was only talking about the specific case of the swapping (and
anonymous shared memory, which along with SysV IPC shm is basically the
same thing and already uses the swap cache).

So what I was thinking of was the very end of try_to_swap_out(), where we
have noticed that we do not have a "swapout()" function, and we need to
add the page to the swap cache. I would suggest moving _that_ code to the
LRU queue, and handling it conceptually together with the stuff that
handles the buffer cache writeout.
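
A rough userspace model of the deferred-swapout idea discussed in this thread might look like the sketch below. The `sw_*` names and structures are hypothetical and nothing here is actual kernel code. The scan step only clears the PTE and, if it was dirty, marks the page dirty in a pretend swap cache without starting IO; the LRU side performs the write, and only once no mappings remain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page and PTE; writes counts simulated IOs. */
struct sw_page { int mappings; bool dirty; bool in_swap_cache; int writes; };
struct sw_pte  { struct sw_page *page; bool dirty; };

/* VM scan side: unmap, transfer the dirty state, but do NOT start IO. */
void sw_scan_pte(struct sw_pte *pte)
{
    struct sw_page *page = pte->page;

    if (page == NULL)
        return;
    if (pte->dirty) {
        page->in_swap_cache = true;
        page->dirty = true;
    }
    pte->page = NULL;       /* clear the PTE */
    pte->dirty = false;
    assert(page->mappings > 0);
    page->mappings--;       /* drop this mapping's reference */
}

/* LRU side: write out only dirty pages that nobody maps any more. */
bool sw_lru_writeout(struct sw_page *page)
{
    if (page->mappings > 0 || !page->dirty)
        return false;
    page->writes++;         /* stands in for the real swap write */
    page->dirty = false;
    return true;
}
```

With a page mapped twice, scanning the first PTE leaves `sw_lru_writeout()` returning false; the write happens once, after the second mapping is gone, instead of once per unmapping.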

--

And no, I haven't forgotten about the case of direct IO into a shared
mapping. That _is_ going to be different in many ways, and I suspect that
a solution to that particular issue may be to move the "vm_file"
information from when we do the virtual kiobuf lookup into the kiobufs,
because otherwise we'd basically lose that information.

(We _already_ lose that information, in fact. Keeping the page in the
virtual mapping doesn't really even fix it - because the page can be in
multiple virtual mappings with different vm_file's and thus different
credentials. And the kiobufs do not really contain any information about
_which_ of the credentials we looked up. It happens to work, but it's
conceptually not very correct).

			Linus


* Re: mapping user space buffer to kernel address space
       [not found] ` <Pine.LNX.4.10.10010172129290.6732-100000@penguin.transmeta.com>
@ 2000-10-19 20:06   ` Stephen Tweedie
  2000-10-20 17:34     ` Linus Torvalds
  0 siblings, 1 reply; 3+ messages in thread
From: Stephen Tweedie @ 2000-10-19 20:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Rogier Wolff, Stephen C. Tweedie, linux-kernel,
	linux-mm

Hi,

On Tue, Oct 17, 2000 at 09:42:36PM -0700, Linus Torvalds wrote:

> Now, the way I've always envisioned this to work is that the VM scanning
> function basically always does the equivalent of just
> 
>  - get PTE entry, clear it out.
>  - if PTE was dirty, add the page to the swap cache, and mark it dirty,
>    but DON'T ACTUALLY START THE IO!
>  - free the page.
> 
> Then, we'd move the "writeout" part into the LRU queue side, and at that
> point I agree with you 100% that we probably should just delay it until
> there are no mappings available

I've just been talking about this with Ben LaHaise and Rik van Riel,
and Ben brought up a nasty problem --- NFS, which because of its
credentials requirements needs to have the struct file available in
its writepage function.  Of course, if we defer the write then we
don't necessarily have the file available when we come to flush the
page from cache.

One answer is to say "well then NFS is broken, fix it".  It's not too
hard --- NFS mmaps need a wp_page function which registers the
caller's credentials against the page when we dirty it so that we can
use those credentials on flush.  That means that writes to a
multiply-mapped file essentially get random credentials, but I don't
think we care --- the credentials eventually used will be enough to
avoid the root_squash problems and the permissions at open make sure
we're not doing anything illegal.  
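
As a small illustration of registering credentials at dirty time, consider the userspace sketch below. All `nfs_toy_*` names are hypothetical and this is not the NFS client's code. It also assumes that the most recent writer's credentials replace any earlier ones, which is just one way the "random credentials" behaviour for multiply-mapped files could come about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical credential and page; not kernel or NFS structures. */
struct nfs_toy_cred { int uid; };
struct nfs_toy_page { bool dirty; const struct nfs_toy_cred *cred; };

/* Called when a mapping dirties the page: remember who did it. */
void nfs_toy_wp_page(struct nfs_toy_page *page,
                     const struct nfs_toy_cred *caller)
{
    page->dirty = true;
    page->cred = caller;    /* assumption: the latest writer wins */
}

/*
 * Deferred flush: no struct file is needed, only the stored credential.
 * Returns the uid used for the write, or -1 if the page was clean.
 */
int nfs_toy_flush(struct nfs_toy_page *page)
{
    int uid;

    if (!page->dirty)
        return -1;
    assert(page->cred != NULL);
    uid = page->cred->uid;
    page->dirty = false;
    page->cred = NULL;
    return uid;
}
```

If two processes with different credentials dirty the same page before it is flushed, the flush carries whichever credential was stored last; either one was sufficient to open the file, which is the argument made above for not caring which is used.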

(Changing permissions on an already-mmaped file and causing the NFS
server to refuse the write raises problems which are ... interesting,
but I'm not convinced that that is a new problem; I suspect we can
fabricate such a failure today.)

--Stephen
