* questions on having a driver pin user memory for DMA
@ 2000-04-19 23:02 Weimin Tchen
2000-04-20 6:39 ` Eric W. Biederman
2000-04-20 12:27 ` Stephen C. Tweedie
0 siblings, 2 replies; 7+ messages in thread
From: Weimin Tchen @ 2000-04-19 23:02 UTC (permalink / raw)
To: linux-mm
Hello,
Could you advise a former DEC VMS driver-guy who is a recent Linux
convert with much to learn? I'm working on a driver for a NIC that
supports the Virtual Interface Architecture, which allows user processes
to register arbitrary virtual address ranges for DMA network transmit or
receive. The driver locks the user pages against paging and loads the
NIC with the physical addresses of these pages. Thus the user process
can initiate network DMA using its buffers directly (instead of having a
driver copy between a buffer in kernel memory and a user buffer).
There are at least 3 issues to resolve in registering this user memory
for DMA that I need help on:
1. lock against paging
2. after a fork(), copy-on-write changes the physical address of the
user buffer
3. a memory leak that can hang the system, if a process does: malloc a
memory buffer, register this memory, free the memory, THEN deregister
the memory.
- Issue 1.
Initially our driver locked memory by incrementing the page count. When
that turned out to be insufficient, I added setting the PG_locked bit
for the page frame. (However this bit is actually for locking during an
IO transfer. Thus I wonder if using PG_locked would cause a problem if
the user memory is also mapped to a file.) Since toggling the PG_locked
bit is not a counted semaphore, it also doesn't handle pages that are
registered multiple times. A common case would be 2 adjacent
registrations that end & start on the same page (since the Virtual
Interface Architecture allows buffers to be registered which are NOT
page aligned). Thus the first deregister will unlock the page even if
it is part of another buffer setup for DMA.
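A counted pin would handle the overlapping-registration case that a single lock bit cannot. As a toy userspace model (all names invented, none of this is kernel code):

```c
#include <assert.h>

/* Toy model (invented names, not kernel code): a per-page pin *count*
 * instead of a single PG_locked-style bit.  Two registrations that share
 * a boundary page each bump the count, so the page stays pinned until
 * BOTH have deregistered. */
#define NPAGES 4
int pin_count[NPAGES];

void register_range(int first, int last)   /* inclusive page range */
{
    for (int p = first; p <= last; p++)
        pin_count[p]++;
}

void deregister_range(int first, int last)
{
    for (int p = first; p <= last; p++)
        pin_count[p]--;
}

int page_pinned(int p)
{
    return pin_count[p] > 0;
}
```

Whether such bookkeeping lives in the driver's own tables or next to the page frame is a design choice; the point is only that unpinning has to be counted, not boolean.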
I'm probably misreading this, but it appears that /mm/memory.c:
map_user_kiobuf() pins user memory by just incrementing the page count.
Will this actually prevent paging or will it only prevent the virtual
address from being deleted from the user address space by a free()? I
see that /drivers/char/raw.c also has an #ifdef'ed call to
lock_kiovec(). This function checks the PG_locked flag,
and notes that multiply mapped pages are "Bad news". But our driver
needs to support multiple mappings.
Instead of using flags in the low-level page frame, I tried to use flags
in the vm_area_struct (process memory region) structures. I also hoped
to fix issue II (copy-on-write after fork) by setting VM_SHARED along w/
VM_LOCKED. So I tried adding private functions from mlock.c into our
driver, skipping the resource check and not aligning on page
boundaries and not merging segments. (Hopefully this would allow
adjacent registrations in the same page.) However after these changes,
the driver could not load since these routines reference others that
handle memory AVL trees (which had appeared to be public but actually
aren't exported):
- insert_vm_struct(),
- make_pages_present(),
- vm_area_cachep.
- Issue 2. (copy-on-write after fork):
A process uses our driver to register memory for DMA by having the
driver convert the process's buffer virtual pages into physical page
addresses which are then set up in the NIC for DMA. If the process forks
a child, then the Linux kernel appears to avoid overhead by copying the
vm_area_struct's and sharing the actual physical pages. If a write is
done, the child gets the physical pages and the parent gets new physical
pages which are copies. As a result the hardware is not pointing to the
correct physical pages in the parent. I was hoping to prevent this
copy-on-write by making the memory shared (which could have program side
effects) by setting VM_SHARED in the vm_area_struct. (Strangely VM_SHM
doesn't appear to be used much). But as noted above, I cannot use
functions handling vm_area_struct's like those in mlock.c.
Instead I have *temporarily* solved problems I & II by setting the
PG_reserved flag in page frame (instead of PG_locked). But I'd much
appreciate any advice on a better approach.
- Issue 3: memory leak:
There is a system memory leak which results from a slight application
programming error, when a user buffer is free()'ed before being
deregistered by our driver. Repeated operations can hang the system.
When memory is registered, our driver increments the page count to 2.
This appears to prevent the free() & deregister (which only decrements to 1)
from releasing the memory. This is actually needed to prevent releasing
the memory before unmapping it from NIC DMA. Instead of using the count,
PG_reserved can be used. However this also prevents the count from
getting decremented and the memory released as expected.
I had expected free() to just put the memory back on the heap which
would be cleaned up at process exit. But glibc-2.1.2/malloc/malloc.c
indicates that with large buffers, free() calls malloc_trim() which
calls sbrk() with a negative argument. PG_reserved appears to prevent
memory cleanup ( /mm/page_alloc.c:__free_pages() checks
if (!PageReserved(page) && atomic_dec_and_test(&page->count)) before
calling free_pages_ok() ). I haven't traced how our earlier use of
PG_locked and incrementing the count will also prevent free() from
decrementing the count.
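The check quoted above can be modelled in a few lines of plain C (a toy model with invented struct and field names, not the kernel code): a reserved page short-circuits the whole test, so its count is never decremented and the page is never handed back to the allocator.

```c
#include <assert.h>

/* Toy model of the __free_pages() check quoted above (names invented):
 * a PG_reserved page skips both the decrement and the release. */
struct page_m {
    int reserved;   /* models PageReserved(page) */
    int count;      /* models atomic page->count */
    int freed;      /* set where free_pages_ok() would run */
};

void free_pages_m(struct page_m *page)
{
    /* mirrors: if (!PageReserved(page) && atomic_dec_and_test(&page->count)) */
    if (!page->reserved && --page->count == 0)
        page->freed = 1;   /* free_pages_ok() would run here */
}
```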
When a process exits, the file_operations release function is run if the
NIC device has not been closed. Thus by artificially dropping the page
count in this function and doing __free_pages(), the leak can be
prevented. However the driver would need to be modified to have our
library's close_the_NIC() function not do a system close(), in
order to just use the file_operations release function for final
cleanup. There appear to be other system dependencies involved here, so
I'm not pursuing this further.
I don't understand why process exit code cleans up the virtual address
space before closing remaining devices. ( /kernel/exit.c:do_exit() calls
__exit_mm() and later calls __exit_files() ). I had hoped to cleanup
registered memory when __exit_files() runs our driver's release
function and let __exit_mm() do the rest.
Thanks for any advice,
-Weimin Tchen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
* Re: questions on having a driver pin user memory for DMA
2000-04-19 23:02 questions on having a driver pin user memory for DMA Weimin Tchen
@ 2000-04-20 6:39 ` Eric W. Biederman
2000-04-20 9:20 ` Ingo Oeser
2000-04-20 12:30 ` Stephen C. Tweedie
2000-04-20 12:27 ` Stephen C. Tweedie
1 sibling, 2 replies; 7+ messages in thread
From: Eric W. Biederman @ 2000-04-20 6:39 UTC (permalink / raw)
To: Weimin Tchen; +Cc: linux-mm
Weimin Tchen <wtchen@giganet.com> writes:
The rules of thumb on this issue are:
1) Don't pin user memory let user space mmap driver memory.
This appears to be what you are trying to achieve, with
your current implementation. Think user getting direct
access to kernel buffer, instead of kernel getting direct
access to user buffer. Same number of copies but
the management is simpler...
2) If you must have access to user memory use the evolving kiobuf
interface. But that is mostly useful for the single shot
read/write case.
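The user-space side of rule 1 might look roughly like this. It is a hedged sketch: MAP_ANONYMOUS stands in for a real device fd, and map_dma_buffer()/unmap_dma_buffer() are invented names; with a real driver the process would open the device and the mapping would go through the driver's mmap file operation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Sketch of "user space mmaps driver memory".  MAP_ANONYMOUS is a
 * stand-in for mapping a real NIC device fd. */
void *map_dma_buffer(size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return buf == MAP_FAILED ? NULL : buf;
}

int unmap_dma_buffer(void *buf, size_t len)
{
    return munmap(buf, len);
}
```

Because the buffer's lifetime is tied to the mapping rather than to malloc()/free(), the driver's cleanup runs when the mapping goes away, and the out-of-order free() problem of issue 3 cannot arise.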
I'm a little dense: with all of the headers and trailers
that are put on packets, how can it be efficient to DMA to/from
user memory? You have to look at everything to compute checksums
etc.
Your interface sounds like it walks around all of the networking
code in the kernel. How can that be good?
> Hello,
>
> Could you advise a former DEC VMS driver-guy who is a recent Linux
> convert with much to learn? I'm working on a driver for a NIC that
> support the Virtual Interface Architecture, which allows user processes
> to register arbitrary virtual address ranges for DMA network transmit or
> receive. The driver locks the user pages against paging and loads the
> NIC with the physical addresses of these pages. Thus the user process
> can initiate network DMA using its buffers directly (instead of having a
> driver copy between a buffer in kernel memory and a user buffer).
>
> There are at least 3 issues to resolve in registering this user memory
> for DMA that I need help on:
>
> 1. lock against paging
> 2. after a fork(), copy-on-write changes the physical address of the
> user buffer
Only if written to. It doesn't make sense to support writes
to a buffer while a device is doing DMA from it. Your only
responsibility to protect users from themselves is to prevent
kernel crashes.
> 3.a memory leak that can hang the system, if a process does: malloc a
> memory buffer, register this memory, free the memory, THEN deregister
> the memory.
Wrong interface. Using kiobufs or mmap clears this up.
>
> - Issue 1.
> Initially our driver locked memory by incrementing the page count.
Which keeps the page from being freed. Which guarantees the
page won't be reused by another kernel process.
> When
> that turned out to be insufficient, I added setting the PG_locked bit
> for the page frame. (However this bit is actually for locking during an
> IO transfer.
Well during a transfer in 2.3.x user space reads & writes also synchronize
with the page lock.
> Thus I wonder if using PG_locked would cause a problem if
> the user memory is also mapped to a file.) Since toggling the PG_locked
> bit is not a counted semaphore, it also doesn't handle pages that are
> registered multiple times. A common case would be 2 adjacent
> registrations that end & start on the same page (since the Virtual
> Interface Architecture allows buffers to be registered which are NOT
> page aligned). Thus the first deregister will unlock the page even if
> it is part of another buffer setup for DMA.
>
> I'm probably misreading this, but it appears that /mm/memory.c:
> map_user_kiobuf() pins user memory by just incrementing the page count.
Yep.
> Will this actually prevent paging
Paging is orthogonal it just gets a reference to the page, and keeps
the page from being reused by another kernel process until it is done.
The current users can still play with it...
> or will it only prevent the virtual
> address from being deleted from the user address space by a free()?
It doesn't do that at all.
> I
> see that /drivers/char/raw.c also has an #ifdef'ed call to
> lock_kiovec(). This function checks the PG_locked flag,
> and notes that multiply mapped pages are "Bad news". But our driver
> needs to support multiple mappings.
Right. You can't have the same user page used for 2 different
positions in a single transaction. It's just too confusing.
It's questionable whether anyone will need lock_kiovec, though...
>
> Instead of using flags in the low-level page frame, I tried to use flags
> in the vm_area_struct (process memory region) structures. I also hoped
> to fix issue II (copy-on-write after fork) by setting VM_SHARED along w/
> VM_LOCKED.
No. You are getting farther and farther from something maintainable.
Playing with the vm_area_struct is silly. It controls the user
view of memory. If you need that, implement an mmap operation.
If you are just borrowing the pages, use map_user_kiobuf...
> So I tried adding private function from mlock.c into our
> driver, by skipping the resource check and not aligning on page
> boundaries and not merging segments. (Hopefully this would allow
> adjacent registrations in the same page.) However after these changes,
> the driver could not load since these routines reference others that
> handle memory AVL trees (which had appeared to be public but actually
> aren't exported):
>
> - insert_vm_struct(),
> - make_pages_present(),
> - vm_area_cachep().
Generally the call with this kind of thing is to just add the
needed functions to the exported list. However in this
case you appear to be barking up the wrong tree.
>
>
> - Issue 2. (copy-on-write after fork):
Don't think register/deregister.
Think read/write -- kiobufs (1 shot deal)
or mmap/munmap -- always there until the process dies, or the munmap.
> A process uses our driver to register memory for DMA by having the
> driver convert the process's buffer virtual pages into physical page
> addresses which are then set up in the NIC for DMA. If the process forks
> a child, then the Linux kernel appears to avoid overhead by copying the
> vm_area_struct's and sharing the actual physical pages.
Yep.
> If a write is
> done, the child gets the physical pages and the parent gets new physical
> pages which are copies.
The first writer gets the copy, which could be parent or child.
> As a result the hardware is not pointing to the
> correct physical pages in the parent.
Yep, sure is; you were just expecting something different.
> I was hoping to prevent this
> copy-on-write by making the memory shared (which could have program side
> effects) by setting VM_SHARED in the vm_area_struct. (Strangely VM_SHM
> doesn't appear to be used much). But as noted above, I can not use
> functions handling vm_area_struct's like those in mlock.c.
If you need that, mmap.
>
> Instead I have *temporarily* solved problems I & II by setting the
> PG_reserved flag in page frame (instead of PG_locked). But I'd much
> appreciate any advice on a better approach.
>
>
> - Issue 3: memory leak:
> There is a system memory leak which results from a slight application
> programming error, when a user buffer is free()'ed before being
> deregistered by our driver.
If you go down to the kernel primitives sbrk, mmap & munmap, I
can follow. With malloc/free you aren't guaranteed page alignment
or anything, so I can't tell what is happening at a kernel level.
> Repeated operations can hang the system.
> When memory is registered, our driver increments the page count to 2.
> This appears to prevent the free() & deregister (only decrements to 1)
> from releasing the memory. This is actually needed to prevent releasing
> the memory before unmapping it from NIC DMA. Instead of using the count,
> PG_reserved can be used. However this also prevents the count from
> getting decremented and releasing as expected.
It looks like here you are trying to implement a weird form of mmap.
To do this right, let your driver call get_free_pages() behind the
scenes of an mmap call, and then release those pages when the mapping
comes to an end. This sounds like what you are struggling to implement.
>
> I had expected free() to just put the memory back on the heap which
> would be cleaned up at process exit. But glibc-2.1.2/malloc/malloc.c
> indicates that with large buffers, free() calls malloc_trim() which
> calls sbrk() with a negative argument. PG_reserved appears to prevent
> memory cleanup ( /mm/page_alloc.c:__free_pages() checks
> if (!PageReserved(page) && atomic_dec_and_test(&page->count)) before
> calling free_pages_ok() ). I haven't traced how our earlier use of
> PG_locked and incrementing the count, will also prevent free() from
> decrementing the count.
Playing with PG_reserved is just bad. It's only useful in special
cases.
>
> When a process exits, the file_operations release function is run if the
> NIC device has not been closed. Thus by artifically dropping the page
> count in this function and doing __free_pages() , the leak can be
> prevented. However the driver would need to be modified to have our
> library's function to close_the_NIC() not do a system close(), in
> order to just use the file_operations release function for final
> cleanup. There appear to be other system dependencies involved here, so
> I'm not pursuing this further.
>
> I don't understand why process exit code cleans up the virtual address
> space before closing remaining devices. ( /kernel/exit.c:do_exit() calls
> __exit_mm() and later calls __exit_files() ). I had hoped to cleanup
> registered memory when __exit_files() runs our driver's release
> function and let __exit_mm() do the rest.
Well you still could. exit_mm tears down the address space; it doesn't
play with the pages except to decrease their count by one. If the count
is still elevated you can go back and do something to them...
Eric
* Re: questions on having a driver pin user memory for DMA
2000-04-20 6:39 ` Eric W. Biederman
@ 2000-04-20 9:20 ` Ingo Oeser
2000-04-20 12:30 ` Stephen C. Tweedie
1 sibling, 0 replies; 7+ messages in thread
From: Ingo Oeser @ 2000-04-20 9:20 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Weimin Tchen, linux-mm
On 20 Apr 2000, Eric W. Biederman wrote:
> Your interface sounds like it walks around all of the networking
> code in the kernel. How can that be good?
It is not a NIC in the sense that you do TCP/IP over it. These
NICs with VIA support are used in high speed homogeneous networks
between cluster nodes IIRC.
So it _is_ ok to work around all this networking code, because
they do DSHM and message passing with these networks in a very
homogeneous manner.
Right Weimin?
Regards
Ingo Oeser
--
Feel the power of the penguin - run linux@your.pc
<esc>:x
* Re: questions on having a driver pin user memory for DMA
2000-04-20 6:39 ` Eric W. Biederman
2000-04-20 9:20 ` Ingo Oeser
@ 2000-04-20 12:30 ` Stephen C. Tweedie
1 sibling, 0 replies; 7+ messages in thread
From: Stephen C. Tweedie @ 2000-04-20 12:30 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Weimin Tchen, linux-mm
Hi,
On Thu, Apr 20, 2000 at 01:39:53AM -0500, Eric W. Biederman wrote:
>
> The rules of thumb on this issue are:
> 1) Don't pin user memory let user space mmap driver memory.
map_user_kiobuf is intended to allow user space buffers to be mapped
the other way safely.
> 2) If you must have access to user memory use the evolving kiobuf
> interface. But that is mostly useful for the single shot
> read/write case.
There are not many problems with long-lived buffers mapped by kiobufs.
The fork problem is the main one, but we already have patches for that.
> I'm a little dense, with all of the headers and trailers
> that are put on packets how can it be efficient to DMA to/from
> user memory? You have to look at everything to compute checksums
> etc.
VIA != IP.
> Your interface sounds like it walks around all of the networking
> code in the kernel. How can that be good?
VIA != networking. VIA == messaging. It provides for very (VERY)
low latency user-space-to-user-space transfers, bypassing the O/S
entirely by allowing the O/S to grant the application direct,
limited access to the HW control queues.
--Stephen
* Re: questions on having a driver pin user memory for DMA
2000-04-19 23:02 questions on having a driver pin user memory for DMA Weimin Tchen
2000-04-20 6:39 ` Eric W. Biederman
@ 2000-04-20 12:27 ` Stephen C. Tweedie
2000-04-20 23:43 ` Weimin Tchen
1 sibling, 1 reply; 7+ messages in thread
From: Stephen C. Tweedie @ 2000-04-20 12:27 UTC (permalink / raw)
To: Weimin Tchen; +Cc: linux-mm
Hi,
On Wed, Apr 19, 2000 at 07:02:32PM -0400, Weimin Tchen wrote:
>
> Could you advise a former DEC VMS driver-guy who is a recent Linux
> convert with much to learn?
Sure. I'm a former DEC VMS F11BXQP and Spiralog guy myself. Pleased
to meet you! :-)
> There are at least 3 issues to resolve in registering this user memory
> for DMA that I need help on:
>
> 1. lock against paging
Simple enough. Just a page reference count increment is enough for
this.
> 2. after a fork(), copy-on-write changes the physical address of the
> user buffer
What fork() semantics do you want, though? VIA is a little ambiguous
about this right now.
> 3.a memory leak that can hang the system, if a process does: malloc a
> memory buffer, register this memory, free the memory, THEN deregister
> the memory.
Shouldn't be a problem if you handle page reference counts correctly.
> - Issue 1.
> Initially our driver locked memory by incrementing the page count. When
> that turned out to be insufficient,
In what way is it insufficient? An unlocked page may be removed from
the process's page tables, but as long as the refcount is held on the
physical page it should never actually be destroyed, and the mapping
between VA and physical page should be restored on any subsequent page
fault.
> I added setting the PG_locked bit
> for the page frame. (However this bit is actually for locking during an
> IO transfer. Thus I wonder if using PG_locked would cause a problem if
> the user memory is also mapped to a file.)
It shouldn't do.
> I'm probably misreading this, but it appears that /mm/memory.c:
> map_user_kiobuf() pins user memory by just incrementing the page count.
> Will this actually prevent paging or will it only prevent the virtual
> address from being deleted from the user address space by a free()?
It prevents the physical page from being destroyed until the corresponding
free_page. It also prevents the VA-to-physical-page mapping from
disappearing, unless the user happens to do a new mmap or munmap on
that VA range. If that happens, the physical page is dissociated from the
VA but remains available to the driver, so nothing bad happens.
> see that /drivers/char/raw.c also has an #ifdef'ed call to
> lock_kiovec(). This function checks the PG_locked flag,
> and notes that multiply mapped pages are "Bad news". But our driver
> needs to support multiple mappings.
That's why we don't do a lock_kiovec() by default right now.
> Instead of using flags in the low-level page frame, I tried to use flags
> in the vm_area_struct (process memory region) structures. I also hoped
> to fix issue II (copy-on-write after fork) by setting VM_SHARED along w/
> VM_LOCKED.
We already have a solution to the fork issue and are currently trying
to persuade Linus to accept it. Essentially you just have to be able
to force a copy-out instead of deferring COW when you fork on a page
which has outstanding hardware IO mapped, unless the VMA is VM_SHARED.
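The behaviour Stephen describes can be captured in a toy model (all names invented for illustration; this is not the actual patch): at fork time, a page with outstanding hardware references is copied out to the child immediately, so the parent keeps the physical frame the NIC was programmed with.

```c
#include <assert.h>

/* Toy model of the proposed fork() fix.  next_frame hands out fresh
 * "physical frame numbers"; every name here is invented. */
static int next_frame = 100;

struct map_m { int frame; };

void fork_page_m(const struct map_m *parent, struct map_m *child, int io_pinned)
{
    if (io_pinned)
        child->frame = next_frame++;  /* copy out now: child gets a new frame */
    else
        child->frame = parent->frame; /* ordinary deferred COW sharing */
}
```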
> Instead I have *temporarily* solved problems I & II by setting the
> PG_reserved flag in page frame (instead of PG_locked). But I'd much
> appreciate any advice on a better approach.
PG_reserved is actually quite widely used for this sort of thing.
It is quite legitimate as long as you are very careful about what
sort of pages you apply it to. Specifically, you need to have
cleanup in place for when the area is released, and that implies
that PG_reserved is only really legal if you are using it on pages
which have been explicitly allocated by a driver and mmap()ed into
user space.
> - Issue 3: memory leak:
> There is a system memory leak which results from a slight application
> programming error, when a user buffer is free()'ed before being
> deregistered by our driver. Repeated operations can hang the system.
> When memory is registered, our driver increments the page count to 2.
> This appears to prevent the free() & deregister (only decrements to 1)
> from releasing the memory.
That's correct. The memory may no longer be in use by the driver,
but until the application munmap()s it it is still registered as in
use by the application.
> I had expected free() to just put the memory back on the heap which
> would be cleaned up at process exit. But glibc-2.1.2/malloc/malloc.c
> indicates that with large buffers, free() calls malloc_trim() which
> calls sbrk() with a negative argument.
As far as the kernel is concerned internally, the unmapping fixup which
happens is the same in both cases.
> PG_reserved appears to prevent
> memory cleanup ( /mm/page_alloc.c:__free_pages() checks
> if (!PageReserved(page) && atomic_dec_and_test(&page->count)) before
> calling free_pages_ok() ).
Correct. That's why you need to be mmap()ing, not using map_user_kiobuf,
to use PG_reserved. Either that, or you record which pages the driver
has reserved, and release them manually when some other trigger happens
such as a close of a driver file descriptor.
> I don't understand why process exit code cleans up the virtual address
> space before closing remaining devices. ( /kernel/exit.c:do_exit() calls
> __exit_mm() and later calls __exit_files() ).
Driver-related memory functions are expected to be within driver-specific
mmap()ed areas, so the appropriate driver callback happens in exit_mm,
not in exit_files.
Cheers,
Stephen
* Re: questions on having a driver pin user memory for DMA
2000-04-20 12:27 ` Stephen C. Tweedie
@ 2000-04-20 23:43 ` Weimin Tchen
2000-04-21 18:20 ` Kanoj Sarcar
0 siblings, 1 reply; 7+ messages in thread
From: Weimin Tchen @ 2000-04-20 23:43 UTC (permalink / raw)
Cc: linux-mm
Ingo Oeser wrote:
> On 20 Apr 2000, Eric W. Biederman wrote:
>
> > Your interface sounds like it walks around all of the networking
> > code in the kernel. How can that be good?
>
> It is not a NIC in the sense that you do TCP/IP over it. These
> NICs with VIA support are used in high speed homogenous networks
> between cluster nodes IIRC.
Yes, I should have explained better. Our NIC allows a user buffer to handle
message transfers & receives with a remote node also fitted with our NIC. We have
recently added another driver that fits our software & hardware under the
TCP/IP stack using skb's & netif_rx() etc. like the ethernet driver does. But
direct user-level access of the NIC is more efficient with minimal kernel
support.
    user-level program with user-level memory
         |                  |
     DMA |               or | VI arch library calls
         |                  |
         |               or | VI driver
    NIC which has an ASIC that can DMA into/from user-level memory
         |
         +--- point-to-point connection to a remote node with our NIC,
              or to our switch box --- etc.
For standard Ethernet, the software and hardware contribute about equal
overhead to total message latency. With gigabit-speed networks, the hardware
latency is a minor concern in comparison with the much higher software
overhead. The NIC's DMA skips much of the kernel work. One use of our product
is in scientific applications that can be run in parallel on PCs with fast
access to shared data. Our product can be layered underneath MPI Software
Technology's Message Passing Interface library that is used by the scientific
community.
If you are interested here is more info on VI (Virtual Interface Arch)
http://www.viarch.org/
http://www.intel.com/design/servers/vi/
http://www.mpi-softtech.com/
"Stephen C. Tweedie" wrote:
> Sure. I'm a former DEC VMS F11BXQP and Spiralog guy myself.
Then Linux internals must seem a breeze compared w/ XQP crashes. My hat
is off to people who handled the XQP, like Robert Rappaport. DEC did excellent
clustering using its proprietary SCS message protocol over a proprietary CI
bus. But inexpensive hardware and common standards are winning the day.
>
> > 2. after a fork(), copy-on-write changes the physical address of the
> > user buffer
>
> What fork() semantics do you want, though? VIA is a little ambiguous
> about this right now.
>
The MPI library does a fork() outside of user program control, so this can
steal away the physical pages set up by the parent for DMA, without warning. We
didn't notice this since our library uses pthreads, which probably uses clone()
to share the address space.
>
> > 3.a memory leak that can hang the system, if a process does: malloc a
> > memory buffer, register this memory, free the memory, THEN deregister
> > the memory.
>
> Shouldn't be a problem if you handle page reference counts correctly.
By checking the page count inside our driver, it appears that:
- malloc() sets page count = 1
- our driver's memory register operation increments count to 2
- an out-of-order free() does NOT reduce the count (even when we were using
  PG_locked instead of PG_reserved)
- our driver's memory DEregister operation decrements count to 1
As a result, the page does not get released back to the system.
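The lifecycle Stephen has in mind (the one the observed sequence above diverges from) can be written down as a toy get/put model, with invented names: the driver's extra reference rides out an out-of-order free(), and the final deregister is what actually releases the page.

```c
#include <assert.h>

/* Toy get/put model of the intended page-count lifecycle (names
 * invented).  "freed" marks where the page would really go back
 * to the allocator. */
struct pg_m { int count; int freed; };

void get_pg(struct pg_m *p) { p->count++; }

void put_pg(struct pg_m *p)
{
    if (--p->count == 0)
        p->freed = 1;
}
```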
> > 1. lock against paging
>
> Simple enough. Just a page reference count increment is enough for
> this.
Originally we thought that handling the page count was enough to prevent
paging, but DMA was not occurring into the correct user memory when there was
heavy memory use by another application. This was fixed by setting PG_locked
on the page. (Now I'm using PG_reserved to also solve the fork() problem.)
> > - Issue 1.
> > Initially our driver locked memory by incrementing the page count. When
> > that turned out to be insufficient,
>
> In what way is it insufficient? An unlocked page may be removed from
> the process's page tables, but as long as the refcount is held on the
> physical page it should never actually be destroyed, and the mapping
> between VA and physical page should be restored on any subsequent page
> fault.
>
I imagine that if a CPU instruction references a virtual page that has been
totally paged out to disk, then the kernel will fix up the fault and set up a
NEW physical page with a copy of the data from disk. However our NIC just DMA's
to the physical memory without faulting on a virtual address.
> > I added setting the PG_locked bit
> > for the page frame. (However this bit is actually for locking during an
> > IO transfer. Thus I wonder if using PG_locked would cause a problem if
> > the user memory is also mapped to a file.)
>
> It shouldn't do.
>
Thanks. I'm concerned about a user buffer being mapped to a file also. When
file IO is done, the PG_locked flag would be cleared, so the page would no
longer be pinned.
>
> > I'm probably misreading this, but it appears that /mm/memory.c:
> > map_user_kiobuf() pins user memory by just incrementing the page count.
> > Will this actually prevent paging or will it only prevent the virtual
> > address from being deleted from the user address space by a free()?
>
> It prevents the physical page from being destroyed until the corresponding
> free_page. It also prevents the VA-to-physical-page mapping from
> disappearing,
I'm unclear: does a page count > 0
- only reserve the page frame structure so that new physical memory can be
  set up when paging in,
- or does it actually keep the physical memory allocated for that user memory
  virtual address?
> > I also hoped
> > to fix issue II (copy-on-write after fork) by setting VM_SHARED along w/
> > VM_LOCKED.
>
> We already have a solution to the fork issue and are currently trying
> to persuade Linus to accept it. Essentially you just have to be able
> to force a copy-out instead of deferring COW when you fork on a page
> which has outstanding hardware IO mapped, unless the VMA is VM_SHARED.
>
It sounds great. Did you run into similar problems w/ fork()? We saw this even
if the child did very little, so it probably did not touch the registered pages
(which seems to be contrary to COW operation).
>
> PG_reserved is actually quite widely used for this sort of thing.
> It is quite legitimate as long as you are very careful about what
> sort of pages you apply it to. Specifically, you need to have
> cleanup in place for when the area is released, and that implies
> that PG_reserved is only really legal if you are using it on pages
> which have been explicitly allocated by a driver and mmap()ed into
> user space.
>
Yes, I saw PG_reserved used in many drivers, but I'm concerned that this is a
kludge that has side effects. Rubini's book recommended not using it. Our
driver uses it in both a memory registration ioctl() and in an mmap operation.
Our driver cleans up in a DEregister ioctl() by using our driver's structures
that record the locked pages. This cleanup also gets run by the driver's
release operation if the program aborts.
>
> > PG_reserved appears to prevent
> > memory cleanup
> Correct. That's why you need to be mmap()ing, not using map_user_kiobuf,
> to use PG_reserved. Either that, or you record which pages the driver
> has reserved, and release them manually when some other trigger happens
> such as a close of a driver file descriptor.
>
Yes our driver does that.
Thanks to all of you for your advice,
-Weimin
* Re: questions on having a driver pin user memory for DMA
2000-04-20 23:43 ` Weimin Tchen
@ 2000-04-21 18:20 ` Kanoj Sarcar
0 siblings, 0 replies; 7+ messages in thread
From: Kanoj Sarcar @ 2000-04-21 18:20 UTC (permalink / raw)
To: Weimin Tchen; +Cc: linux-mm
Just wanted to point out that I do have a patch for the fork/cow problem
for 2.3 (relevant only for threaded programs), and have talked to Linus
about this. We will see if he agrees to take it in. Another thing is that
map_user_kiobuf races with kswapd; my patch has fixes
for that too.
What I haven't started looking at yet (acceptance of the above patch is
a prerequisite) is how an user program can do a system call that will
invoke map_user_kiobuf(), and then return from the call with the pages
staying pinned. (For now, the best alternative is to use mlock() for
such long lived pinning. I am not sure if anything more is needed here,
but would have to look at the fork path handling to decide). No, I
an not going to be dragged into a discussion about this right now,
this is just an FYI if you do have a need for this support.
Oh, btw, stay away from PG_locked for your network driver pinning method,
hangs will happen if your buffer is mapped to file pages.
Kanoj
end of thread, other threads:[~2000-04-21 18:20 UTC | newest]
Thread overview: 7+ messages
2000-04-19 23:02 questions on having a driver pin user memory for DMA Weimin Tchen
2000-04-20 6:39 ` Eric W. Biederman
2000-04-20 9:20 ` Ingo Oeser
2000-04-20 12:30 ` Stephen C. Tweedie
2000-04-20 12:27 ` Stephen C. Tweedie
2000-04-20 23:43 ` Weimin Tchen
2000-04-21 18:20 ` Kanoj Sarcar