* [patch] mm: pageable memory allocator (for DRM-GEM?)
@ 2008-09-23 9:10 Nick Piggin
2008-09-23 10:21 ` Thomas Hellström
` (3 more replies)
0 siblings, 4 replies; 23+ messages in thread
From: Nick Piggin @ 2008-09-23 9:10 UTC (permalink / raw)
To: keith.packard, eric, hugh, hch, airlied, jbarnes, thomas,
dri-devel, Linux Memory Management List,
Linux Kernel Mailing List
Hi,
So I promised I would look at this again, because I (and others) have some
issues with exporting shmem_file_setup for DRM-GEM to go off and do things
with.
The rationale for using shmem seems to be that pageable "objects" are needed,
and they can't be created by userspace because that would be ugly for some
reason, and/or they are required before userland is running.
I particularly don't like the idea of exposing these vfs objects to random
drivers because they're likely to get things wrong or become out of synch
or unreviewed if things change. I suggested a simple pageable object allocator
that could live in mm and hide the exact details of how shmem / pagecache
works. So I've coded that up quickly.
Upon actually looking at how "GEM" makes use of its shmem_file_setup filp, I
see something strange... it seems that userspace actually gets some kind of
descriptor, a descriptor to an object backed by this shmem file (let's call it
a "file descriptor"). Anyway, it turns out that userspace sometimes needs to
pread, pwrite, and mmap these objects, but unfortunately it has no direct way
to do that, due to not having open(2)ed the files directly. So what GEM does
is to add some ioctls which take the "file descriptor" things, and derives
the shmem file from them, and then calls into the vfs to perform the operation.
If my cursory reading is correct, then my allocator won't work so well as a
drop in replacement because one isn't allowed to know about the filp behind
the pageable object. It would also indicate some serious crack smoking by
anyone who thinks open(2), pread(2), mmap(2), etc is ugly in comparison...
So please, nobody who worked on that code is allowed to use ugly as an
argument. Technical arguments are fine, so let's try to cover them.
BTW. without knowing much of either the GEM or the SPU subsystems, the
GEM problem seems similar to SPU. Did anyone look at that code? Was it ever
considered to make the object allocator be a filesystem? That way you could
control the backing store to the objects yourself, those that want pageable
memory could use the following allocator, the ioctls could go away,
you could create your own objects if needed before userspace is up...
---
Create a simple memory allocator which can page out objects when they are
not in use. Uses shmem for the main infrastructure (except in the nommu
case where it uses slab). The smallest unit of granularity is a page, so it
is not yet suitable for tiny objects.
The API allows creation and deletion of memory objects, pinning and
unpinning of address ranges within an object, mapping ranges of an object
in KVA, dirtying ranges of an object, and operating on pages within the
object.
Cc: keith.packard@intel.com, eric@anholt.net, hugh@veritas.com, hch@infradead.org, airlied@linux.ie, jbarnes@virtuousgeek.org, thomas@tungstengraphics.com, dri-devel@lists.sourceforge.net
---
Index: linux-2.6/include/linux/pageable_alloc.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pageable_alloc.h
@@ -0,0 +1,112 @@
+#ifndef __MM_PAGEABLE_ALLOC_H__
+#define __MM_PAGEABLE_ALLOC_H__
+
+#include <linux/mm.h>
+
+struct pgobj;
+typedef struct pgobj pgobj_t;
+
+/**
+ * pageable_alloc_object - Allocate a pageable object
+ * @size: size in bytes
+ * @nid: preferred node, or -1 for default policy
+ * Returns: an object pointer, or IS_ERR pointer on fail
+ */
+pgobj_t *pageable_alloc_object(unsigned long size, int nid);
+
+/**
+ * pageable_free_object - Free a pageable object
+ * @object: object pointer
+ */
+void pageable_free_object(pgobj_t *object);
+
+/**
+ * pageable_pin_object - Pin an address range of a pageable object
+ * @object: object pointer
+ * @start: first byte in the object to be pinned
+ * @end: last byte in the object to be pinned (not inclusive)
+ *
+ * pageable_pin_object must be called before the memory range can be used in
+ * any way by the pageable object accessor functions. pageable_pin_object may
+ * have to swap pages in from disk. A successful call must be followed (at
+ * some point) by a call to pageable_unpin_object with the same range.
+ *
+ * Note: the end address is not inclusive, so a (0, 1) range is the first byte.
+ */
+int pageable_pin_object(pgobj_t *object, unsigned long start, unsigned long end);
+
+/**
+ * pageable_unpin_object - Unpin an address range of a pageable object
+ * @object: object pointer
+ * @start: first byte in the object to be unpinned
+ * @end: last byte in the object to be unpinned (not inclusive)
+ *
+ * Note: the end address is not inclusive, so a (0, 1) range is the first byte.
+ */
+void pageable_unpin_object(pgobj_t *object, unsigned long start, unsigned long end);
+
+/**
+ * pageable_dirty_object - Dirty an address range of a pageable object
+ * @object: object pointer
+ * @start: first byte in the object to be dirtied
+ * @end: last byte in the object to be dirtied (not inclusive)
+ *
+ * If a part of the memory of a pageable object is written to,
+ * pageable_dirty_object must be called on this range before it is unpinned.
+ *
+ * Note: the end address is not inclusive, so a (0, 1) range is the first byte.
+ */
+void pageable_dirty_object(pgobj_t *object, unsigned long start, unsigned long end);
+
+/**
+ * pageable_get_page - Get one page of a pageable object
+ * @object: object pointer
+ * @off: byte in the object containing the desired page
+ * Returns: the requested page pointer
+ *
+ * Note: this does not increment the page refcount in any way, however the
+ * page refcount would already be pinned by a call to pageable_pin_object.
+ */
+struct page *pageable_get_page(pgobj_t *object, unsigned long off);
+
+/**
+ * pageable_dirty_page - Dirty one page of a pageable object
+ * @object: object pointer
+ * @page: page pointer returned by pageable_get_page
+ *
+ * Like pageable_dirty_object. If the page returned by pageable_get_page
+ * is dirtied, pageable_dirty_page must be called before it is unpinned.
+ */
+void pageable_dirty_page(pgobj_t *object, struct page *page);
+
+/**
+ * pageable_vmap_object - Map an address range of a pageable object
+ * @object: object pointer
+ * @start: first byte in the object to be mapped
+ * @end: last byte in the object to be mapped (not inclusive)
+ * Returns: kernel virtual address, or NULL on memory allocation failure
+ *
+ * This maps a specified range of a pageable object into kernel virtual
+ * memory, where it can be treated and operated on as regular memory. It
+ * must be followed by a call to pageable_vunmap_object.
+ *
+ * Note: the end address is not inclusive, so a (0, 1) range is the first byte.
+ */
+void *pageable_vmap_object(pgobj_t *object, unsigned long start, unsigned long end);
+
+/**
+ * pageable_vunmap_object - Unmap an address range of a pageable object
+ * @object: object pointer
+ * @ptr: pointer returned by pageable_vmap_object
+ * @start: first byte in the object to be unmapped
+ * @end: last byte in the object to be unmapped (not inclusive)
+ *
+ * This unmaps a range of a pageable object previously mapped into kernel
+ * virtual memory by pageable_vmap_object. The range must match the one
+ * passed to the corresponding pageable_vmap_object call.
+ *
+ * Note: the end address is not inclusive, so a (0, 1) range is the first byte.
+ */
+void pageable_vunmap_object(pgobj_t *object, void *ptr, unsigned long start, unsigned long end);
+
+#endif
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o
maccess.o page_alloc.o page-writeback.o pdflush.o \
readahead.o swap.o truncate.o vmscan.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
- page_isolation.o mm_init.o $(mmu-y)
+ page_isolation.o mm_init.o pageable_alloc.o $(mmu-y)
obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
obj-$(CONFIG_BOUNCE) += bounce.o
Index: linux-2.6/mm/pageable_alloc.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/pageable_alloc.c
@@ -0,0 +1,260 @@
+/*
+ * Simple pageable memory allocator
+ *
+ * Copyright (C) 2008 Nick Piggin
+ * Copyright (C) 2008 Novell Inc.
+ */
+#include <linux/pageable_alloc.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/vmalloc.h>
+#include <linux/slab.h>
+#include <linux/radix-tree.h>
+
+#ifdef CONFIG_MMU
+struct pgobj {
+ struct file f;
+};
+
+pgobj_t *pageable_alloc_object(unsigned long size, int nid)
+{
+ struct file *filp;
+
+ filp = shmem_file_setup("pageable object", size, 0);
+
+ return (struct pgobj *)filp;
+}
+
+void pageable_free_object(pgobj_t *object)
+{
+ struct file *filp = (struct file *)object;
+
+ fput(filp);
+}
+
+int pageable_pin_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+ struct file *filp = (struct file *)object;
+ struct address_space *mapping = filp->f_dentry->d_inode->i_mapping;
+ pgoff_t first, last, i;
+ int err = 0;
+
+ BUG_ON(start >= end);
+
+ first = start / PAGE_SIZE;
+ last = DIV_ROUND_UP(end, PAGE_SIZE);
+
+ for (i = first; i < last; i++) {
+ struct page *page;
+
+ page = read_mapping_page(mapping, i, filp);
+ if (IS_ERR(page)) {
+ err = PTR_ERR(page);
+ goto out_error;
+ }
+ }
+
+ BUG_ON(err);
+ return 0;
+
+out_error:
+ if (i > first)
+ pageable_unpin_object(object, start, i * PAGE_SIZE);
+ return err;
+}
+
+void pageable_unpin_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+ struct file *filp = (struct file *)object;
+ struct address_space *mapping = filp->f_dentry->d_inode->i_mapping;
+ pgoff_t first, last, i;
+
+ BUG_ON(start >= end);
+
+ first = start / PAGE_SIZE;
+ last = DIV_ROUND_UP(end, PAGE_SIZE);
+
+ for (i = first; i < last; i++) {
+ struct page *page;
+
+ rcu_read_lock();
+ page = radix_tree_lookup(&mapping->page_tree, i);
+ rcu_read_unlock();
+ BUG_ON(!page);
+ BUG_ON(page_count(page) < 2);
+ page_cache_release(page);
+ }
+}
+
+void pageable_dirty_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+ struct file *filp = (struct file *)object;
+ struct address_space *mapping = filp->f_dentry->d_inode->i_mapping;
+ pgoff_t first, last, i;
+
+ BUG_ON(start >= end);
+
+ first = start / PAGE_SIZE;
+ last = DIV_ROUND_UP(end, PAGE_SIZE);
+
+ for (i = first; i < last; i++) {
+ struct page *page;
+
+ rcu_read_lock();
+ page = radix_tree_lookup(&mapping->page_tree, i);
+ rcu_read_unlock();
+ BUG_ON(!page);
+ BUG_ON(page_count(page) < 2);
+ set_page_dirty(page);
+ }
+}
+
+struct page *pageable_get_page(pgobj_t *object, unsigned long off)
+{
+ struct file *filp = (struct file *)object;
+ struct address_space *mapping = filp->f_dentry->d_inode->i_mapping;
+ struct page *page;
+
+ rcu_read_lock();
+ page = radix_tree_lookup(&mapping->page_tree, off / PAGE_SIZE);
+ rcu_read_unlock();
+
+ BUG_ON(!page);
+ BUG_ON(page_count(page) < 2);
+
+ return page;
+}
+
+void pageable_dirty_page(pgobj_t *object, struct page *page)
+{
+ BUG_ON(page_count(page) < 2);
+ set_page_dirty(page);
+}
+
+void *pageable_vmap_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+ struct file *filp = (struct file *)object;
+ struct address_space *mapping = filp->f_dentry->d_inode->i_mapping;
+ unsigned int offset = start & ~PAGE_CACHE_MASK;
+ pgoff_t first, last, i;
+ struct page **pages;
+ int nr;
+ void *ret;
+
+ BUG_ON(start >= end);
+
+ first = start / PAGE_SIZE;
+ last = DIV_ROUND_UP(end, PAGE_SIZE);
+ nr = last - first;
+
+#ifndef CONFIG_HIGHMEM
+ if (nr == 1) {
+ struct page *page;
+
+ rcu_read_lock();
+ page = radix_tree_lookup(&mapping->page_tree, first);
+ rcu_read_unlock();
+ BUG_ON(!page);
+ BUG_ON(page_count(page) < 2);
+
+ ret = page_address(page);
+
+ goto out;
+ }
+#endif
+
+ pages = kmalloc(sizeof(struct page *) * nr, GFP_KERNEL);
+ if (!pages)
+ return NULL;
+
+ for (i = first; i < last; i++) {
+ struct page *page;
+
+ rcu_read_lock();
+ page = radix_tree_lookup(&mapping->page_tree, i);
+ rcu_read_unlock();
+ BUG_ON(!page);
+ BUG_ON(page_count(page) < 2);
+
+ pages[i - first] = page;
+ }
+
+ ret = vmap(pages, nr, VM_MAP, PAGE_KERNEL);
+ kfree(pages);
+ if (!ret)
+ return NULL;
+
+out:
+ return ret + offset;
+}
+
+void pageable_vunmap_object(pgobj_t *object, void *ptr, unsigned long start, unsigned long end)
+{
+#ifndef CONFIG_HIGHMEM
+ pgoff_t first, last;
+ int nr;
+
+ BUG_ON(start >= end);
+
+ first = start / PAGE_SIZE;
+ last = DIV_ROUND_UP(end, PAGE_SIZE);
+ nr = last - first;
+ if (nr == 1)
+ return;
+#endif
+
+ vunmap((void *)((unsigned long)ptr & PAGE_CACHE_MASK));
+}
+
+#else
+
+pgobj_t *pageable_alloc_object(unsigned long size, int nid)
+{
+ void *ret;
+
+ ret = kmalloc(size, GFP_KERNEL);
+ if (!ret)
+ return ERR_PTR(-ENOMEM);
+
+ return ret;
+}
+
+void pageable_free_object(pgobj_t *object)
+{
+ kfree(object);
+}
+
+int pageable_pin_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+ return 0;
+}
+
+void pageable_unpin_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+}
+
+void pageable_dirty_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+}
+
+struct page *pageable_get_page(pgobj_t *object, unsigned long off)
+{
+ void *ptr = object;
+ return virt_to_page(ptr + off);
+}
+
+void pageable_dirty_page(pgobj_t *object, struct page *page)
+{
+}
+
+void *pageable_vmap_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+ void *ptr = object;
+ return ptr + start;
+}
+
+void pageable_vunmap_object(pgobj_t *object, void *ptr, unsigned long start, unsigned long end)
+{
+}
+#endif
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-23 9:10 [patch] mm: pageable memory allocator (for DRM-GEM?) Nick Piggin
@ 2008-09-23 10:21 ` Thomas Hellström
2008-09-23 11:31 ` Jerome Glisse
2008-09-25 0:18 ` Nick Piggin
2008-09-23 15:50 ` Keith Packard
` (2 subsequent siblings)
3 siblings, 2 replies; 23+ messages in thread
From: Thomas Hellström @ 2008-09-23 10:21 UTC (permalink / raw)
To: Nick Piggin
Cc: keith.packard, eric, hugh, hch, airlied, jbarnes, dri-devel,
Linux Memory Management List, Linux Kernel Mailing List
Nick Piggin wrote:
> Hi,
>
> So I promised I would look at this again, because I (and others) have some
> issues with exporting shmem_file_setup for DRM-GEM to go off and do things
> with.
>
> The rationale for using shmem seems to be that pageable "objects" are needed,
> and they can't be created by userspace because that would be ugly for some
> reason, and/or they are required before userland is running.
>
> I particularly don't like the idea of exposing these vfs objects to random
> drivers because they're likely to get things wrong or become out of synch
> or unreviewed if things change. I suggested a simple pageable object allocator
> that could live in mm and hide the exact details of how shmem / pagecache
> works. So I've coded that up quickly.
>
> Upon actually looking at how "GEM" makes use of its shmem_file_setup filp, I
> see something strange... it seems that userspace actually gets some kind of
> descriptor, a descriptor to an object backed by this shmem file (let's call it
> a "file descriptor"). Anyway, it turns out that userspace sometimes needs to
> pread, pwrite, and mmap these objects, but unfortunately it has no direct way
> to do that, due to not having open(2)ed the files directly. So what GEM does
> is to add some ioctls which take the "file descriptor" things, and derives
> the shmem file from them, and then calls into the vfs to perform the operation.
>
> If my cursory reading is correct, then my allocator won't work so well as a
> drop in replacement because one isn't allowed to know about the filp behind
> the pageable object. It would also indicate some serious crack smoking by
> anyone who thinks open(2), pread(2), mmap(2), etc is ugly in comparison...
>
> So please, nobody who worked on that code is allowed to use ugly as an
> argument. Technical arguments are fine, so let's try to cover them.
>
>
Nick,
From my point of view, this is exactly what's needed, although there
might be some different opinions among the DRM developers. A question:
Sometimes it's desirable to indicate that a page / object is "cleaned",
which would mean the data has moved and is now backed by device memory. In
that case one could either free the object or indicate to it that it can
release its pages. Is freeing / recreating such an object an expensive
operation? Would it, in that case, be possible to add an object / page
"cleaned" function?
/Thomas
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-23 10:21 ` Thomas Hellström
@ 2008-09-23 11:31 ` Jerome Glisse
2008-09-23 13:18 ` Christoph Lameter
2008-09-25 0:18 ` Nick Piggin
1 sibling, 1 reply; 23+ messages in thread
From: Jerome Glisse @ 2008-09-23 11:31 UTC (permalink / raw)
To: Thomas Hellström
Cc: Nick Piggin, keith.packard, eric, hugh, hch, airlied, jbarnes,
dri-devel, Linux Memory Management List,
Linux Kernel Mailing List
On Tue, 23 Sep 2008 12:21:26 +0200
Thomas Hellstrom <thomas@tungstengraphics.com> wrote:
> Nick Piggin wrote:
> > Hi,
> >
> > So I promised I would look at this again, because I (and others) have some
> > issues with exporting shmem_file_setup for DRM-GEM to go off and do things
> > with.
> >
> > The rationale for using shmem seems to be that pageable "objects" are needed,
> > and they can't be created by userspace because that would be ugly for some
> > reason, and/or they are required before userland is running.
> >
> > I particularly don't like the idea of exposing these vfs objects to random
> > drivers because they're likely to get things wrong or become out of synch
> > or unreviewed if things change. I suggested a simple pageable object allocator
> > that could live in mm and hide the exact details of how shmem / pagecache
> > works. So I've coded that up quickly.
> >
> > Upon actually looking at how "GEM" makes use of its shmem_file_setup filp, I
> > see something strange... it seems that userspace actually gets some kind of
> > descriptor, a descriptor to an object backed by this shmem file (let's call it
> > a "file descriptor"). Anyway, it turns out that userspace sometimes needs to
> > pread, pwrite, and mmap these objects, but unfortunately it has no direct way
> > to do that, due to not having open(2)ed the files directly. So what GEM does
> > is to add some ioctls which take the "file descriptor" things, and derives
> > the shmem file from them, and then calls into the vfs to perform the operation.
> >
> > If my cursory reading is correct, then my allocator won't work so well as a
> > drop in replacement because one isn't allowed to know about the filp behind
> > the pageable object. It would also indicate some serious crack smoking by
> > anyone who thinks open(2), pread(2), mmap(2), etc is ugly in comparison...
> >
> > So please, nobody who worked on that code is allowed to use ugly as an
> > argument. Technical arguments are fine, so let's try to cover them.
> >
> >
> Nick,
> From my point of view, this is exactly what's needed, although there
> might be some different opinions among the DRM developers. A question:
>
> Sometimes it's desirable to indicate that a page / object is "cleaned",
> which would mean the data has moved and is now backed by device memory. In
> that case one could either free the object or indicate to it that it can
> release its pages. Is freeing / recreating such an object an expensive
> operation? Would it, in that case, be possible to add an object / page
> "cleaned" function?
>
> /Thomas
Also, what about an uncached page allocator? Some drivers might need
one. There are no numbers, but I think there was some concern that
changing PAT attributes too often might be costly, so we would be better
off keeping a pool of such pages.
Cheers,
Jerome Glisse <glisse@freedesktop.org>
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-23 11:31 ` Jerome Glisse
@ 2008-09-23 13:18 ` Christoph Lameter
0 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2008-09-23 13:18 UTC (permalink / raw)
To: Jerome Glisse
Cc: Thomas Hellström, Nick Piggin, keith.packard, eric, hugh,
hch, airlied, jbarnes, dri-devel, Linux Memory Management List,
Linux Kernel Mailing List
Jerome Glisse wrote:
>
> Also, what about an uncached page allocator? Some drivers might need
> one. There are no numbers, but I think there was some concern that
> changing PAT attributes too often might be costly, so we would be
> better off keeping a pool of such pages.
IA64 has an uncached allocator. See arch/ia64/include/asm/uncached.h and
arch/ia64/kernel/uncached.c. Probably not exactly what you want, but it's
a starting point.
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-23 9:10 [patch] mm: pageable memory allocator (for DRM-GEM?) Nick Piggin
2008-09-23 10:21 ` Thomas Hellström
@ 2008-09-23 15:50 ` Keith Packard
2008-09-23 18:29 ` Jerome Glisse
2008-09-25 0:30 ` Nick Piggin
2008-09-25 8:45 ` KAMEZAWA Hiroyuki
2008-09-30 1:10 ` Eric Anholt
3 siblings, 2 replies; 23+ messages in thread
From: Keith Packard @ 2008-09-23 15:50 UTC (permalink / raw)
To: Nick Piggin
Cc: keithp, eric, hugh, hch, airlied, jbarnes, thomas, dri-devel,
Linux Memory Management List, Linux Kernel Mailing List
On Tue, 2008-09-23 at 11:10 +0200, Nick Piggin wrote:
> So I promised I would look at this again, because I (and others) have some
> issues with exporting shmem_file_setup for DRM-GEM to go off and do things
> with.
Thanks for looking at this again.
> The rationale for using shmem seems to be that pageable "objects" are needed,
> and they can't be created by userspace because that would be ugly for some
> reason, and/or they are required before userland is running.
Right, creating them from user space was just a mild inconvenience as
we'd have to come up with suitable names. The semantics don't match
exactly as most of the time we never need the filename later, but some
objects will be referenced later on so we'd need to be able to come up
with a persistent name at that point.
The real issue is that we need to create objects early in the kernel
initialization sequence to provide storage for the console frame buffer,
long before user space starts up. Lacking this, we wouldn't be able to
present early kernel initialization messages to the user.
> I particularly don't like the idea of exposing these vfs objects to random
> drivers because they're likely to get things wrong or become out of synch
> or unreviewed if things change. I suggested a simple pageable object allocator
> that could live in mm and hide the exact details of how shmem / pagecache
> works. So I've coded that up quickly.
Thanks for trying another direction; let's see if that will work for us.
> Upon actually looking at how "GEM" makes use of its shmem_file_setup filp, I
> see something strange... it seems that userspace actually gets some kind of
> descriptor, a descriptor to an object backed by this shmem file (let's call it
> a "file descriptor"). Anyway, it turns out that userspace sometimes needs to
> pread, pwrite, and mmap these objects, but unfortunately it has no direct way
> to do that, due to not having open(2)ed the files directly. So what GEM does
> is to add some ioctls which take the "file descriptor" things, and derives
> the shmem file from them, and then calls into the vfs to perform the operation.
Sure, we've looked at using regular file descriptors for these objects
and it almost works, except for a few things:
1) We create a lot of these objects. The X server itself may have tens
of thousands of objects in use at any one time (my current session
with gitk and firefox running is using 1565 objects). Right now, the
maximum number of fds supported by 'normal' kernel configurations
is somewhat smaller than this. Even when the kernel is fixed to
support lifting this limit, we'll be at the mercy of existing user
space configurations for normal applications.
2) More annoyingly, applications which use these objects also use
select(2) and depend on being able to represent the 'real' file
descriptors in a compact space near zero. Sticking a few thousand
of these new objects into the system would require some ability to
relocate the descriptors up higher in fd space. This could also
be done in user space using dup2, but that would require managing
file descriptor allocation in user space.
3) The pread/pwrite/mmap functions that we use need additional flags
to indicate some level of application 'intent'. In particular, we
need to know whether the data is being delivered only to the GPU
or whether the CPU will need to look at it in the future. This
drives the kind of memory access used within the kernel and has
a significant performance impact.
If (when?) we can figure out solutions to these issues, we'd love to
revisit the descriptor allocation plan.
> If my cursory reading is correct, then my allocator won't work so well as a
> drop in replacement because one isn't allowed to know about the filp behind
> the pageable object. It would also indicate some serious crack smoking by
> anyone who thinks open(2), pread(2), mmap(2), etc is ugly in comparison...
Yes, we'd like to be able to use regular system calls for our API, right
now we haven't figured out how to do that.
> So please, nobody who worked on that code is allowed to use ugly as an
> argument. Technical arguments are fine, so let's try to cover them.
I think we're looking for a mechanism that we know how to use and which
will allow us to provide compatibility with user space going forward.
Hiding the precise semantics of the object storage behind our
ioctl-based API means that we can completely replace it in the future
without affecting user space.
> BTW. without knowing much of either the GEM or the SPU subsystems, the
> GEM problem seems similar to SPU. Did anyone look at that code? Was it ever
> considered to make the object allocator be a filesystem? That way you could
> control the backing store to the objects yourself, those that want pageable
> memory could use the following allocator, the ioctls could go away,
> you could create your own objects if needed before userspace is up...
Yes, we've considered doing a separate file system, but as we'd start by
copying shmem directly, we're unsure how that would be received. It
seems like sharing the shmem code in some sensible way is a better plan.
We just need anonymous pages that we can read/write/map to kernel and
user space. Right now, shmem provides that functionality and is used by
two kernel subsystems (sysv IPC and tmpfs). It seems like any new API
should support all three uses rather than being specific to GEM.
> The API allows creation and deletion of memory objects, pinning and
> unpinning of address ranges within an object, mapping ranges of an object
> in KVA, dirtying ranges of an object, and operating on pages within the
> object.
The only question I have is whether we can map these objects to user
space; the other operations we need are fairly easily managed by just
looking at objects one page at a time. Of course, getting to the 'fast'
memcpy variants that the current vfs_write path finds may be a trick,
but we should be able to figure that out.
--
keith.packard@intel.com
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-23 15:50 ` Keith Packard
@ 2008-09-23 18:29 ` Jerome Glisse
2008-09-25 0:30 ` Nick Piggin
1 sibling, 0 replies; 23+ messages in thread
From: Jerome Glisse @ 2008-09-23 18:29 UTC (permalink / raw)
To: Keith Packard
Cc: Nick Piggin, eric, hugh, hch, airlied, jbarnes, thomas,
dri-devel, Linux Memory Management List,
Linux Kernel Mailing List
On Tue, 23 Sep 2008 08:50:29 -0700
Keith Packard <keithp@keithp.com> wrote:
> On Tue, 2008-09-23 at 11:10 +0200, Nick Piggin wrote:
> > If my cursory reading is correct, then my allocator won't work so well as a
> > drop in replacement because one isn't allowed to know about the filp behind
> > the pageable object. It would also indicate some serious crack smoking by
> > anyone who thinks open(2), pread(2), mmap(2), etc is ugly in comparison...
>
> Yes, we'd like to be able to use regular system calls for our API, right
> now we haven't figured out how to do that.
>
> > So please, nobody who worked on that code is allowed to use ugly as an
> > argument. Technical arguments are fine, so let's try to cover them.
>
> I think we're looking for a mechanism that we know how to use and which
> will allow us to provide compatibility with user space going forward.
> Hiding the precise semantics of the object storage behind our
> ioctl-based API means that we can completely replace it in the future
> without affecting user space.
>
I am starting to ponder whether driver-specific ioctls for memory objects
are a better plan. On Intel you have your GTT mapping trick (whether you
want to access an object through the GTT or directly map its RAM pages,
IIRC); on Radeon I can think of a similar but slightly different use case
where we can ask to map some VRAM with special properties on it so we can
access a tiled surface transparently.
Of course the underlying implementations will share quite a bit of code.
I just think that each piece of hardware has its own specificities, and
that trying to hammer all of this into a common userspace API is not the
best thing to do. I am pretty sure NVIDIA hardware offers some nice
tricks that won't fit in any common userspace interface.
So the point is that Nick's proposal makes a lot of sense, and I think we
should let each driver design its own memory object API to fit its needs.
We no longer need a common interface in DRI2.
Cheers,
Jerome Glisse <glisse@freedesktop.org>
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-23 10:21 ` Thomas Hellström
2008-09-23 11:31 ` Jerome Glisse
@ 2008-09-25 0:18 ` Nick Piggin
2008-09-25 7:19 ` Thomas Hellström
1 sibling, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2008-09-25 0:18 UTC (permalink / raw)
To: Thomas Hellström
Cc: keith.packard, eric, hugh, hch, airlied, jbarnes, dri-devel,
Linux Memory Management List, Linux Kernel Mailing List
On Tue, Sep 23, 2008 at 12:21:26PM +0200, Thomas Hellstrom wrote:
> Nick,
> From my point of view, this is exactly what's needed, although there
> might be some different opinions among the
> DRM developers. A question:
>
> Sometimes it's desirable to indicate that a page / object is "cleaned",
> which would mean data has moved and is backed by device memory. In that
> case one could either free the object or indicate to it that it can
> release its pages. Is freeing / recreating such an object an expensive
> operation? Would it, in that case, be possible to add an object / page
> "cleaned" function?
Ah, interesting... freeing/recreating isn't _too_ expensive, but it is
going to have to allocate a lot of pages (for a big object) and copy
a lot of memory. It's strange to say "cleaned", in a sense, because the
allocator itself doesn't know it is being used as a writeback cache ;)
(and it might get confusing with the shmem implementation because your
cleaned != shmem cleaned!).
I understand the operation you need, but it's tricky to make it work in
the existing shmem / vm infrastructure I think. Let's call it "dontneed",
and I'll add a hook in there we can play with later to see if it helps?
What I could imagine is to have a second backing store (not shmem), which
"dontneed" pages go onto, and they simply get discarded rather than swapped
out (eg. via the ->shrinker() memory pressure indicator). You could then
also register a callback to recreate these parts of memory if they have been
discarded then become used again. It wouldn't be terribly difficult come to
think of it... would that be useful?
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-23 15:50 ` Keith Packard
2008-09-23 18:29 ` Jerome Glisse
@ 2008-09-25 0:30 ` Nick Piggin
2008-09-25 1:20 ` Keith Packard
1 sibling, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2008-09-25 0:30 UTC (permalink / raw)
To: Keith Packard
Cc: eric, hugh, hch, airlied, jbarnes, thomas, dri-devel,
Linux Memory Management List, Linux Kernel Mailing List
On Tue, Sep 23, 2008 at 08:50:29AM -0700, Keith Packard wrote:
> On Tue, 2008-09-23 at 11:10 +0200, Nick Piggin wrote:
> > I particularly don't like the idea of exposing these vfs objects to random
> > drivers because they're likely to get things wrong or become out of synch
> > or unreviewed if things change. I suggested a simple pageable object allocator
> > that could live in mm and hide the exact details of how shmem / pagecache
> > works. So I've coded that up quickly.
>
> Thanks for trying another direction; let's see if that will work for us.
Great!
> > Upon actually looking at how "GEM" makes use of its shmem_file_setup filp, I
> > see something strange... it seems that userspace actually gets some kind of
> > descriptor, a descriptor to an object backed by this shmem file (let's call it
> > a "file descriptor"). Anyway, it turns out that userspace sometimes needs to
> > pread, pwrite, and mmap these objects, but unfortunately it has no direct way
> > to do that, due to not having open(2)ed the files directly. So what GEM does
> > is to add some ioctls which take the "file descriptor" things, and derives
> > the shmem file from them, and then calls into the vfs to perform the operation.
>
> Sure, we've looked at using regular file descriptors for these objects
> and it almost works, except for a few things:
>
> 1) We create a lot of these objects. The X server itself may have tens
> of thousands of objects in use at any one time (my current session
> with gitk and firefox running is using 1565 objects). Right now, the
> maximum number of fds supported by 'normal' kernel configurations
> is somewhat smaller than this. Even when the kernel is fixed to
> support lifting this limit, we'll be at the mercy of existing user
> space configurations for normal applications.
>
> 2) More annoyingly, applications which use these objects also use
> select(2) and depend on being able to represent the 'real' file
> descriptors in a compact space near zero. Sticking a few thousand
> of these new objects into the system would require some ability to
> relocate the descriptors up higher in fd space. This could also
> be done in user space using dup2, but that would require managing
> file descriptor allocation in user space.
>
> 3) The pread/pwrite/mmap functions that we use need additional flags
> to indicate some level of application 'intent'. In particular, we
> need to know whether the data is being delivered only to the GPU
> or whether the CPU will need to look at it in the future. This
> drives the kind of memory access used within the kernel and has
> a significant performance impact.
Pity. Anyway, I accept that, let's move on.
[...]
> Hiding the precise semantics of the object storage behind our
> ioctl-based API means that we can completely replace it in the future
> without affecting user space.
I guess so. A big problem of ioctls is just that they had been easier to
add so they got less thought and review ;) If your ioctls are stable,
correct, cross platform etc. then I guess that's the best you can do.
> > BTW. without knowing much of either the GEM or the SPU subsystems, the
> > GEM problem seems similar to SPU. Did anyone look at that code? Was it ever
> > considered to make the object allocator be a filesystem? That way you could
> > control the backing store to the objects yourself, those that want pageable
> > memory could use the following allocator, the ioctls could go away,
> > you could create your own objects if needed before userspace is up...
>
> Yes, we've considered doing a separate file system, but as we'd start by
> copying shmem directly, we're unsure how that would be received. It
> seems like sharing the shmem code in some sensible way is a better plan.
Well, no, not a separate filesystem to do the pageable backing store, but
a filesystem to do your object management. If there were a need for pageable
RAM backing store, then you would still go back to the pageable allocator.
> We just need anonymous pages that we can read/write/map to kernel and
> user space. Right now, shmem provides that functionality and is used by
> two kernel subsystems (sysv IPC and tmpfs). It seems like any new API
> should support all three uses rather than being specific to GEM.
>
> > The API allows creation and deletion of memory objects, pinning and
> > unpinning of address ranges within an object, mapping ranges of an object
> > in KVA, dirtying ranges of an object, and operating on pages within the
> > object.
>
> The only question I have is whether we can map these objects to user
> space; the other operations we need are fairly easily managed by just
> looking at objects one page at a time. Of course, getting to the 'fast'
> memcpy variants that the current vfs_write path finds may be a trick,
> but we should be able to figure that out.
You can map them to userspace if you just take a page at a time and insert
them into the page tables at fault time (or mmap time if you prefer).
Currently, this will mean that mmapped pages would not be swappable; is
that a problem?
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-25 0:30 ` Nick Piggin
@ 2008-09-25 1:20 ` Keith Packard
2008-09-25 2:30 ` Nick Piggin
0 siblings, 1 reply; 23+ messages in thread
From: Keith Packard @ 2008-09-25 1:20 UTC (permalink / raw)
To: Nick Piggin
Cc: keithp, eric, hugh, hch, airlied, jbarnes, thomas, dri-devel,
Linux Memory Management List, Linux Kernel Mailing List
On Thu, 2008-09-25 at 02:30 +0200, Nick Piggin wrote:
> Pity. Anyway, I accept that, let's move on.
Well, the goal is to "eventually" get to use fds so that at least some
of our common operations can use regular sys calls. But, not having
those trapped in the shmem layer may end up being a feature as we'll get
to watch more closely, although dealing with the actual pread/pwrite
semantics doesn't look entirely like fun.
> I guess so. A big problem of ioctls is just that they had been easier to
> add so they got less thought and review ;) If your ioctls are stable,
> correct, cross platform etc. then I guess that's the best you can do.
One does what one can. Of course, in this case, 'cross platform' is just
x86/x86_64 as we're talking Intel integrated graphics. When (if?) we
figure out how to create a common interface across multiple cards for
some of these operations, we'll probably discover mistakes. We have
tried to be careful, but we cannot test in any other environment.
> Well, no, not a separate filesystem to do the pageable backing store, but
> a filesystem to do your object management. If there were a need for pageable
> RAM backing store, then you would still go back to the pageable allocator.
Now that you've written one, we could go back and think about building a
file system and using fds for our operations. It would be a whole lot
easier than starting from scratch.
> You can map them to userspace if you just take a page at a time and insert
> them into the page tables at fault time (or mmap time if you prefer).
> Currently, this will mean that mmapped pages would not be swappable; is
> that a problem?
Yes. We leave a lot of objects mapped to user space as mmap isn't
exactly cheap. We're trying to use pread/pwrite for as much bulk I/O as
we can, but at this point, we're still mapping most of the pages we
allocate into user space and leaving them. Things like textures and
render buffers will get mmapped if there are any software fallbacks.
Other objects, like vertex buffers, will almost always end up mapped.
One of our explicit design goals was to make sure user space couldn't
ever pin arbitrary amounts of memory; I'd hate to go back on that as it
seems like an important property for any subsystem designed to support
regular user applications in a general purpose desktop environment. I
don't want to trust user space to do the right thing, I want to enforce
that from kernel space.
--
keith.packard@intel.com
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-25 1:20 ` Keith Packard
@ 2008-09-25 2:30 ` Nick Piggin
2008-09-25 2:43 ` Keith Packard
0 siblings, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2008-09-25 2:30 UTC (permalink / raw)
To: Keith Packard
Cc: eric, hugh, hch, airlied, jbarnes, thomas, dri-devel,
Linux Memory Management List, Linux Kernel Mailing List
On Wed, Sep 24, 2008 at 06:20:22PM -0700, Keith Packard wrote:
> On Thu, 2008-09-25 at 02:30 +0200, Nick Piggin wrote:
>
> > I guess so. A big problem of ioctls is just that they had been easier to
> > add so they got less thought and review ;) If your ioctls are stable,
> > correct, cross platform etc. then I guess that's the best you can do.
>
> One does what one can. Of course, in this case, 'cross platform' is just
> x86/x86_64 as we're talking Intel integrated graphics. When (if?) we
> figure out how to create a common interface across multiple cards for
> some of these operations, we'll probably discover mistakes. We have
> tried to be careful, but we cannot test in any other environment.
OK, that's all that can be asked I guess. Low level object / memory
management hopefully can be shared.
> > Well, no, not a separate filesystem to do the pageable backing store, but
> > a filesystem to do your object management. If there were a need for pageable
> > RAM backing store, then you would still go back to the pageable allocator.
>
> Now that you've written one, we could go back and think about building a
> file system and using fds for our operations. It would be a whole lot
> easier than starting from scratch.
If it helps open some other possibilities, then great.
> > You can map them to userspace if you just take a page at a time and insert
> > them into the page tables at fault time (or mmap time if you prefer).
> > Currently, this will mean that mmapped pages would not be swappable; is
> > that a problem?
>
> Yes. We leave a lot of objects mapped to user space as mmap isn't
> exactly cheap. We're trying to use pread/pwrite for as much bulk I/O as
> we can, but at this point, we're still mapping most of the pages we
> allocate into user space and leaving them. Things like textures and
> render buffers will get mmapped if there are any software fallbacks.
> Other objects, like vertex buffers, will almost always end up mapped.
>
> One of our explicit design goals was to make sure user space couldn't
> ever pin arbitrary amounts of memory; I'd hate to go back on that as it
> seems like an important property for any subsystem designed to support
> regular user applications in a general purpose desktop environment. I
> don't want to trust user space to do the right thing, I want to enforce
> that from kernel space.
OK. I will have to add some facilities to allow mmaps that go back through
to tmpfs and be swappable... Thanks for the data point.
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-25 2:30 ` Nick Piggin
@ 2008-09-25 2:43 ` Keith Packard
2008-09-25 3:07 ` Nick Piggin
0 siblings, 1 reply; 23+ messages in thread
From: Keith Packard @ 2008-09-25 2:43 UTC (permalink / raw)
To: Nick Piggin
Cc: keithp, eric, hugh, hch, airlied, jbarnes, thomas, dri-devel,
Linux Memory Management List, Linux Kernel Mailing List
On Thu, 2008-09-25 at 04:30 +0200, Nick Piggin wrote:
> OK. I will have to add some facilities to allow mmaps that go back through
> to tmpfs and be swappable... Thanks for the data point.
It seems like once you've done that you might consider extracting the
page allocator from shmem so that drm, tmpfs and sysv IPC would share
the same underlying memory manager API.
--
keith.packard@intel.com
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-25 2:43 ` Keith Packard
@ 2008-09-25 3:07 ` Nick Piggin
2008-09-25 6:16 ` Keith Packard
0 siblings, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2008-09-25 3:07 UTC (permalink / raw)
To: Keith Packard
Cc: eric, hugh, hch, airlied, jbarnes, thomas, dri-devel,
Linux Memory Management List, Linux Kernel Mailing List
On Wed, Sep 24, 2008 at 07:43:26PM -0700, Keith Packard wrote:
> On Thu, 2008-09-25 at 04:30 +0200, Nick Piggin wrote:
>
> > OK. I will have to add some facilities to allow mmaps that go back through
> > to tmpfs and be swappable... Thanks for the data point.
>
> It seems like once you've done that you might consider extracting the
> page allocator from shmem so that drm, tmpfs and sysv IPC would share
> the same underlying memory manager API.
That might be the cleanest logical way to do it actually. But for the moment
I'm happy not to pull tmpfs apart :) Even if it seems like the wrong way
around, at least it is insulated to within mm/
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-25 3:07 ` Nick Piggin
@ 2008-09-25 6:16 ` Keith Packard
0 siblings, 0 replies; 23+ messages in thread
From: Keith Packard @ 2008-09-25 6:16 UTC (permalink / raw)
To: Nick Piggin
Cc: keithp, eric, hugh, hch, airlied, jbarnes, thomas, dri-devel,
Linux Memory Management List, Linux Kernel Mailing List
On Thu, 2008-09-25 at 05:07 +0200, Nick Piggin wrote:
> That might be the cleanest logical way to do it actually. But for the moment
> I'm happy not to pull tmpfs apart :) Even if it seems like the wrong way
> around, at least it is insulated to within mm/
Sure; no sense changing that before we've gotten some experience with
the new API anyway. Would we consider modifying sysv IPC as well? It
currently uses the shmem_file_setup function although it lives a long
ways from mm...
--
keith.packard@intel.com
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-25 0:18 ` Nick Piggin
@ 2008-09-25 7:19 ` Thomas Hellström
2008-09-25 14:38 ` Keith Packard
0 siblings, 1 reply; 23+ messages in thread
From: Thomas Hellström @ 2008-09-25 7:19 UTC (permalink / raw)
To: Nick Piggin
Cc: keith.packard, eric, hugh, hch, airlied, jbarnes, dri-devel,
Linux Memory Management List, Linux Kernel Mailing List
Nick Piggin wrote:
> On Tue, Sep 23, 2008 at 12:21:26PM +0200, Thomas Hellstrom wrote:
>
>> Nick,
>> From my point of view, this is exactly what's needed, although there
>> might be some different opinions among the
>> DRM developers. A question:
>>
>> Sometimes it's desirable to indicate that a page / object is "cleaned",
>> which would mean data has moved and is backed by device memory. In that
>> case one could either free the object or indicate to it that it can
>> release its pages. Is freeing / recreating such an object an expensive
>> operation? Would it, in that case, be possible to add an object / page
>> "cleaned" function?
>>
>
> Ah, interesting... freeing/recreating isn't _too_ expensive, but it is
> going to have to allocate a lot of pages (for a big object) and copy
> a lot of memory. It's strange to say "cleaned", in a sense, because the
> allocator itself doesn't know it is being used as a writeback cache ;)
> (and it might get confusing with the shmem implementation because your
> cleaned != shmem cleaned!).
>
> I understand the operation you need, but it's tricky to make it work in
> the existing shmem / vm infrastructure I think. Let's call it "dontneed",
> and I'll add a hook in there we can play with later to see if it helps?
>
> What I could imagine is to have a second backing store (not shmem), which
> "dontneed" pages go onto, and they simply get discarded rather than swapped
> out (eg. via the ->shrinker() memory pressure indicator). You could then
> also register a callback to recreate these parts of memory if they have been
> discarded then become used again. It wouldn't be terribly difficult come to
> think of it... would that be useful?
>
>
Well, the typical usage pattern is:
1) User creates a texture object, the data of which lives in a pageable
object.
2) DRM decides it needs to go into V(ideo)RAM, and doesn't need a
backing store. It indicates "dontneed" status on the object.
3) Data is evicted from VRAM. If it's not dirtied in VRAM and the
"dontneed" pages are still around in the old backing object, fine;
we can and should reuse them. If data is dirtied in VRAM or the
page(s) got discarded, we need new pages and to set up a copy operation.
So yes, that would indeed be very useful.
I think one way is to have the callback happen on a per-page basis.
DRM can then collect a list of which pages need to be copied from VRAM,
based on the callback and its knowledge of VRAM data status, and set up a
single DMA operation. So the callback shouldn't implicitly mark the
newly allocated pages dirty.
Another useful thing that comes to mind, looking through the interface
specification again, is to have a pgprot_t argument to the pageable
object vmap function, so that the vmap mapping can be set to uncached /
write-combined.
This of course would imply that the caller has pinned the pages and has
had the caching status of the default kernel mapping changed as well.
/Thomas
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-23 9:10 [patch] mm: pageable memory allocator (for DRM-GEM?) Nick Piggin
2008-09-23 10:21 ` Thomas Hellström
2008-09-23 15:50 ` Keith Packard
@ 2008-09-25 8:45 ` KAMEZAWA Hiroyuki
2008-09-30 1:10 ` Eric Anholt
3 siblings, 0 replies; 23+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-25 8:45 UTC (permalink / raw)
To: Nick Piggin
Cc: keith.packard, eric, hugh, hch, airlied, jbarnes, thomas,
dri-devel, Linux Memory Management List,
Linux Kernel Mailing List
On Tue, 23 Sep 2008 11:10:17 +0200
Nick Piggin <npiggin@suse.de> wrote:
> +void *pageable_vmap_object(pgobj_t *object, unsigned long start, unsigned long end)
> +{
> + struct file *filp = (struct file *)object;
> + struct address_space *mapping = filp->f_dentry->d_inode->i_mapping;
> + unsigned int offset = start & ~PAGE_CACHE_MASK;
> + pgoff_t first, last, i;
> + struct page **pages;
> + int nr;
> + void *ret;
> +
> + BUG_ON(start >= end);
> +
> + first = start / PAGE_SIZE;
> + last = DIV_ROUND_UP(end, PAGE_SIZE);
> + nr = last - first;
> +
> +#ifndef CONFIG_HIGHMEM
> + if (nr == 1) {
> + struct page *page;
> +
> + rcu_read_lock();
> + page = radix_tree_lookup(&mapping->page_tree, first);
> + rcu_read_unlock();
> + BUG_ON(!page);
> + BUG_ON(page_count(page) < 2);
> +
> + ret = page_address(page);
> +
> + goto out;
> + }
> +#endif
> +
> + pages = kmalloc(sizeof(struct page *) * nr, GFP_KERNEL);
> + if (!pages)
> + return NULL;
> +
> + for (i = first; i < last; i++) {
> + struct page *page;
> +
> + rcu_read_lock();
> + page = radix_tree_lookup(&mapping->page_tree, i);
> + rcu_read_unlock();
> + BUG_ON(!page);
> + BUG_ON(page_count(page) < 2);
> +
> + pages[i - first] = page;
> + }
> +
> + ret = vmap(pages, nr, VM_MAP, PAGE_KERNEL);
> + kfree(pages);
> + if (!ret)
> + return NULL;
> +
> +out:
> + return ret + offset;
> +}
> +
> +void pageable_vunmap_object(pgobj_t *object, void *ptr, unsigned long start, unsigned long end)
> +{
> +#ifndef CONFIG_HIGHMEM
> + pgoff_t first, last;
> + int nr;
> +
> + BUG_ON(start >= end);
> +
> + first = start / PAGE_SIZE;
> + last = DIV_ROUND_UP(end, PAGE_SIZE);
> + nr = last - first;
> + if (nr == 1)
> + return;
> +#endif
> +
> + vunmap((void *)((unsigned long)ptr & PAGE_CACHE_MASK));
> +}
> +
Some questions..
- Could you use GFP_HIGHUSER rather than GFP_HIGHUSER_MOVABLE?
I think setting mapping_gfp_mask() (address_space->flags) to an appropriate
value is enough.
- Can we mlock pages while they're vmapped? (Or reserve them and remove them from the LRU.)
Then the new split-LRU could ignore these pages while they are mapped. Overkill?
- Don't we need to increase page->mapcount?
- Should the memory resource controller account for these pages?
(Maybe this is a question to myself....)
Thanks,
-Kame
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-25 7:19 ` Thomas Hellström
@ 2008-09-25 14:38 ` Keith Packard
2008-09-25 15:39 ` Thomas Hellström
0 siblings, 1 reply; 23+ messages in thread
From: Keith Packard @ 2008-09-25 14:38 UTC (permalink / raw)
To: Thomas Hellström
Cc: keithp, Nick Piggin, eric, hugh, hch, airlied, jbarnes,
dri-devel, Linux Memory Management List,
Linux Kernel Mailing List
On Thu, 2008-09-25 at 00:19 -0700, Thomas Hellström wrote:
> If data is
> dirtied in VRAM or the page(s) got discarded
> we need new pages and to set up a copy operation.
Note that this can occur as a result of a suspend-to-memory transition
at which point *all* of the objects in VRAM will need to be preserved in
main memory, and so the pages aren't really 'freed', they just don't
need to have valid contents, but the system should be aware that the
space may be needed at some point in the future.
--
keith.packard@intel.com
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-25 14:38 ` Keith Packard
@ 2008-09-25 15:39 ` Thomas Hellström
2008-09-25 22:41 ` Dave Airlie
0 siblings, 1 reply; 23+ messages in thread
From: Thomas Hellström @ 2008-09-25 15:39 UTC (permalink / raw)
To: Keith Packard
Cc: Nick Piggin, eric, hugh, hch, airlied, jbarnes, dri-devel,
Linux Memory Management List, Linux Kernel Mailing List
Keith Packard wrote:
> On Thu, 2008-09-25 at 00:19 -0700, Thomas Hellström wrote:
>
>> If data is
>> dirtied in VRAM or the page(s) got discarded
>> we need new pages and to set up a copy operation.
>>
>
> Note that this can occur as a result of a suspend-to-memory transition
> at which point *all* of the objects in VRAM will need to be preserved in
> main memory, and so the pages aren't really 'freed', they just don't
> need to have valid contents, but the system should be aware that the
> space may be needed at some point in the future.
>
>
Actually, I think the pages must be allowed to be freed, and that we
shouldn't put a requirement on "pageable" to keep swap-space slots for
these pages. If we hit an OOM condition during suspend-to-memory that's
bad, but say we required "pageable" to keep swap-space slots for us; the
result would perhaps be that another device wasn't able to suspend, or a
user-space program was killed due to lack of swap space prior to suspend.
I'm not really sure which is the worse situation, but my feeling is that
we should not require swap space to be reserved for VRAM, and that we
should abort the suspend operation if we hit OOM. That would, in the
worst case, mean that people with non-UMA laptops and a too-small swap
partition would see their battery run out much quicker than they expected...
/Thomas
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-25 15:39 ` Thomas Hellström
@ 2008-09-25 22:41 ` Dave Airlie
0 siblings, 0 replies; 23+ messages in thread
From: Dave Airlie @ 2008-09-25 22:41 UTC (permalink / raw)
To: Thomas Hellström
Cc: Keith Packard, Nick Piggin, eric, hugh, hch, airlied, jbarnes,
dri-devel, Linux Memory Management List,
Linux Kernel Mailing List
On Fri, Sep 26, 2008 at 1:39 AM, Thomas Hellstrom
<thomas@tungstengraphics.com> wrote:
> Keith Packard wrote:
>>
>> On Thu, 2008-09-25 at 00:19 -0700, Thomas Hellstrom wrote:
>>
>>>
>>> If data is
>>> dirtied in VRAM or the page(s) got discarded
>>> we need new pages and to set up a copy operation.
>>>
>>
>> Note that this can occur as a result of a suspend-to-memory transition
>> at which point *all* of the objects in VRAM will need to be preserved in
>> main memory, and so the pages aren't really 'freed', they just don't
>> need to have valid contents, but the system should be aware that the
>> space may be needed at some point in the future.
>>
>>
>
> Actually, I think the pages must be allowed to be freed, and that we don't
> put a requirement on "pageable" to keep
> swap-space slots for these pages. If we hit an OOM-condition during
> suspend-to-memory that's bad, but let's say we
> required "pageable" to keep swap space slots for us, the result would
> perhaps be that another device wasn't able to suspend, or a user-space
> program was killed due to lack of swap-space prior to suspend.
>
> I'm not really sure what's the worst situation, but my feeling is that we
> should not require swap-space to be reserved for VRAM, and abort the suspend
> operation if we hit OOM. That would, in the worst case, mean that people
> with non-UMA laptops and a too small swap partition would see their battery
> run out much quicker than they expected...
>
You can't fail suspend; it just doesn't work like that. The use case is:
close laptop, shove in bag, walk away. Having my bag heat up with the
laptop inside not suspended isn't ever the answer.
So with that in mind, I think we either a) keep some backing pages
around, or b) make the objects file-backed, so if the swap space fills up
we can fall back to the file objects.
Dave.
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-23 9:10 [patch] mm: pageable memory allocator (for DRM-GEM?) Nick Piggin
` (2 preceding siblings ...)
2008-09-25 8:45 ` KAMEZAWA Hiroyuki
@ 2008-09-30 1:10 ` Eric Anholt
2008-10-02 17:15 ` Jesse Barnes
3 siblings, 1 reply; 23+ messages in thread
From: Eric Anholt @ 2008-09-30 1:10 UTC (permalink / raw)
To: Nick Piggin
Cc: keith.packard, hugh, hch, airlied, jbarnes, thomas, dri-devel,
Linux Memory Management List, Linux Kernel Mailing List
On Tue, 2008-09-23 at 11:10 +0200, Nick Piggin wrote:
> Hi,
>
> So I promised I would look at this again, because I (and others) have some
> issues with exporting shmem_file_setup for DRM-GEM to go off and do things
> with.
>
> The rationale for using shmem seems to be that pageable "objects" are needed,
> and they can't be created by userspace because that would be ugly for some
> reason, and/or they are required before userland is running.
>
> I particularly don't like the idea of exposing these vfs objects to random
> drivers because they're likely to get things wrong or become out of synch
> or unreviewed if things change. I suggested a simple pageable object allocator
> that could live in mm and hide the exact details of how shmem / pagecache
> works. So I've coded that up quickly.
Hiding the details of shmem and the pagecache sounds pretty good to me
(since we've got it wrong at least twice so far). Hopefully the result
isn't even more fragile code on our part.
> Upon actually looking at how "GEM" makes use of its shmem_file_setup filp, I
> see something strange... it seems that userspace actually gets some kind of
> descriptor, a descriptor to an object backed by this shmem file (let's call it
> a "file descriptor"). Anyway, it turns out that userspace sometimes needs to
> pread, pwrite, and mmap these objects, but unfortunately it has no direct way
> to do that, due to not having open(2)ed the files directly. So what GEM does
> is to add some ioctls which take the "file descriptor" things, and derives
> the shmem file from them, and then calls into the vfs to perform the operation.
>
> If my cursory reading is correct, then my allocator won't work so well as a
> drop in replacement because one isn't allowed to know about the filp behind
> the pageable object. It would also indicate some serious crack smoking by
> anyone who thinks open(2), pread(2), mmap(2), etc is ugly in comparison...
I think the explanation for this got covered in other parts of the
thread, but drm_gem.c comments at the top also cover it.
> So please, nobody who worked on that code is allowed to use ugly as an
> argument. Technical arguments are fine, so let's try to cover them.
>
> BTW. without knowing much of either the GEM or the SPU subsystems, the
> GEM problem seems similar to SPU. Did anyone look at that code? Was it ever
> considered to make the object allocator be a filesystem? That way you could
> control the backing store to the objects yourself, those that want pageable
> memory could use the following allocator, the ioctls could go away,
> you could create your own objects if needed before userspace is up...
Yes, we definitely considered a filesystem (it would be nice for
debugging to be able to look at object contents from a debugger process
easily). However, once we realized that fds just wouldn't work (we're
allocating objects in a library, so we couldn't just dup2 them up high,
and we couldn't rely on being able to up the open file limit for the
process), shmem seemed to already be exactly what we wanted, and we
assumed that whatever future API changes in the couple of VFS and
pagecache calls we made would be easier to track than duplicating all of
shmem.c into our driver.
I'm porting our stuff to test on your API now, and everything looks
straightforward except for mmap. For that I seem to have three options:
1) Implement a range allocator on the DRM device and have a hashtable of
ranges to objects, then have a GEM hook in the mmap handler of the drm
device when we find we're in one of those ranges.
This was the path that TTM took. Since we have different paths to
mmapping objects (direct backing store access, or aperture access,
though the second isn't in my tree yet), it means we end up having
multiple offsets to represent different mmap types, or multiplying the
size of the range and having the top half of the range mean the other
mmap type.
2) Create a kernel-internal filesystem and get struct files for the
objects.
This is the method that seemed like the right thing to do in the linux
style, so I've been trying to figure that part out. I've been assured
that libfs makes my job easy here, but as I look at it I'm less sure.
The sticking point to me is how the page list I get from your API ends
up getting used by simple_file_* and generic_file_*. And, in the future
where the pageable memory allocator is actually pageable while mmapped,
what does the API I get to consume look like, roughly?
3) Use shmem_file_setup()
This was what we originally went with. It got messy when we wanted a
different mmap path, since we then had to do 1) or 2) anyway.
Also, I'm looking at a bunch of spu*.c code, and I'm having a hard time
finding something relevant for us, but maybe I'm not looking in the
right place. Can you elaborate on that comment?
--
Eric Anholt
eric@anholt.net eric.anholt@intel.com
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-09-30 1:10 ` Eric Anholt
@ 2008-10-02 17:15 ` Jesse Barnes
2008-10-03 5:17 ` Keith Packard
0 siblings, 1 reply; 23+ messages in thread
From: Jesse Barnes @ 2008-10-02 17:15 UTC (permalink / raw)
To: Eric Anholt
Cc: Nick Piggin, keith.packard, hugh, hch, airlied, thomas,
dri-devel, Linux Memory Management List,
Linux Kernel Mailing List
On Monday, September 29, 2008 6:10 pm Eric Anholt wrote:
> On Tue, 2008-09-23 at 11:10 +0200, Nick Piggin wrote:
> > If my cursory reading is correct, then my allocator won't work so well as
> > a drop in replacement because one isn't allowed to know about the filp
> > behind the pageable object. It would also indicate some serious crack
> > smoking by anyone who thinks open(2), pread(2), mmap(2), etc is ugly in
> > comparison...
>
> I think the explanation for this got covered in other parts of the
> thread, but drm_gem.c comments at the top also cover it.
>
> > So please, nobody who worked on that code is allowed to use ugly as an
> > argument. Technical arguments are fine, so let's try to cover them.
I don't think anyone would argue that using normal system calls would be ugly,
but there are several limitations with that approach, including the fact that
some of our operations become slightly more difficult to do, along with the
other limitations mentioned in drm_gem.c and in other threads.
At this point I think we should go ahead and include Eric's earlier patchset
into drm-next, and continue to refine the internals along the lines of what
you've posted here in the post-2.6.28 timeframe. The ioctl based interfaces
(there aren't too many) are something we can support going forward, so we
should be able to rip up/clean up the implementation over time as the VM
becomes more friendly to these sort of operations.
Any objections?
Dave, you can add my Acked-by (or S-o-b if Eric includes my GTT mapping stuff)
to Eric's patchset; hope you can do that soon so we can get a libdrm with the
new APIs released soon.
Thanks,
--
Jesse Barnes, Intel Open Source Technology Center
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-10-02 17:15 ` Jesse Barnes
@ 2008-10-03 5:17 ` Keith Packard
2008-10-03 6:40 ` Nick Piggin
0 siblings, 1 reply; 23+ messages in thread
From: Keith Packard @ 2008-10-03 5:17 UTC (permalink / raw)
To: Jesse Barnes
Cc: keithp, Eric Anholt, Nick Piggin, hugh, hch, airlied, thomas,
dri-devel, Linux Memory Management List,
Linux Kernel Mailing List
On Thu, 2008-10-02 at 10:15 -0700, Jesse Barnes wrote:
> At this point I think we should go ahead and include Eric's earlier patchset
> into drm-next, and continue to refine the internals along the lines of what
> you've posted here in the post-2.6.28 timeframe.
Nick, in case you missed the plea here, we're asking if you have any
objection to shipping the mm changes present in Eric's patch in 2.6.28.
When your new pageable allocator becomes available, we'll switch over to
using that instead and revert Eric's mm changes.
We're ready to promise to support the user-land DRM interface going
forward, and we've got lots of additional work queued up behind this
merge. We'd prefer to push stuff a bit at a time rather than shipping a
lot of new code in a single kernel release.
--
keith.packard@intel.com
* Re: [patch] mm: pageable memory allocator (for DRM-GEM?)
2008-10-03 5:17 ` Keith Packard
@ 2008-10-03 6:40 ` Nick Piggin
0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2008-10-03 6:40 UTC (permalink / raw)
To: Keith Packard
Cc: Jesse Barnes, Eric Anholt, Nick Piggin, hugh, hch, airlied,
thomas, dri-devel, Linux Memory Management List,
Linux Kernel Mailing List
On Friday 03 October 2008 15:17, Keith Packard wrote:
> On Thu, 2008-10-02 at 10:15 -0700, Jesse Barnes wrote:
> > At this point I think we should go ahead and include Eric's earlier
> > patchset into drm-next, and continue to refine the internals along the
> > lines of what you've posted here in the post-2.6.28 timeframe.
>
> Nick, in case you missed the plea here, we're asking if you have any
> objection to shipping the mm changes present in Eric's patch in 2.6.28.
> When your new pageable allocator becomes available, we'll switch over to
> using that instead and revert Eric's mm changes.
So long as we don't have to support the shmem exports for too long,
I'm OK with that. The pageable allocator is probably not a
2.6.28 merge candidate at this point, so I don't want to hold things
up if we have a definite way forward.
> We're ready to promise to support the user-land DRM interface going
> forward, and we've got lots of additional work queued up behind this
> merge. We'd prefer to push stuff a bit at a time rather than shipping a
> lot of new code in a single kernel release.
I would have liked to see more effort going towards building the user
API starting with a pseudo filesystem rather than ioctls, but that's
just my opinion after squinting at the problem from 100 metres away...
So I don't have a strong standing to demand a change here ;)
So, I'm OK with it for 2.6.28.
I think Christoph had some concerns with the patches, and I'd like to
hear that he's happy now. Christoph? Does the pageable allocator API
satisfy your concerns, or did you have other issues with it?
Thanks,
Nick
* [patch] mm: pageable memory allocator (for DRM-GEM?)
@ 2008-09-23 9:10 Nick Piggin
0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2008-09-23 9:10 UTC (permalink / raw)
To: keith.packard, eric, hugh, hch, airlied, jbarnes, thomas, dri-devel
Hi,
So I promised I would look at this again, because I (and others) have some
issues with exporting shmem_file_setup for DRM-GEM to go off and do things
with.
The rationale for using shmem seems to be that pageable "objects" are needed,
and they can't be created by userspace because that would be ugly for some
reason, and/or they are required before userland is running.
I particularly don't like the idea of exposing these vfs objects to random
drivers because they're likely to get things wrong or become out of synch
or unreviewed if things change. I suggested a simple pageable object allocator
that could live in mm and hide the exact details of how shmem / pagecache
works. So I've coded that up quickly.
Upon actually looking at how "GEM" makes use of its shmem_file_setup filp, I
see something strange... it seems that userspace actually gets some kind of
descriptor, a descriptor to an object backed by this shmem file (let's call it
a "file descriptor"). Anyway, it turns out that userspace sometimes needs to
pread, pwrite, and mmap these objects, but unfortunately it has no direct way
to do that, due to not having open(2)ed the files directly. So what GEM does
is to add some ioctls which take the "file descriptor" things, and derives
the shmem file from them, and then calls into the vfs to perform the operation.
If my cursory reading is correct, then my allocator won't work so well as a
drop in replacement because one isn't allowed to know about the filp behind
the pageable object. It would also indicate some serious crack smoking by
anyone who thinks open(2), pread(2), mmap(2), etc is ugly in comparison...
So please, nobody who worked on that code is allowed to use ugly as an
argument. Technical arguments are fine, so let's try to cover them.
BTW. without knowing much of either the GEM or the SPU subsystems, the
GEM problem seems similar to SPU. Did anyone look at that code? Was it ever
considered to make the object allocator be a filesystem? That way you could
control the backing store to the objects yourself, those that want pageable
memory could use the following allocator, the ioctls could go away,
you could create your own objects if needed before userspace is up...
---
Create a simple memory allocator which can page out objects when they are
not in use. Uses shmem for the main infrastructure (except in the nommu
case where it uses slab). The smallest unit of granularity is a page, so it
is not yet suitable for tiny objects.
The API allows creation and deletion of memory objects, pinning and
unpinning of address ranges within an object, mapping ranges of an object
in KVA, dirtying ranges of an object, and operating on pages within the
object.
Cc: keith.packard@intel.com, eric@anholt.net, hugh@veritas.com, hch@infradead.org, airlied@linux.ie, jbarnes@virtuousgeek.org, thomas@tungstengraphics.com, dri-devel@lists.sourceforge.net
---
Index: linux-2.6/include/linux/pageable_alloc.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/pageable_alloc.h
@@ -0,0 +1,112 @@
+#ifndef __MM_PAGEABLE_ALLOC_H__
+#define __MM_PAGEABLE_ALLOC_H__
+
+#include <linux/mm.h>
+
+struct pgobj;
+typedef struct pgobj pgobj_t;
+
+/**
+ * pageable_alloc_object - Allocate a pageable object
+ * @size: size in bytes
+ * @nid: preferred node, or -1 for default policy
+ * Returns: an object pointer, or IS_ERR pointer on fail
+ */
+pgobj_t *pageable_alloc_object(unsigned long size, int nid);
+
+/**
+ * pageable_free_object - Free a pageable object
+ * @object: object pointer
+ */
+void pageable_free_object(pgobj_t *object);
+
+/**
+ * pageable_pin_object - Pin an address range of a pageable object
+ * @object: object pointer
+ * @start: first byte in the object to be pinned
+ * @end: last byte in the object to be pinned (not inclusive)
+ *
+ * pageable_pin_object must be called before the memory range can be used in
+ * any way by the pageable object accessor functions. pageable_pin_object may
+ * have to swap pages in from disk. A successful call must be followed (at
+ * some point) by a call to pageable_unpin_object with the same range.
+ *
+ * Note: the end address is not inclusive, so a (0, 1) range is the first byte.
+ */
+int pageable_pin_object(pgobj_t *object, unsigned long start, unsigned long end);
+
+/**
+ * pageable_unpin_object - Unpin an address range of a pageable object
+ * @object: object pointer
+ * @start: first byte in the object to be unpinned
+ * @end: last byte in the object to be unpinned (not inclusive)
+ *
+ * Note: the end address is not inclusive, so a (0, 1) range is the first byte.
+ */
+void pageable_unpin_object(pgobj_t *object, unsigned long start, unsigned long end);
+
+/**
+ * pageable_dirty_object - Dirty an address range of a pageable object
+ * @object: object pointer
+ * @start: first byte in the object to be dirtied
+ * @end: last byte in the object to be dirtied (not inclusive)
+ *
+ * If a part of the memory of a pageable object is written to,
+ * pageable_dirty_object must be called on this range before it is unpinned.
+ *
+ * Note: the end address is not inclusive, so a (0, 1) range is the first byte.
+ */
+void pageable_dirty_object(pgobj_t *object, unsigned long start, unsigned long end);
+
+/**
+ * pageable_get_page - Get one page of a pageable object
+ * @object: object pointer
+ * @off: byte in the object containing the desired page
+ * Returns: page pointer requested
+ *
+ * Note: this does not increment the page refcount in any way; the page is
+ * already held by the preceding call to pageable_pin_object.
+ */
+struct page *pageable_get_page(pgobj_t *object, unsigned long off);
+
+/**
+ * pageable_dirty_page - Dirty one page of a pageable object
+ * @object: object pointer
+ * @page: page pointer returned by pageable_get_page
+ *
+ * Like pageable_dirty_object. If the page returned by pageable_get_page
+ * is dirtied, pageable_dirty_page must be called before it is unpinned.
+ */
+void pageable_dirty_page(pgobj_t *object, struct page *page);
+
+/**
+ * pageable_vmap_object - Map an address range of a pageable object
+ * @object: object pointer
+ * @start: first byte in the object to be mapped
+ * @end: last byte in the object to be mapped (not inclusive)
+ * Returns: kernel virtual address, NULL on memory allocation failure
+ *
+ * This maps a specified range of a pageable object into kernel virtual
+ * memory, where it can be treated and operated on as regular memory. It
+ * must be followed by a call to pageable_vunmap_object.
+ *
+ * Note: the end address is not inclusive, so a (0, 1) range is the first byte.
+ */
+void *pageable_vmap_object(pgobj_t *object, unsigned long start, unsigned long end);
+
+/**
+ * pageable_vunmap_object - Unmap an address range of a pageable object
+ * @object: object pointer
+ * @ptr: pointer returned by pageable_vmap_object
+ * @start: first byte in the object to be unmapped
+ * @end: last byte in the object to be unmapped (not inclusive)
+ *
+ * This unmaps a range of a pageable object previously mapped into kernel
+ * virtual memory by pageable_vmap_object. The start and end arguments must
+ * match those passed to the corresponding pageable_vmap_object call.
+ *
+ * Note: the end address is not inclusive, so a (0, 1) range is the first byte.
+ */
+void pageable_vunmap_object(pgobj_t *object, void *ptr, unsigned long start, unsigned long end);
+
+#endif
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o
maccess.o page_alloc.o page-writeback.o pdflush.o \
readahead.o swap.o truncate.o vmscan.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
- page_isolation.o mm_init.o $(mmu-y)
+ page_isolation.o mm_init.o pageable_alloc.o $(mmu-y)
obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
obj-$(CONFIG_BOUNCE) += bounce.o
Index: linux-2.6/mm/pageable_alloc.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/pageable_alloc.c
@@ -0,0 +1,260 @@
+/*
+ * Simple pageable memory allocator
+ *
+ * Copyright (C) 2008 Nick Piggin
+ * Copyright (C) 2008 Novell Inc.
+ */
+#include <linux/pageable_alloc.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/vmalloc.h>
+#include <linux/slab.h>
+#include <linux/radix-tree.h>
+
+#ifdef CONFIG_MMU
+struct pgobj {
+ struct file f;
+};
+
+pgobj_t *pageable_alloc_object(unsigned long size, int nid)
+{
+ struct file *filp;
+
+ filp = shmem_file_setup("pageable object", size, 0);
+
+ return (struct pgobj *)filp;
+}
+
+void pageable_free_object(pgobj_t *object)
+{
+ struct file *filp = (struct file *)object;
+
+ fput(filp);
+}
+
+int pageable_pin_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+ struct file *filp = (struct file *)object;
+ struct address_space *mapping = filp->f_dentry->d_inode->i_mapping;
+ pgoff_t first, last, i;
+ int err = 0;
+
+ BUG_ON(start >= end);
+
+ first = start / PAGE_SIZE;
+ last = DIV_ROUND_UP(end, PAGE_SIZE);
+
+ for (i = first; i < last; i++) {
+ struct page *page;
+
+ page = read_mapping_page(mapping, i, filp);
+ if (IS_ERR(page)) {
+ err = PTR_ERR(page);
+ goto out_error;
+ }
+ }
+
+ BUG_ON(err);
+ return 0;
+
+out_error:
+ if (i > first)
+ pageable_unpin_object(object, start, i * PAGE_SIZE);
+ return err;
+}
+
+void pageable_unpin_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+ struct file *filp = (struct file *)object;
+ struct address_space *mapping = filp->f_dentry->d_inode->i_mapping;
+ pgoff_t first, last, i;
+
+ BUG_ON(start >= end);
+
+ first = start / PAGE_SIZE;
+ last = DIV_ROUND_UP(end, PAGE_SIZE);
+
+ for (i = first; i < last; i++) {
+ struct page *page;
+
+ rcu_read_lock();
+ page = radix_tree_lookup(&mapping->page_tree, i);
+ rcu_read_unlock();
+ BUG_ON(!page);
+ BUG_ON(page_count(page) < 2);
+ page_cache_release(page);
+ }
+}
+
+void pageable_dirty_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+ struct file *filp = (struct file *)object;
+ struct address_space *mapping = filp->f_dentry->d_inode->i_mapping;
+ pgoff_t first, last, i;
+
+ BUG_ON(start >= end);
+
+ first = start / PAGE_SIZE;
+ last = DIV_ROUND_UP(end, PAGE_SIZE);
+
+ for (i = first; i < last; i++) {
+ struct page *page;
+
+ rcu_read_lock();
+ page = radix_tree_lookup(&mapping->page_tree, i);
+ rcu_read_unlock();
+ BUG_ON(!page);
+ BUG_ON(page_count(page) < 2);
+ set_page_dirty(page);
+ }
+}
+
+struct page *pageable_get_page(pgobj_t *object, unsigned long off)
+{
+ struct file *filp = (struct file *)object;
+ struct address_space *mapping = filp->f_dentry->d_inode->i_mapping;
+ struct page *page;
+
+ rcu_read_lock();
+ page = radix_tree_lookup(&mapping->page_tree, off / PAGE_SIZE);
+ rcu_read_unlock();
+
+ BUG_ON(!page);
+ BUG_ON(page_count(page) < 2);
+
+ return page;
+}
+
+void pageable_dirty_page(pgobj_t *object, struct page *page)
+{
+ BUG_ON(page_count(page) < 2);
+ set_page_dirty(page);
+}
+
+void *pageable_vmap_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+ struct file *filp = (struct file *)object;
+ struct address_space *mapping = filp->f_dentry->d_inode->i_mapping;
+ unsigned int offset = start & ~PAGE_CACHE_MASK;
+ pgoff_t first, last, i;
+ struct page **pages;
+ int nr;
+ void *ret;
+
+ BUG_ON(start >= end);
+
+ first = start / PAGE_SIZE;
+ last = DIV_ROUND_UP(end, PAGE_SIZE);
+ nr = last - first;
+
+#ifndef CONFIG_HIGHMEM
+ if (nr == 1) {
+ struct page *page;
+
+ rcu_read_lock();
+ page = radix_tree_lookup(&mapping->page_tree, first);
+ rcu_read_unlock();
+ BUG_ON(!page);
+ BUG_ON(page_count(page) < 2);
+
+ ret = page_address(page);
+
+ goto out;
+ }
+#endif
+
+ pages = kmalloc(sizeof(struct page *) * nr, GFP_KERNEL);
+ if (!pages)
+ return NULL;
+
+ for (i = first; i < last; i++) {
+ struct page *page;
+
+ rcu_read_lock();
+ page = radix_tree_lookup(&mapping->page_tree, i);
+ rcu_read_unlock();
+ BUG_ON(!page);
+ BUG_ON(page_count(page) < 2);
+
+ pages[i - first] = page;
+ }
+
+ ret = vmap(pages, nr, VM_MAP, PAGE_KERNEL);
+ kfree(pages);
+ if (!ret)
+ return NULL;
+
+#ifndef CONFIG_HIGHMEM
+out:
+#endif
+ return ret + offset;
+}
+
+void pageable_vunmap_object(pgobj_t *object, void *ptr, unsigned long start, unsigned long end)
+{
+#ifndef CONFIG_HIGHMEM
+ pgoff_t first, last;
+ int nr;
+
+ BUG_ON(start >= end);
+
+ first = start / PAGE_SIZE;
+ last = DIV_ROUND_UP(end, PAGE_SIZE);
+ nr = last - first;
+ if (nr == 1)
+ return;
+#endif
+
+ vunmap((void *)((unsigned long)ptr & PAGE_CACHE_MASK));
+}
+
+#else
+
+pgobj_t *pageable_alloc_object(unsigned long size, int nid)
+{
+ void *ret;
+
+ ret = kmalloc(size, GFP_KERNEL);
+ if (!ret)
+ return ERR_PTR(-ENOMEM);
+
+ return ret;
+}
+
+void pageable_free_object(pgobj_t *object)
+{
+ kfree(object);
+}
+
+int pageable_pin_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+ return 0;
+}
+
+void pageable_unpin_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+}
+
+void pageable_dirty_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+}
+
+struct page *pageable_get_page(pgobj_t *object, unsigned long off)
+{
+ void *ptr = object;
+ return virt_to_page(ptr + off);
+}
+
+void pageable_dirty_page(pgobj_t *object, struct page *page)
+{
+}
+
+void *pageable_vmap_object(pgobj_t *object, unsigned long start, unsigned long end)
+{
+ void *ptr = object;
+ return ptr + start;
+}
+
+void pageable_vunmap_object(pgobj_t *object, void *ptr, unsigned long start, unsigned long end)
+{
+}
+#endif
end of thread, other threads:[~2008-10-03 6:40 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-09-23 9:10 [patch] mm: pageable memory allocator (for DRM-GEM?) Nick Piggin
2008-09-23 10:21 ` Thomas Hellström
2008-09-23 11:31 ` Jerome Glisse
2008-09-23 13:18 ` Christoph Lameter
2008-09-25 0:18 ` Nick Piggin
2008-09-25 7:19 ` Thomas Hellström
2008-09-25 14:38 ` Keith Packard
2008-09-25 15:39 ` Thomas Hellström
2008-09-25 22:41 ` Dave Airlie
2008-09-23 15:50 ` Keith Packard
2008-09-23 18:29 ` Jerome Glisse
2008-09-25 0:30 ` Nick Piggin
2008-09-25 1:20 ` Keith Packard
2008-09-25 2:30 ` Nick Piggin
2008-09-25 2:43 ` Keith Packard
2008-09-25 3:07 ` Nick Piggin
2008-09-25 6:16 ` Keith Packard
2008-09-25 8:45 ` KAMEZAWA Hiroyuki
2008-09-30 1:10 ` Eric Anholt
2008-10-02 17:15 ` Jesse Barnes
2008-10-03 5:17 ` Keith Packard
2008-10-03 6:40 ` Nick Piggin
-- strict thread matches above, loose matches on Subject: below --
2008-09-23 9:10 Nick Piggin