* [RFC 00/16] Variable Order Page Cache Patchset V2
@ 2007-04-23 6:48 Christoph Lameter
2007-04-23 6:48 ` [RFC 01/16] Free up page->private for compound pages Christoph Lameter
` (17 more replies)
0 siblings, 18 replies; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:48 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Christoph Lameter, Dave Hansen,
Mel Gorman, Avi Kivity
Sorry for the earlier mail; quilt and exim were not cooperating.
RFC V1->V2
- Some ext2 support
- Some block layer, fs layer support etc.
- Better page cache macros
- Use macros to clean up code.
This patchset modifies the Linux kernel so that higher order page cache
pages become possible. The higher order page cache pages are compound pages
and can be handled in the same way as regular pages.
Rationales:
1. We have problems supporting devices with a blocksize larger than the
page size. This is for example important for supporting CDs and DVDs that
can only read and write 32k or 64k blocks. We currently have a shim
layer in there to deal with this situation, which limits the speed
of I/O. The developers are currently looking for ways to completely
bypass the page cache because of this deficiency.
2. 32k/64k blocksizes are also used in flash devices. Same issues.
3. Future hard disks will support bigger block sizes.
4. Performance. If we look at IA64 vs. x86_64 then it seems that the
faster interrupt handling on x86_64 compensates for the speed loss due to
a smaller page size (4k vs. 16k on IA64). Having higher page sizes on all
platforms allows a significant reduction in I/O overhead and increases the
size of I/O that can be performed by hardware in a single request,
since the number of scatter/gather entries is typically limited for
one request. This is going to become increasingly important to support
ever growing memory sizes, since we may have to handle excessively
large numbers of 4k requests for data sizes that may become common
soon. For example, to write a 1 terabyte file the kernel would have to
handle 256 million 4k chunks.
5. Cross-arch compatibility: It is currently not possible to mount
a 16k blocksize ext2 filesystem created on IA64 on an x86_64 system.
The support here is currently only for buffered I/O and only for two
filesystems: ramfs and ext2.
Note that the higher order pages are subject to reclaim. This works in general
since we are always operating on a single page struct. Reclaim is fooled into
thinking that it is touching page sized objects (there are likely issues to be
fixed there if we want to go down this road).
What is currently not supported:
- Mmapping higher order pages
- Direct I/O (there are some fundamental issues: direct I/O
puts compound pages that have to be treated as single pages
on the pagevecs, while the variable order page cache puts higher
order compound pages that have to be treated as a single large page
onto pagevecs).
Breakage:
- Reclaim does not work for some reason. Compound pages on the active
list get lost somehow.
- Disk data is corrupted when writing ext2fs data. There is likely
still a lot of work to do in the block layer.
- There is a lot of incomplete work. There are numerous places, not yet
fixed, where the kernel still assumes that the page cache consists
of PAGE_SIZE pages.
Future:
- Expect several more RFCs
- We hope for XFS support soon
- There are filesystem layer and lower layer issues here that I am not
that familiar with. If you can, please enhance my patches.
- Mmap support could be done in a way that makes the mmap page size
independent of the page cache order. There is no problem mapping a
4k section of a larger page cache page. This should leave mmap as is.
- Let's try to keep the scope as small as possible.
* [RFC 01/16] Free up page->private for compound pages
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
@ 2007-04-23 6:48 ` Christoph Lameter
2007-04-24 2:12 ` Dave Hansen
2007-04-25 10:55 ` Mel Gorman
2007-04-23 6:48 ` [RFC 02/16] vmstat.c: Support accounting " Christoph Lameter
` (16 subsequent siblings)
17 siblings, 2 replies; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:48 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Jens Axboe, David Chinner,
Badari Pulavarty, Adam Litke, Christoph Lameter, Avi Kivity,
Mel Gorman, Dave Hansen
[PATCH] Free up page->private for compound pages
If we add a new flag so that we can distinguish between the
first page and the tail pages then we can avoid using page->private
in the first page. page->private == page for the first page, so there
is no real information in there.
Freeing up page->private makes the use of compound pages more transparent.
They become more usable, like real pages. Right now we have to be careful, e.g.
if we are going beyond PAGE_SIZE allocations in the slab on i386, because we
can then no longer use the private field. This is one of the issues that
causes us not to support debugging for page size slabs in SLAB.
Also if page->private is available then a compound page may be equipped
with buffer heads. This may free up the way for filesystems to support
larger blocks than page size.
Note that this patch is different from the one in mm. The one in mm
uses PG_reclaim as PG_tail. We cannot reuse PG_reclaim since pages can
be reclaimed now. So use a separate page flag.
We allow compound page heads on pagevecs. That will break
direct I/O because direct I/O needs pagevecs to handle the component
pages and not the compound page as a whole. Ideas for a solution are
welcome. Maybe we should modify the direct I/O layer to not operate on
the individual pages but on the compound page as a whole.
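As a rough illustration (a sketch only, not part of the patch; example_page_order()
is a hypothetical caller), any code that may be handed a tail page can now find
the head and the allocation order without touching page->private directly:

	/*
	 * Sketch: resolve an arbitrary page to its compound head and
	 * report the allocation order (0 for an ordinary order-0 page).
	 */
	static inline int example_page_order(struct page *page)
	{
		struct page *head = compound_head(page); /* head == page unless PG_tail */

		return compound_order(head);
	}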
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
arch/ia64/mm/init.c | 2 +-
include/linux/mm.h | 32 ++++++++++++++++++++++++++------
include/linux/page-flags.h | 6 ++++++
mm/internal.h | 2 +-
mm/page_alloc.c | 35 +++++++++++++++++++++++++----------
mm/slab.c | 6 ++----
mm/swap.c | 20 ++++++++++++++++++--
7 files changed, 79 insertions(+), 24 deletions(-)
Index: linux-2.6.21-rc7/include/linux/mm.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/mm.h 2007-04-21 20:52:07.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/mm.h 2007-04-21 20:58:32.000000000 -0700
@@ -263,21 +263,24 @@ static inline int put_page_testzero(stru
*/
static inline int get_page_unless_zero(struct page *page)
{
- VM_BUG_ON(PageCompound(page));
return atomic_inc_not_zero(&page->_count);
}
+static inline struct page *compound_head(struct page *page)
+{
+ if (unlikely(PageTail(page)))
+ return (struct page *)page->private;
+ return page;
+}
+
static inline int page_count(struct page *page)
{
- if (unlikely(PageCompound(page)))
- page = (struct page *)page_private(page);
- return atomic_read(&page->_count);
+ return atomic_read(&compound_head(page)->_count);
}
static inline void get_page(struct page *page)
{
- if (unlikely(PageCompound(page)))
- page = (struct page *)page_private(page);
+ page = compound_head(page);
VM_BUG_ON(atomic_read(&page->_count) == 0);
atomic_inc(&page->_count);
}
@@ -314,6 +317,23 @@ static inline compound_page_dtor *get_co
return (compound_page_dtor *)page[1].lru.next;
}
+static inline int compound_order(struct page *page)
+{
+ if (!PageCompound(page) || PageTail(page))
+ return 0;
+ return (unsigned long)page[1].lru.prev;
+}
+
+static inline void set_compound_order(struct page *page, unsigned long order)
+{
+ page[1].lru.prev = (void *)order;
+}
+
+static inline int base_pages(struct page *page)
+{
+ return 1 << compound_order(page);
+}
+
/*
* Multiple processes may "see" the same page. E.g. for untouched
* mappings of /dev/null, all processes see the same page full of
Index: linux-2.6.21-rc7/include/linux/page-flags.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/page-flags.h 2007-04-21 20:52:07.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/page-flags.h 2007-04-21 20:52:15.000000000 -0700
@@ -91,6 +91,8 @@
#define PG_nosave_free 18 /* Used for system suspend/resume */
#define PG_buddy 19 /* Page is free, on buddy lists */
+#define PG_tail 20 /* Page is tail of a compound page */
+
/* PG_owner_priv_1 users should have descriptive aliases */
#define PG_checked PG_owner_priv_1 /* Used by some filesystems */
@@ -241,6 +243,10 @@ static inline void SetPageUptodate(struc
#define __SetPageCompound(page) __set_bit(PG_compound, &(page)->flags)
#define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
+#define PageTail(page) test_bit(PG_tail, &(page)->flags)
+#define __SetPageTail(page) __set_bit(PG_tail, &(page)->flags)
+#define __ClearPageTail(page) __clear_bit(PG_tail, &(page)->flags)
+
#ifdef CONFIG_SWAP
#define PageSwapCache(page) test_bit(PG_swapcache, &(page)->flags)
#define SetPageSwapCache(page) set_bit(PG_swapcache, &(page)->flags)
Index: linux-2.6.21-rc7/mm/internal.h
===================================================================
--- linux-2.6.21-rc7.orig/mm/internal.h 2007-04-21 20:52:07.000000000 -0700
+++ linux-2.6.21-rc7/mm/internal.h 2007-04-21 20:52:15.000000000 -0700
@@ -24,7 +24,7 @@ static inline void set_page_count(struct
*/
static inline void set_page_refcounted(struct page *page)
{
- VM_BUG_ON(PageCompound(page) && page_private(page) != (unsigned long)page);
+ VM_BUG_ON(PageTail(page));
VM_BUG_ON(atomic_read(&page->_count));
set_page_count(page, 1);
}
Index: linux-2.6.21-rc7/mm/page_alloc.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/page_alloc.c 2007-04-21 20:52:07.000000000 -0700
+++ linux-2.6.21-rc7/mm/page_alloc.c 2007-04-21 20:58:32.000000000 -0700
@@ -227,7 +227,7 @@ static void bad_page(struct page *page)
static void free_compound_page(struct page *page)
{
- __free_pages_ok(page, (unsigned long)page[1].lru.prev);
+ __free_pages_ok(page, compound_order(page));
}
static void prep_compound_page(struct page *page, unsigned long order)
@@ -236,12 +236,14 @@ static void prep_compound_page(struct pa
int nr_pages = 1 << order;
set_compound_page_dtor(page, free_compound_page);
- page[1].lru.prev = (void *)order;
- for (i = 0; i < nr_pages; i++) {
+ set_compound_order(page, order);
+ __SetPageCompound(page);
+ for (i = 1; i < nr_pages; i++) {
struct page *p = page + i;
+ __SetPageTail(p);
__SetPageCompound(p);
- set_page_private(p, (unsigned long)page);
+ p->private = (unsigned long)page;
}
}
@@ -250,15 +252,19 @@ static void destroy_compound_page(struct
int i;
int nr_pages = 1 << order;
- if (unlikely((unsigned long)page[1].lru.prev != order))
+ if (unlikely(compound_order(page) != order))
bad_page(page);
- for (i = 0; i < nr_pages; i++) {
+ if (unlikely(!PageCompound(page)))
+ bad_page(page);
+ __ClearPageCompound(page);
+ for (i = 1; i < nr_pages; i++) {
struct page *p = page + i;
- if (unlikely(!PageCompound(p) |
- (page_private(p) != (unsigned long)page)))
+ if (unlikely(!PageCompound(p) | !PageTail(p) |
+ ((struct page *)p->private != page)))
bad_page(page);
+ __ClearPageTail(p);
__ClearPageCompound(p);
}
}
@@ -1438,8 +1444,17 @@ void __pagevec_free(struct pagevec *pvec
{
int i = pagevec_count(pvec);
- while (--i >= 0)
- free_hot_cold_page(pvec->pages[i], pvec->cold);
+ while (--i >= 0) {
+ struct page *page = pvec->pages[i];
+
+ if (PageCompound(page)) {
+ compound_page_dtor *dtor;
+
+ dtor = get_compound_page_dtor(page);
+ (*dtor)(page);
+ } else
+ free_hot_cold_page(page, pvec->cold);
+ }
}
fastcall void __free_pages(struct page *page, unsigned int order)
Index: linux-2.6.21-rc7/mm/slab.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/slab.c 2007-04-21 20:52:07.000000000 -0700
+++ linux-2.6.21-rc7/mm/slab.c 2007-04-21 20:52:15.000000000 -0700
@@ -592,8 +592,7 @@ static inline void page_set_cache(struct
static inline struct kmem_cache *page_get_cache(struct page *page)
{
- if (unlikely(PageCompound(page)))
- page = (struct page *)page_private(page);
+ page = compound_head(page);
BUG_ON(!PageSlab(page));
return (struct kmem_cache *)page->lru.next;
}
@@ -605,8 +604,7 @@ static inline void page_set_slab(struct
static inline struct slab *page_get_slab(struct page *page)
{
- if (unlikely(PageCompound(page)))
- page = (struct page *)page_private(page);
+ page = compound_head(page);
BUG_ON(!PageSlab(page));
return (struct slab *)page->lru.prev;
}
Index: linux-2.6.21-rc7/mm/swap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/swap.c 2007-04-21 20:52:07.000000000 -0700
+++ linux-2.6.21-rc7/mm/swap.c 2007-04-21 21:02:59.000000000 -0700
@@ -55,7 +55,7 @@ static void fastcall __page_cache_releas
static void put_compound_page(struct page *page)
{
- page = (struct page *)page_private(page);
+ page = compound_head(page);
if (put_page_testzero(page)) {
compound_page_dtor *dtor;
@@ -263,7 +263,23 @@ void release_pages(struct page **pages,
for (i = 0; i < nr; i++) {
struct page *page = pages[i];
- if (unlikely(PageCompound(page))) {
+ /*
+ * There is a conflict here between handling a compound
+ * page as a single big page or a set of smaller pages.
+ *
+ * Direct I/O wants us to treat them separately. Variable
+ * Page Size support means we need to treat then as
+ * a single unit.
+ *
+ * So we compromise here. Tail pages are handled as a
+ * single page (for direct I/O) but head pages are
+ * handled as full pages (for Variable Page Size
+ * Support).
+ *
+ * FIXME: That breaks direct I/O for the head page.
+ */
+ if (unlikely(PageTail(page))) {
+ /* Must treat as a single page */
if (zone) {
spin_unlock_irq(&zone->lru_lock);
zone = NULL;
Index: linux-2.6.21-rc7/arch/ia64/mm/init.c
===================================================================
--- linux-2.6.21-rc7.orig/arch/ia64/mm/init.c 2007-04-21 20:52:07.000000000 -0700
+++ linux-2.6.21-rc7/arch/ia64/mm/init.c 2007-04-21 20:52:15.000000000 -0700
@@ -121,7 +121,7 @@ lazy_mmu_prot_update (pte_t pte)
return; /* i-cache is already coherent with d-cache */
if (PageCompound(page)) {
- order = (unsigned long) (page[1].lru.prev);
+ order = compound_order(page);
flush_icache_range(addr, addr + (1UL << order << PAGE_SHIFT));
}
else
* [RFC 02/16] vmstat.c: Support accounting for compound pages
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
2007-04-23 6:48 ` [RFC 01/16] Free up page->private for compound pages Christoph Lameter
@ 2007-04-23 6:48 ` Christoph Lameter
2007-04-25 10:59 ` Mel Gorman
2007-04-23 6:49 ` [RFC 03/16] Variable Order Page Cache: Add order field in mapping Christoph Lameter
` (15 subsequent siblings)
17 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:48 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Christoph Lameter, Dave Hansen,
Mel Gorman, Avi Kivity
Compound pages must increment the counters in terms of base pages.
If we detect a compound page then add the number of base pages that
a compound page has to the counter.
This will avoid numerous changes in the VM to fix up page accounting
as we add more support for compound pages.
Also fix up the accounting for active / inactive pages.
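For illustration (assuming a 4k base page size): an order-2 page cache page
covers 16k, so base_pages(page) == 4, and adding that single compound page to
the active list now adjusts NR_ACTIVE by 4 base pages rather than by 1.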
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/mm_inline.h | 12 ++++++------
mm/vmstat.c | 8 +++-----
2 files changed, 9 insertions(+), 11 deletions(-)
Index: linux-2.6.21-rc7/mm/vmstat.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/vmstat.c 2007-04-21 23:35:49.000000000 -0700
+++ linux-2.6.21-rc7/mm/vmstat.c 2007-04-21 23:35:59.000000000 -0700
@@ -223,7 +223,7 @@ void __inc_zone_state(struct zone *zone,
void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
{
- __inc_zone_state(page_zone(page), item);
+ __mod_zone_page_state(page_zone(page), item, base_pages(page));
}
EXPORT_SYMBOL(__inc_zone_page_state);
@@ -244,7 +244,7 @@ void __dec_zone_state(struct zone *zone,
void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
{
- __dec_zone_state(page_zone(page), item);
+ __mod_zone_page_state(page_zone(page), item, -base_pages(page));
}
EXPORT_SYMBOL(__dec_zone_page_state);
@@ -260,11 +260,9 @@ void inc_zone_state(struct zone *zone, e
void inc_zone_page_state(struct page *page, enum zone_stat_item item)
{
unsigned long flags;
- struct zone *zone;
- zone = page_zone(page);
local_irq_save(flags);
- __inc_zone_state(zone, item);
+ __inc_zone_page_state(page, item);
local_irq_restore(flags);
}
EXPORT_SYMBOL(inc_zone_page_state);
Index: linux-2.6.21-rc7/include/linux/mm_inline.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/mm_inline.h 2007-04-22 00:20:15.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/mm_inline.h 2007-04-22 00:21:12.000000000 -0700
@@ -2,28 +2,28 @@ static inline void
add_page_to_active_list(struct zone *zone, struct page *page)
{
list_add(&page->lru, &zone->active_list);
- __inc_zone_state(zone, NR_ACTIVE);
+ __inc_zone_page_state(page, NR_ACTIVE);
}
static inline void
add_page_to_inactive_list(struct zone *zone, struct page *page)
{
list_add(&page->lru, &zone->inactive_list);
- __inc_zone_state(zone, NR_INACTIVE);
+ __inc_zone_page_state(page, NR_INACTIVE);
}
static inline void
del_page_from_active_list(struct zone *zone, struct page *page)
{
list_del(&page->lru);
- __dec_zone_state(zone, NR_ACTIVE);
+ __dec_zone_page_state(page, NR_ACTIVE);
}
static inline void
del_page_from_inactive_list(struct zone *zone, struct page *page)
{
list_del(&page->lru);
- __dec_zone_state(zone, NR_INACTIVE);
+ __dec_zone_page_state(page, NR_INACTIVE);
}
static inline void
@@ -32,9 +32,9 @@ del_page_from_lru(struct zone *zone, str
list_del(&page->lru);
if (PageActive(page)) {
__ClearPageActive(page);
- __dec_zone_state(zone, NR_ACTIVE);
+ __dec_zone_page_state(page, NR_ACTIVE);
} else {
- __dec_zone_state(zone, NR_INACTIVE);
+ __dec_zone_page_state(page, NR_INACTIVE);
}
}
* [RFC 03/16] Variable Order Page Cache: Add order field in mapping
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
2007-04-23 6:48 ` [RFC 01/16] Free up page->private for compound pages Christoph Lameter
2007-04-23 6:48 ` [RFC 02/16] vmstat.c: Support accounting " Christoph Lameter
@ 2007-04-23 6:49 ` Christoph Lameter
2007-04-25 11:05 ` Mel Gorman
2007-04-23 6:49 ` [RFC 04/16] Variable Order Page Cache: Add basic allocation functions Christoph Lameter
` (14 subsequent siblings)
17 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:49 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Jens Axboe, David Chinner,
Badari Pulavarty, Adam Litke, Christoph Lameter, Avi Kivity,
Mel Gorman, Dave Hansen
Variable Order Page Cache: Add order field in mapping
Add an "order" field in the address space structure that
specifies the page order of pages in an address space.
Set the field to zero by default so that filesystems not prepared to
deal with higher order pages can be left as is.
Putting the page order in the address space structure means that the order of
the pages in the page cache can be varied per file that a filesystem creates.
This means we can keep small 4k pages for small files. Larger files can
be configured by the filesystem to use a higher order.
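As a sketch of how a filesystem might use this (hypothetical code, not part of
the patch; the set_mapping_order() helper that also adjusts the gfp mask is
introduced in the next patch):

	/*
	 * Sketch: pick 16k page cache pages (order 2 with a 4k base page)
	 * for files expected to be large, plain 4k pages otherwise.
	 */
	static void example_setup_mapping(struct inode *inode, loff_t expected_size)
	{
		if (expected_size >= 1024 * 1024)
			set_mapping_order(inode->i_mapping, 2);
		else
			set_mapping_order(inode->i_mapping, 0);
	}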
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/inode.c | 1 +
include/linux/fs.h | 1 +
2 files changed, 2 insertions(+)
Index: linux-2.6.21-rc7/fs/inode.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/inode.c 2007-04-18 21:21:56.000000000 -0700
+++ linux-2.6.21-rc7/fs/inode.c 2007-04-18 21:26:31.000000000 -0700
@@ -145,6 +145,7 @@ static struct inode *alloc_inode(struct
mapping->a_ops = &empty_aops;
mapping->host = inode;
mapping->flags = 0;
+ mapping->order = 0;
mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
Index: linux-2.6.21-rc7/include/linux/fs.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/fs.h 2007-04-18 21:21:56.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/fs.h 2007-04-18 21:26:31.000000000 -0700
@@ -435,6 +435,7 @@ struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
rwlock_t tree_lock; /* and rwlock protecting it */
+ unsigned int order; /* Page order in this space */
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
* [RFC 04/16] Variable Order Page Cache: Add basic allocation functions
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
` (2 preceding siblings ...)
2007-04-23 6:49 ` [RFC 03/16] Variable Order Page Cache: Add order field in mapping Christoph Lameter
@ 2007-04-23 6:49 ` Christoph Lameter
2007-04-23 6:49 ` [RFC 05/16] Variable Order Page Cache: Add functions to establish sizes Christoph Lameter
` (13 subsequent siblings)
17 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:49 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Christoph Lameter, Dave Hansen,
Mel Gorman, Avi Kivity
Variable Order Page Cache: Add basic allocation functions
Extend __page_cache_alloc to take an order parameter and modify caller
sites. Modify mapping_set_gfp_mask to set __GFP_COMP if the mapping
requires higher order allocations.
put_page() is already capable of handling compound pages. So there are no
changes needed to release higher order page cache pages.
However, there is a call to "alloc_page" in mm/filemap.c that does not
perform an allocation conformant with the parameters of the mapping.
Fix that by introducing a new page_cache_alloc_mask() function that
is capable of taking additional gfp_t flags.
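As a sketch of the resulting call pattern (hypothetical caller, not part of the
patch), an allocation for a mapping now picks up both the gfp mask and the
order from the address_space:

	/* Sketch: allocate one page cache page of the mapping's configured order. */
	static struct page *example_alloc_for_mapping(struct address_space *mapping)
	{
		/* Equivalent to page_cache_alloc(mapping), shown expanded for clarity. */
		return __page_cache_alloc(mapping_gfp_mask(mapping), mapping->order);
	}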
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/pagemap.h | 34 ++++++++++++++++++++++++++++------
mm/filemap.c | 12 +++++++-----
2 files changed, 35 insertions(+), 11 deletions(-)
Index: linux-2.6.21-rc7/include/linux/pagemap.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-22 21:47:47.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-22 21:52:37.000000000 -0700
@@ -3,6 +3,9 @@
/*
* Copyright 1995 Linus Torvalds
+ *
+ * (C) 2007 sgi, Christoph Lameter <clameter@sgi.com>
+ * Add variable order page cache support.
*/
#include <linux/mm.h>
#include <linux/fs.h>
@@ -32,6 +35,18 @@ static inline void mapping_set_gfp_mask(
{
m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) |
(__force unsigned long)mask;
+ if (m->order)
+ m->flags |= __GFP_COMP;
+}
+
+static inline void set_mapping_order(struct address_space *m, int order)
+{
+ m->order = order;
+
+ if (order)
+ m->flags |= __GFP_COMP;
+ else
+ m->flags &= ~__GFP_COMP;
}
/*
@@ -40,7 +55,7 @@ static inline void mapping_set_gfp_mask(
* throughput (it can then be mapped into user
* space in smaller chunks for same flexibility).
*
- * Or rather, it _will_ be done in larger chunks.
+ * This is the base page size
*/
#define PAGE_CACHE_SHIFT PAGE_SHIFT
#define PAGE_CACHE_SIZE PAGE_SIZE
@@ -52,22 +67,29 @@ static inline void mapping_set_gfp_mask(
void release_pages(struct page **pages, int nr, int cold);
#ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(gfp_t gfp, int order);
#else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(gfp_t gfp, int order)
{
- return alloc_pages(gfp, 0);
+ return alloc_pages(gfp, order);
}
#endif
+static inline struct page *page_cache_alloc_mask(struct address_space *x,
+ gfp_t flags)
+{
+ return __page_cache_alloc(mapping_gfp_mask(x) | flags,
+ x->order);
+}
+
static inline struct page *page_cache_alloc(struct address_space *x)
{
- return __page_cache_alloc(mapping_gfp_mask(x));
+ return page_cache_alloc_mask(x, 0);
}
static inline struct page *page_cache_alloc_cold(struct address_space *x)
{
- return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+ return page_cache_alloc_mask(x, __GFP_COLD);
}
typedef int filler_t(void *, struct page *);
Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-22 21:47:47.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c 2007-04-22 21:54:00.000000000 -0700
@@ -467,13 +467,13 @@ int add_to_page_cache_lru(struct page *p
}
#ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc(gfp_t gfp, int order)
{
if (cpuset_do_page_mem_spread()) {
int n = cpuset_mem_spread_node();
- return alloc_pages_node(n, gfp, 0);
+ return alloc_pages_node(n, gfp, order);
}
- return alloc_pages(gfp, 0);
+ return alloc_pages(gfp, order);
}
EXPORT_SYMBOL(__page_cache_alloc);
#endif
@@ -670,7 +670,8 @@ repeat:
page = find_lock_page(mapping, index);
if (!page) {
if (!cached_page) {
- cached_page = alloc_page(gfp_mask);
+ cached_page =
+ page_cache_alloc_mask(mapping, gfp_mask);
if (!cached_page)
return NULL;
}
@@ -803,7 +804,8 @@ grab_cache_page_nowait(struct address_sp
page_cache_release(page);
return NULL;
}
- page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
+ page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS,
+ mapping->order);
if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
page_cache_release(page);
page = NULL;
* [RFC 05/16] Variable Order Page Cache: Add functions to establish sizes
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
` (3 preceding siblings ...)
2007-04-23 6:49 ` [RFC 04/16] Variable Order Page Cache: Add basic allocation functions Christoph Lameter
@ 2007-04-23 6:49 ` Christoph Lameter
2007-04-25 11:20 ` Mel Gorman
2007-04-23 6:49 ` [RFC 06/16] Variable Page Cache: Add VM_BUG_ONs to check for correct page order Christoph Lameter
` (12 subsequent siblings)
17 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:49 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Jens Axboe, David Chinner,
Badari Pulavarty, Adam Litke, Christoph Lameter, Avi Kivity,
Mel Gorman, Dave Hansen
Variable Order Page Cache: Add functions to establish sizes
We use the macros PAGE_CACHE_SIZE, PAGE_CACHE_SHIFT, PAGE_CACHE_MASK
and PAGE_CACHE_ALIGN in various places in the kernel. These now refer to
the base page size, but we do not have a means of calculating these
values for higher order pages.
Provide such functions. An address_space pointer must be passed
to them. Also add a set of extended functions that will be used
to consolidate the hand-crafted shifts and adds in use right
now for the page cache.
New function                     Related base page constant / purpose
---------------------------------------------------------------------
page_cache_shift(a)              PAGE_CACHE_SHIFT
page_cache_size(a)               PAGE_CACHE_SIZE
page_cache_mask(a)               PAGE_CACHE_MASK
page_cache_index(a, pos)         Calculate page number from position
page_cache_next(a, pos)          Page number of the next page
page_cache_offset(a, pos)        Calculate offset into a page
page_cache_pos(a, index, offset) Form position based on page number
                                 and an offset
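A worked example (assuming a 4k base page and an order-2 mapping, i.e. 16k
page cache pages) for pos = 100000:

	page_cache_shift(a)          == 14
	page_cache_size(a)           == 16384
	page_cache_index(a, 100000)  == 6      (100000 >> 14)
	page_cache_offset(a, 100000) == 1696   (100000 - 6 * 16384)
	page_cache_pos(a, 6, 1696)   == 100000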
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/pagemap.h | 42 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)
Index: linux-2.6.21-rc7/include/linux/pagemap.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-22 17:30:50.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-22 19:44:12.000000000 -0700
@@ -62,6 +62,48 @@ static inline void set_mapping_order(str
#define PAGE_CACHE_MASK PAGE_MASK
#define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
+static inline int page_cache_shift(struct address_space *a)
+{
+ return a->order + PAGE_SHIFT;
+}
+
+static inline unsigned int page_cache_size(struct address_space *a)
+{
+ return PAGE_SIZE << a->order;
+}
+
+static inline loff_t page_cache_mask(struct address_space *a)
+{
+ return (loff_t)PAGE_MASK << a->order;
+}
+
+static inline unsigned int page_cache_offset(struct address_space *a,
+ loff_t pos)
+{
+ return pos & ~(PAGE_MASK << a->order);
+}
+
+static inline pgoff_t page_cache_index(struct address_space *a,
+ loff_t pos)
+{
+ return pos >> page_cache_shift(a);
+}
+
+/*
+ * Index of the page starting on or after the given position.
+ */
+static inline pgoff_t page_cache_next(struct address_space *a,
+ loff_t pos)
+{
+ return page_cache_index(a, pos + page_cache_size(a) - 1);
+}
+
+static inline loff_t page_cache_pos(struct address_space *a,
+ pgoff_t index, unsigned long offset)
+{
+ return ((loff_t)index << page_cache_shift(a)) + offset;
+}
+
#define page_cache_get(page) get_page(page)
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);
* [RFC 06/16] Variable Page Cache: Add VM_BUG_ONs to check for correct page order
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
` (4 preceding siblings ...)
2007-04-23 6:49 ` [RFC 05/16] Variable Order Page Cache: Add functions to establish sizes Christoph Lameter
@ 2007-04-23 6:49 ` Christoph Lameter
2007-04-25 11:22 ` Mel Gorman
2007-04-23 6:49 ` [RFC 07/16] Variable Order Page Cache: Add clearing and flushing function Christoph Lameter
` (11 subsequent siblings)
17 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:49 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Christoph Lameter, Dave Hansen,
Mel Gorman, Avi Kivity
Variable Page Cache: Add VM_BUG_ONs to check for correct page order
Before we start changing the page order we had better get some debugging
in there that trips whenever a page of the wrong order shows up in a
mapping. This will be helpful when converting new filesystems to
higher orders.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/filemap.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-22 21:54:00.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c 2007-04-22 21:59:15.000000000 -0700
@@ -127,6 +127,7 @@ void remove_from_page_cache(struct page
struct address_space *mapping = page->mapping;
BUG_ON(!PageLocked(page));
+ VM_BUG_ON(mapping->order != compound_order(page));
write_lock_irq(&mapping->tree_lock);
__remove_from_page_cache(page);
@@ -268,6 +269,7 @@ int wait_on_page_writeback_range(struct
if (page->index > end)
continue;
+ VM_BUG_ON(mapping->order != compound_order(page));
wait_on_page_writeback(page);
if (PageError(page))
ret = -EIO;
@@ -439,6 +441,7 @@ int add_to_page_cache(struct page *page,
{
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+ VM_BUG_ON(mapping->order != compound_order(page));
if (error == 0) {
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
@@ -598,8 +601,10 @@ struct page * find_get_page(struct addre
read_lock_irq(&mapping->tree_lock);
page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page)
+ if (page) {
+ VM_BUG_ON(mapping->order != compound_order(page));
page_cache_get(page);
+ }
read_unlock_irq(&mapping->tree_lock);
return page;
}
@@ -624,6 +629,7 @@ struct page *find_lock_page(struct addre
repeat:
page = radix_tree_lookup(&mapping->page_tree, offset);
if (page) {
+ VM_BUG_ON(mapping->order != compound_order(page));
page_cache_get(page);
if (TestSetPageLocked(page)) {
read_unlock_irq(&mapping->tree_lock);
@@ -683,6 +689,7 @@ repeat:
} else if (err == -EEXIST)
goto repeat;
}
+ VM_BUG_ON(mapping->order != compound_order(page));
if (cached_page)
page_cache_release(cached_page);
return page;
@@ -714,8 +721,10 @@ unsigned find_get_pages(struct address_s
read_lock_irq(&mapping->tree_lock);
ret = radix_tree_gang_lookup(&mapping->page_tree,
(void **)pages, start, nr_pages);
- for (i = 0; i < ret; i++)
+ for (i = 0; i < ret; i++) {
+ VM_BUG_ON(mapping->order != compound_order(pages[i]));
page_cache_get(pages[i]);
+ }
read_unlock_irq(&mapping->tree_lock);
return ret;
}
@@ -745,6 +754,7 @@ unsigned find_get_pages_contig(struct ad
if (pages[i]->mapping == NULL || pages[i]->index != index)
break;
+ VM_BUG_ON(mapping->order != compound_order(pages[i]));
page_cache_get(pages[i]);
index++;
}
@@ -772,8 +782,10 @@ unsigned find_get_pages_tag(struct addre
read_lock_irq(&mapping->tree_lock);
ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
(void **)pages, *index, nr_pages, tag);
- for (i = 0; i < ret; i++)
+ for (i = 0; i < ret; i++) {
+ VM_BUG_ON(mapping->order != compound_order(pages[i]));
page_cache_get(pages[i]);
+ }
if (ret)
*index = pages[ret - 1]->index + 1;
read_unlock_irq(&mapping->tree_lock);
@@ -2454,6 +2466,7 @@ int try_to_release_page(struct page *pag
struct address_space * const mapping = page->mapping;
BUG_ON(!PageLocked(page));
+ VM_BUG_ON(mapping->order != compound_order(page));
if (PageWriteback(page))
return 0;
* [RFC 07/16] Variable Order Page Cache: Add clearing and flushing function
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
` (5 preceding siblings ...)
2007-04-23 6:49 ` [RFC 06/16] Variable Page Cache: Add VM_BUG_ONs to check for correct page order Christoph Lameter
@ 2007-04-23 6:49 ` Christoph Lameter
2007-04-23 6:49 ` [RFC 08/16] Variable Order Page Cache: Fixup fallback functions Christoph Lameter
` (10 subsequent siblings)
17 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:49 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Jens Axboe, David Chinner,
Badari Pulavarty, Adam Litke, Christoph Lameter, Avi Kivity,
Mel Gorman, Dave Hansen
Variable Order Page Cache: Add clearing and flushing function
Add a flushing and a clearing function for higher order pages.
These are provisional and will likely have to be optimized.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/pagemap.h | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
Index: linux-2.6.21-rc7/include/linux/pagemap.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-22 17:37:24.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-22 17:37:39.000000000 -0700
@@ -250,6 +250,31 @@ static inline void wait_on_page_writebac
extern void end_page_writeback(struct page *page);
+/* Support for clearing higher order pages */
+static inline void clear_mapping_page(struct page *page)
+{
+ int nr_pages = base_pages(page);
+ int i;
+
+ for (i = 0; i < nr_pages; i++)
+ clear_highpage(page + i);
+}
+
+/*
+ * Support for flushing higher order pages.
+ *
+ * A bit stupid: On many platforms flushing the first page
+ * will flush any TLB starting there
+ */
+static inline void flush_mapping_page(struct page *page)
+{
+ int nr_pages = base_pages(page);
+ int i;
+
+ for (i = 0; i < nr_pages; i++)
+ flush_dcache_page(page + i);
+}
+
/*
* Fault a userspace page into pagetables. Return non-zero on a fault.
*
* [RFC 08/16] Variable Order Page Cache: Fixup fallback functions
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
` (6 preceding siblings ...)
2007-04-23 6:49 ` [RFC 07/16] Variable Order Page Cache: Add clearing and flushing function Christoph Lameter
@ 2007-04-23 6:49 ` Christoph Lameter
2007-04-23 6:49 ` [RFC 09/16] Variable Order Page Cache: Fix up mm/filemap.c Christoph Lameter
` (9 subsequent siblings)
17 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:49 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Christoph Lameter, Dave Hansen,
Mel Gorman, Avi Kivity
Variable Order Page Cache: Fixup fallback functions
Fix up the fallback functions in fs/libfs.c to be able to handle
higher order page cache pages.
FIXME: There is a use of kmap here that we leave unchanged
(none of my testing platforms use highmem). There needs to
be some way to clear higher order partial pages if a platform
supports HIGHMEM.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/libfs.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
Index: linux-2.6.21-rc7/fs/libfs.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/libfs.c 2007-04-22 17:28:04.000000000 -0700
+++ linux-2.6.21-rc7/fs/libfs.c 2007-04-22 17:38:58.000000000 -0700
@@ -320,8 +320,8 @@ int simple_rename(struct inode *old_dir,
int simple_readpage(struct file *file, struct page *page)
{
- clear_highpage(page);
- flush_dcache_page(page);
+ clear_mapping_page(page);
+ flush_mapping_page(page);
SetPageUptodate(page);
unlock_page(page);
return 0;
@@ -331,11 +331,15 @@ int simple_prepare_write(struct file *fi
unsigned from, unsigned to)
{
if (!PageUptodate(page)) {
- if (to - from != PAGE_CACHE_SIZE) {
+ if (to - from != page_cache_size(file->f_mapping)) {
+ /*
+ * Mapping to higher order pages need to be supported
+ * if higher order pages can be in highmem
+ */
void *kaddr = kmap_atomic(page, KM_USER0);
memset(kaddr, 0, from);
- memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
- flush_dcache_page(page);
+ memset(kaddr + to, 0, page_cache_size(file->f_mapping) - to);
+ flush_mapping_page(page);
kunmap_atomic(kaddr, KM_USER0);
}
}
@@ -345,8 +349,9 @@ int simple_prepare_write(struct file *fi
int simple_commit_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- struct inode *inode = page->mapping->host;
- loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
+ loff_t pos = page_cache_pos(mapping, page->index, to);
if (!PageUptodate(page))
SetPageUptodate(page);
* [RFC 09/16] Variable Order Page Cache: Fix up mm/filemap.c
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
` (7 preceding siblings ...)
2007-04-23 6:49 ` [RFC 08/16] Variable Order Page Cache: Fixup fallback functions Christoph Lameter
@ 2007-04-23 6:49 ` Christoph Lameter
2007-04-23 6:49 ` [RFC 10/16] Variable Order Page Cache: Readahead fixups Christoph Lameter
` (8 subsequent siblings)
17 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:49 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Jens Axboe, David Chinner,
Badari Pulavarty, Adam Litke, Christoph Lameter, Avi Kivity,
Mel Gorman, Dave Hansen
Variable Order Page Cache: Fix up mm/filemap.c
Fix up the functions in mm/filemap.c to use the variable order page cache.
Like many of the following patches this is pretty straightforward:
1. Convert the open-coded shifts and masks into calls to page_cache_xxx(mapping, ...)
2. Use the mapping flush function
Doing this also cleans up the handling of page cache pages, as the sketch
below illustrates.
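A sketch of the typical before/after transformation (illustrative only,
mirroring the hunks below):

	/* Before: hard-coded base page geometry */
	index = pos >> PAGE_CACHE_SHIFT;
	offset = pos & ~PAGE_CACHE_MASK;
	bytes = PAGE_CACHE_SIZE - offset;

	/* After: geometry taken from the mapping */
	index = page_cache_index(mapping, pos);
	offset = page_cache_offset(mapping, pos);
	bytes = page_cache_size(mapping) - offset;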
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/filemap.c | 62 +++++++++++++++++++++++++++++------------------------------
1 file changed, 31 insertions(+), 31 deletions(-)
Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-22 21:59:15.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c 2007-04-22 22:03:09.000000000 -0700
@@ -304,8 +304,8 @@ int wait_on_page_writeback_range(struct
int sync_page_range(struct inode *inode, struct address_space *mapping,
loff_t pos, loff_t count)
{
- pgoff_t start = pos >> PAGE_CACHE_SHIFT;
- pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
+ pgoff_t start = page_cache_index(mapping, pos);
+ pgoff_t end = page_cache_index(mapping, pos + count - 1);
int ret;
if (!mapping_cap_writeback_dirty(mapping) || !count)
@@ -336,8 +336,8 @@ EXPORT_SYMBOL(sync_page_range);
int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
loff_t pos, loff_t count)
{
- pgoff_t start = pos >> PAGE_CACHE_SHIFT;
- pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
+ pgoff_t start = page_cache_index(mapping, pos);
+ pgoff_t end = page_cache_index(mapping, pos + count - 1);
int ret;
if (!mapping_cap_writeback_dirty(mapping) || !count)
@@ -366,7 +366,7 @@ int filemap_fdatawait(struct address_spa
return 0;
return wait_on_page_writeback_range(mapping, 0,
- (i_size - 1) >> PAGE_CACHE_SHIFT);
+ page_cache_index(mapping, i_size - 1));
}
EXPORT_SYMBOL(filemap_fdatawait);
@@ -414,8 +414,8 @@ int filemap_write_and_wait_range(struct
/* See comment of filemap_write_and_wait() */
if (err != -EIO) {
int err2 = wait_on_page_writeback_range(mapping,
- lstart >> PAGE_CACHE_SHIFT,
- lend >> PAGE_CACHE_SHIFT);
+ page_cache_index(mapping, lstart),
+ page_cache_index(mapping, lend));
if (!err)
err = err2;
}
@@ -888,27 +888,27 @@ void do_generic_mapping_read(struct addr
struct file_ra_state ra = *_ra;
cached_page = NULL;
- index = *ppos >> PAGE_CACHE_SHIFT;
+ index = page_cache_index(mapping, *ppos);
next_index = index;
prev_index = ra.prev_page;
- last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
- offset = *ppos & ~PAGE_CACHE_MASK;
+ last_index = page_cache_next(mapping, *ppos + desc->count);
+ offset = page_cache_offset(mapping, *ppos);
isize = i_size_read(inode);
if (!isize)
goto out;
- end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+ end_index = page_cache_index(mapping, isize - 1);
for (;;) {
struct page *page;
unsigned long nr, ret;
/* nr is the maximum number of bytes to copy from this page */
- nr = PAGE_CACHE_SIZE;
+ nr = page_cache_size(mapping);
if (index >= end_index) {
if (index > end_index)
goto out;
- nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
+ nr = page_cache_offset(mapping, isize - 1) + 1;
if (nr <= offset) {
goto out;
}
@@ -935,7 +935,7 @@ page_ok:
* before reading the page on the kernel side.
*/
if (mapping_writably_mapped(mapping))
- flush_dcache_page(page);
+ flush_mapping_page(page);
/*
* When (part of) the same page is read multiple times
@@ -957,8 +957,8 @@ page_ok:
*/
ret = actor(desc, page, offset, nr);
offset += ret;
- index += offset >> PAGE_CACHE_SHIFT;
- offset &= ~PAGE_CACHE_MASK;
+ index += page_cache_index(mapping, offset);
+ offset = page_cache_offset(mapping, offset);
page_cache_release(page);
if (ret == nr && desc->count)
@@ -1022,16 +1022,16 @@ readpage:
* another truncate extends the file - this is desired though).
*/
isize = i_size_read(inode);
- end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+ end_index = page_cache_index(mapping, isize - 1);
if (unlikely(!isize || index > end_index)) {
page_cache_release(page);
goto out;
}
/* nr is the maximum number of bytes to copy from this page */
- nr = PAGE_CACHE_SIZE;
+ nr = page_cache_size(mapping);
if (index == end_index) {
- nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
+ nr = page_cache_offset(mapping, isize - 1) + 1;
if (nr <= offset) {
page_cache_release(page);
goto out;
@@ -1074,7 +1074,7 @@ no_cached_page:
out:
*_ra = ra;
- *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
+ *ppos = page_cache_pos(mapping, index, offset);
if (cached_page)
page_cache_release(cached_page);
if (filp)
@@ -1270,8 +1270,8 @@ asmlinkage ssize_t sys_readahead(int fd,
if (file) {
if (file->f_mode & FMODE_READ) {
struct address_space *mapping = file->f_mapping;
- unsigned long start = offset >> PAGE_CACHE_SHIFT;
- unsigned long end = (offset + count - 1) >> PAGE_CACHE_SHIFT;
+ unsigned long start = page_cache_index(mapping, offset);
+ unsigned long end = page_cache_index(mapping, offset + count - 1);
unsigned long len = end - start + 1;
ret = do_readahead(mapping, file, start, len);
}
@@ -2086,9 +2086,9 @@ generic_file_buffered_write(struct kiocb
unsigned long offset;
size_t copied;
- offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
- index = pos >> PAGE_CACHE_SHIFT;
- bytes = PAGE_CACHE_SIZE - offset;
+ offset = page_cache_offset(mapping, pos);
+ index = page_cache_index(mapping, pos);
+ bytes = page_cache_size(mapping) - offset;
/* Limit the size of the copy to the caller's write size */
bytes = min(bytes, count);
@@ -2149,7 +2149,7 @@ generic_file_buffered_write(struct kiocb
else
copied = filemap_copy_from_user_iovec(page, offset,
cur_iov, iov_base, bytes);
- flush_dcache_page(page);
+ flush_mapping_page(page);
status = a_ops->commit_write(file, page, offset, offset+bytes);
if (status == AOP_TRUNCATED_PAGE) {
page_cache_release(page);
@@ -2315,8 +2315,8 @@ __generic_file_aio_write_nolock(struct k
if (err == 0) {
written = written_buffered;
invalidate_mapping_pages(mapping,
- pos >> PAGE_CACHE_SHIFT,
- endbyte >> PAGE_CACHE_SHIFT);
+ page_cache_index(mapping, pos),
+ page_cache_index(mapping, endbyte));
} else {
/*
* We don't know how much we wrote, so just return
@@ -2403,7 +2403,7 @@ generic_file_direct_IO(int rw, struct ki
*/
if (rw == WRITE) {
write_len = iov_length(iov, nr_segs);
- end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT;
+ end = page_cache_index(mapping, offset + write_len - 1);
if (mapping_mapped(mapping))
unmap_mapping_range(mapping, offset, write_len, 0);
}
@@ -2420,7 +2420,7 @@ generic_file_direct_IO(int rw, struct ki
*/
if (rw == WRITE && mapping->nrpages) {
retval = invalidate_inode_pages2_range(mapping,
- offset >> PAGE_CACHE_SHIFT, end);
+ page_cache_index(mapping, offset), end);
if (retval)
goto out;
}
@@ -2438,7 +2438,7 @@ generic_file_direct_IO(int rw, struct ki
*/
if (rw == WRITE && mapping->nrpages) {
int err = invalidate_inode_pages2_range(mapping,
- offset >> PAGE_CACHE_SHIFT, end);
+ page_cache_index(mapping, offset), end);
if (err && retval >= 0)
retval = err;
}
* [RFC 10/16] Variable Order Page Cache: Readahead fixups
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
` (8 preceding siblings ...)
2007-04-23 6:49 ` [RFC 09/16] Variable Order Page Cache: Fix up mm/filemap.c Christoph Lameter
@ 2007-04-23 6:49 ` Christoph Lameter
2007-04-25 11:36 ` Mel Gorman
2007-04-23 6:49 ` [RFC 11/16] Variable Page Cache Size: Fix up reclaim counters Christoph Lameter
` (7 subsequent siblings)
17 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:49 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Christoph Lameter, Dave Hansen,
Mel Gorman, Avi Kivity
Variable Order Page Cache: Readahead fixups
Readahead is now dependent on the page size. For larger page sizes
we want fewer readahead pages.
Add a parameter to max_sane_readahead() specifying the page order
and update the code in mm/readahead.c to be aware of variable
page sizes.
Mark the 2M readahead constant as a potential future problem.
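For illustration of the scaling (assuming half of the node's inactive plus free
memory amounts to 10000 base pages): an order-0 mapping may read ahead up to
10000 pages as before, while an order-2 (16k) mapping is limited to
10000 >> 2 == 2500 of the larger pages, i.e. the same amount of memory.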
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/mm.h | 2 +-
mm/fadvise.c | 5 +++--
mm/filemap.c | 5 +++--
mm/madvise.c | 4 +++-
mm/readahead.c | 20 +++++++++++++-------
5 files changed, 23 insertions(+), 13 deletions(-)
Index: linux-2.6.21-rc7/include/linux/mm.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/mm.h 2007-04-22 21:48:22.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/mm.h 2007-04-22 22:04:44.000000000 -0700
@@ -1104,7 +1104,7 @@ unsigned long page_cache_readahead(struc
unsigned long size);
void handle_ra_miss(struct address_space *mapping,
struct file_ra_state *ra, pgoff_t offset);
-unsigned long max_sane_readahead(unsigned long nr);
+unsigned long max_sane_readahead(unsigned long nr, int order);
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
Index: linux-2.6.21-rc7/mm/fadvise.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/fadvise.c 2007-04-22 21:47:41.000000000 -0700
+++ linux-2.6.21-rc7/mm/fadvise.c 2007-04-22 22:04:44.000000000 -0700
@@ -86,10 +86,11 @@ asmlinkage long sys_fadvise64_64(int fd,
nrpages = end_index - start_index + 1;
if (!nrpages)
nrpages = ~0UL;
-
+
ret = force_page_cache_readahead(mapping, file,
start_index,
- max_sane_readahead(nrpages));
+ max_sane_readahead(nrpages,
+ mapping->order));
if (ret > 0)
ret = 0;
break;
Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-22 22:03:09.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c 2007-04-22 22:04:44.000000000 -0700
@@ -1256,7 +1256,7 @@ do_readahead(struct address_space *mappi
return -EINVAL;
force_page_cache_readahead(mapping, filp, index,
- max_sane_readahead(nr));
+ max_sane_readahead(nr, mapping->order));
return 0;
}
@@ -1391,7 +1391,8 @@ retry_find:
count_vm_event(PGMAJFAULT);
}
did_readaround = 1;
- ra_pages = max_sane_readahead(file->f_ra.ra_pages);
+ ra_pages = max_sane_readahead(file->f_ra.ra_pages,
+ mapping->order);
if (ra_pages) {
pgoff_t start = 0;
Index: linux-2.6.21-rc7/mm/madvise.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/madvise.c 2007-04-22 21:47:41.000000000 -0700
+++ linux-2.6.21-rc7/mm/madvise.c 2007-04-22 22:04:44.000000000 -0700
@@ -105,7 +105,9 @@ static long madvise_willneed(struct vm_a
end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
force_page_cache_readahead(file->f_mapping,
- file, start, max_sane_readahead(end - start));
+ file, start,
+ max_sane_readahead(end - start,
+ file->f_mapping->order));
return 0;
}
Index: linux-2.6.21-rc7/mm/readahead.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/readahead.c 2007-04-22 21:47:41.000000000 -0700
+++ linux-2.6.21-rc7/mm/readahead.c 2007-04-22 22:06:47.000000000 -0700
@@ -152,7 +152,7 @@ int read_cache_pages(struct address_spac
put_pages_list(pages);
break;
}
- task_io_account_read(PAGE_CACHE_SIZE);
+ task_io_account_read(page_cache_size(mapping));
}
pagevec_lru_add(&lru_pvec);
return ret;
@@ -276,7 +276,7 @@ __do_page_cache_readahead(struct address
if (isize == 0)
goto out;
- end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+ end_index = page_cache_index(mapping, isize - 1);
/*
* Preallocate as many pages as we will need.
@@ -330,7 +330,11 @@ int force_page_cache_readahead(struct ad
while (nr_to_read) {
int err;
- unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE;
+ /*
+ * FIXME: Note the 2M constant here that may prove to
+ * be a problem if page sizes become bigger than one megabyte.
+ */
+ unsigned long this_chunk = page_cache_index(mapping, 2 * 1024 * 1024);
if (this_chunk > nr_to_read)
this_chunk = nr_to_read;
@@ -570,11 +574,13 @@ void handle_ra_miss(struct address_space
}
/*
- * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
+ * Given a desired number of page order readahead pages, return a
* sensible upper limit.
*/
-unsigned long max_sane_readahead(unsigned long nr)
+unsigned long max_sane_readahead(unsigned long nr, int order)
{
- return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE)
- + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+ unsigned long base_pages = node_page_state(numa_node_id(), NR_INACTIVE)
+ + node_page_state(numa_node_id(), NR_FREE_PAGES);
+
+ return min(nr, (base_pages / 2) >> order);
}
* [RFC 11/16] Variable Page Cache Size: Fix up reclaim counters
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
` (9 preceding siblings ...)
2007-04-23 6:49 ` [RFC 10/16] Variable Order Page Cache: Readahead fixups Christoph Lameter
@ 2007-04-23 6:49 ` Christoph Lameter
2007-04-25 13:08 ` Mel Gorman
2007-04-23 6:49 ` [RFC 12/16] Variable Order Page Cache: Fix up the writeback logic Christoph Lameter
` (6 subsequent siblings)
17 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:49 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Jens Axboe, David Chinner,
Badari Pulavarty, Adam Litke, Christoph Lameter, Avi Kivity,
Mel Gorman, Dave Hansen
Variable Page Cache Size: Fix up reclaim counters
We can now reclaim larger pages. Adjust the VM counters
to deal with it.
Note that this does not currently make things work.
For some reason we keep losing pages off the active lists,
and reclaim stalls at some point attempting to remove
active pages from an empty active list.
It seems that the removal from the active lists happens
outside of reclaim?!
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/vmscan.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
Index: linux-2.6.21-rc7/mm/vmscan.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/vmscan.c 2007-04-22 06:50:03.000000000 -0700
+++ linux-2.6.21-rc7/mm/vmscan.c 2007-04-22 17:19:35.000000000 -0700
@@ -471,14 +471,14 @@ static unsigned long shrink_page_list(st
VM_BUG_ON(PageActive(page));
- sc->nr_scanned++;
+ sc->nr_scanned += base_pages(page);
if (!sc->may_swap && page_mapped(page))
goto keep_locked;
/* Double the slab pressure for mapped and swapcache pages */
if (page_mapped(page) || PageSwapCache(page))
- sc->nr_scanned++;
+ sc->nr_scanned += base_pages(page);
if (PageWriteback(page))
goto keep_locked;
@@ -581,7 +581,7 @@ static unsigned long shrink_page_list(st
free_it:
unlock_page(page);
- nr_reclaimed++;
+ nr_reclaimed += base_pages(page);
if (!pagevec_add(&freed_pvec, page))
__pagevec_release_nonlru(&freed_pvec);
continue;
@@ -627,7 +627,7 @@ static unsigned long isolate_lru_pages(u
struct page *page;
unsigned long scan;
- for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
+ for (scan = 0; scan < nr_to_scan && !list_empty(src); ) {
struct list_head *target;
page = lru_to_page(src);
prefetchw_prev_lru_page(page, src, flags);
@@ -644,10 +644,11 @@ static unsigned long isolate_lru_pages(u
*/
ClearPageLRU(page);
target = dst;
- nr_taken++;
+ nr_taken += base_pages(page);
} /* else it is being freed elsewhere */
list_add(&page->lru, target);
+ scan += base_pages(page);
}
*scanned = scan;
@@ -856,7 +857,7 @@ force_reclaim_mapped:
ClearPageActive(page);
list_move(&page->lru, &zone->inactive_list);
- pgmoved++;
+ pgmoved += base_pages(page);
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
spin_unlock_irq(&zone->lru_lock);
@@ -884,7 +885,7 @@ force_reclaim_mapped:
SetPageLRU(page);
VM_BUG_ON(!PageActive(page));
list_move(&page->lru, &zone->active_list);
- pgmoved++;
+ pgmoved += base_pages(page);
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
pgmoved = 0;
--
* [RFC 12/16] Variable Order Page Cache: Fix up the writeback logic
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
` (10 preceding siblings ...)
2007-04-23 6:49 ` [RFC 11/16] Variable Page Cache Size: Fix up reclaim counters Christoph Lameter
@ 2007-04-23 6:49 ` Christoph Lameter
2007-04-23 6:49 ` [RFC 13/16] Variable Order Page Cache: Fixes to the block layer Christoph Lameter
` (5 subsequent siblings)
17 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:49 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Christoph Lameter, Dave Hansen,
Mel Gorman, Avi Kivity
Variable Order Page Cache: Fix up the writeback logic
Nothing special here. Just the usual transformations.
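The pattern, sketched once here rather than per hunk, is to replace open-coded
PAGE_CACHE_SHIFT/PAGE_CACHE_SIZE arithmetic with the per-mapping helpers
introduced earlier in the series:
	/* before: fixed PAGE_CACHE_SIZE assumptions */
	index  = pos >> PAGE_CACHE_SHIFT;
	offset = pos & (PAGE_CACHE_SIZE - 1);

	/* after: honour the order of this particular mapping */
	index  = page_cache_index(mapping, pos);
	offset = page_cache_offset(mapping, pos);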
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/sync.c | 8 ++++----
mm/fadvise.c | 8 ++++----
mm/page-writeback.c | 4 ++--
mm/truncate.c | 23 ++++++++++++-----------
4 files changed, 22 insertions(+), 21 deletions(-)
Index: linux-2.6.21-rc7/mm/page-writeback.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/page-writeback.c 2007-04-22 21:47:34.000000000 -0700
+++ linux-2.6.21-rc7/mm/page-writeback.c 2007-04-22 22:08:35.000000000 -0700
@@ -606,8 +606,8 @@ int generic_writepages(struct address_sp
index = mapping->writeback_index; /* Start from prev offset */
end = -1;
} else {
- index = wbc->range_start >> PAGE_CACHE_SHIFT;
- end = wbc->range_end >> PAGE_CACHE_SHIFT;
+ index = page_cache_index(mapping, wbc->range_start);
+ end = page_cache_index(mapping, wbc->range_end);
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
scanned = 1;
Index: linux-2.6.21-rc7/fs/sync.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/sync.c 2007-04-22 21:47:34.000000000 -0700
+++ linux-2.6.21-rc7/fs/sync.c 2007-04-22 22:08:35.000000000 -0700
@@ -254,8 +254,8 @@ int do_sync_file_range(struct file *file
ret = 0;
if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
ret = wait_on_page_writeback_range(mapping,
- offset >> PAGE_CACHE_SHIFT,
- endbyte >> PAGE_CACHE_SHIFT);
+ page_cache_index(mapping, offset),
+ page_cache_index(mapping, endbyte));
if (ret < 0)
goto out;
}
@@ -269,8 +269,8 @@ int do_sync_file_range(struct file *file
if (flags & SYNC_FILE_RANGE_WAIT_AFTER) {
ret = wait_on_page_writeback_range(mapping,
- offset >> PAGE_CACHE_SHIFT,
- endbyte >> PAGE_CACHE_SHIFT);
+ page_cache_index(mapping, offset),
+ page_cache_index(mapping, endbyte));
}
out:
return ret;
Index: linux-2.6.21-rc7/mm/fadvise.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/fadvise.c 2007-04-22 22:04:44.000000000 -0700
+++ linux-2.6.21-rc7/mm/fadvise.c 2007-04-22 22:08:35.000000000 -0700
@@ -79,8 +79,8 @@ asmlinkage long sys_fadvise64_64(int fd,
}
/* First and last PARTIAL page! */
- start_index = offset >> PAGE_CACHE_SHIFT;
- end_index = endbyte >> PAGE_CACHE_SHIFT;
+ start_index = page_cache_index(mapping, offset);
+ end_index = page_cache_index(mapping, endbyte);
/* Careful about overflow on the "+1" */
nrpages = end_index - start_index + 1;
@@ -101,8 +101,8 @@ asmlinkage long sys_fadvise64_64(int fd,
filemap_flush(mapping);
/* First and last FULL page! */
- start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
- end_index = (endbyte >> PAGE_CACHE_SHIFT);
+ start_index = page_cache_next(mapping, offset);
+ end_index = page_cache_index(mapping, endbyte);
if (end_index >= start_index)
invalidate_mapping_pages(mapping, start_index,
Index: linux-2.6.21-rc7/mm/truncate.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/truncate.c 2007-04-22 21:47:34.000000000 -0700
+++ linux-2.6.21-rc7/mm/truncate.c 2007-04-22 22:11:19.000000000 -0700
@@ -46,7 +46,8 @@ void do_invalidatepage(struct page *page
static inline void truncate_partial_page(struct page *page, unsigned partial)
{
- memclear_highpage_flush(page, partial, PAGE_CACHE_SIZE-partial);
+ memclear_highpage_flush(page, partial,
+ (PAGE_SIZE << compound_order(page)) - partial);
if (PagePrivate(page))
do_invalidatepage(page, partial);
}
@@ -94,7 +95,7 @@ truncate_complete_page(struct address_sp
if (page->mapping != mapping)
return;
- cancel_dirty_page(page, PAGE_CACHE_SIZE);
+ cancel_dirty_page(page, page_cache_size(mapping));
if (PagePrivate(page))
do_invalidatepage(page, 0);
@@ -156,9 +157,9 @@ invalidate_complete_page(struct address_
void truncate_inode_pages_range(struct address_space *mapping,
loff_t lstart, loff_t lend)
{
- const pgoff_t start = (lstart + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
+ const pgoff_t start = page_cache_next(mapping, lstart);
pgoff_t end;
- const unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
+ const unsigned partial = page_cache_offset(mapping, lstart);
struct pagevec pvec;
pgoff_t next;
int i;
@@ -166,8 +167,9 @@ void truncate_inode_pages_range(struct a
if (mapping->nrpages == 0)
return;
- BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
- end = (lend >> PAGE_CACHE_SHIFT);
+ BUG_ON(page_cache_offset(mapping, lend) !=
+ page_cache_size(mapping) - 1);
+ end = page_cache_index(mapping, lend);
pagevec_init(&pvec, 0);
next = start;
@@ -402,9 +404,8 @@ int invalidate_inode_pages2_range(struct
* Zap the rest of the file in one hit.
*/
unmap_mapping_range(mapping,
- (loff_t)page_index<<PAGE_CACHE_SHIFT,
- (loff_t)(end - page_index + 1)
- << PAGE_CACHE_SHIFT,
+ page_cache_pos(mapping, page_index, 0),
+ page_cache_pos(mapping, end - page_index + 1, 0),
0);
did_range_unmap = 1;
} else {
@@ -412,8 +413,8 @@ int invalidate_inode_pages2_range(struct
* Just zap this page
*/
unmap_mapping_range(mapping,
- (loff_t)page_index<<PAGE_CACHE_SHIFT,
- PAGE_CACHE_SIZE, 0);
+ page_cache_pos(mapping, page_index, 0),
+ page_cache_size(mapping), 0);
}
}
ret = do_launder_page(mapping, page);
--
* [RFC 13/16] Variable Order Page Cache: Fixes to the block layer
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
` (11 preceding siblings ...)
2007-04-23 6:49 ` [RFC 12/16] Variable Order Page Cache: Fix up the writeback logic Christoph Lameter
@ 2007-04-23 6:49 ` Christoph Lameter
2007-04-23 6:49 ` [RFC 14/16] Variable Order Page Cache: Add support to ramfs Christoph Lameter
` (4 subsequent siblings)
17 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:49 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Jens Axboe, David Chinner,
Badari Pulavarty, Adam Litke, Christoph Lameter, Avi Kivity,
Mel Gorman, Dave Hansen
Variable Order Page Cache: Fixes to the block layer
Fix up (at least some pieces of) the block layer. It already has some
flexibility; extend that to larger page sizes.
set_blocksize() is changed to allow specifying a blocksize larger than a
page. If that occurs then we switch the device to use compound pages.
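For example, on a platform with a 4k PAGE_SIZE, setting a 16k blocksize on a
block device now switches its page cache to order-2 compound pages. A sketch of
the effect of the set_blocksize() hunk below:
	set_blocksize(bdev, 16384);
	/*
	 * blksize_bits(16384) == 14 and PAGE_SHIFT == 12, so the hunk
	 * below ends up calling set_mapping_order(mapping, 2): the
	 * device's page cache uses 16k compound pages from then on.
	 */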
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/block_dev.c | 22 ++++++---
fs/buffer.c | 101 +++++++++++++++++++++++---------------------
fs/inode.c | 5 +-
fs/mpage.c | 34 +++++++-------
include/linux/buffer_head.h | 9 +++
5 files changed, 100 insertions(+), 71 deletions(-)
Index: linux-2.6.21-rc7/include/linux/buffer_head.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/buffer_head.h 2007-04-22 21:47:33.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/buffer_head.h 2007-04-22 22:14:41.000000000 -0700
@@ -129,7 +129,14 @@ BUFFER_FNS(Ordered, ordered)
BUFFER_FNS(Eopnotsupp, eopnotsupp)
BUFFER_FNS(Unwritten, unwritten)
-#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
+static inline unsigned long bh_offset(struct buffer_head *bh)
+{
+ /* Cannot use the mapping since it may be set to NULL. */
+ unsigned long mask = ~(PAGE_MASK << compound_order(bh->b_page));
+
+ return (unsigned long)bh->b_data & mask;
+}
+
#define touch_buffer(bh) mark_page_accessed(bh->b_page)
/* If we *know* page->private refers to buffer_heads */
Index: linux-2.6.21-rc7/fs/block_dev.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/block_dev.c 2007-04-22 21:47:33.000000000 -0700
+++ linux-2.6.21-rc7/fs/block_dev.c 2007-04-22 22:11:44.000000000 -0700
@@ -60,12 +60,12 @@ static void kill_bdev(struct block_devic
{
invalidate_bdev(bdev, 1);
truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
-}
+}
int set_blocksize(struct block_device *bdev, int size)
{
- /* Size must be a power of two, and between 512 and PAGE_SIZE */
- if (size > PAGE_SIZE || size < 512 || (size & (size-1)))
+ /* Size must be a power of two, and at least 512 */
+ if (size < 512 || (size & (size-1)))
return -EINVAL;
/* Size cannot be smaller than the size supported by the device */
@@ -74,10 +74,16 @@ int set_blocksize(struct block_device *b
/* Don't change the size if it is same as current */
if (bdev->bd_block_size != size) {
+ int bits = blksize_bits(size);
+ struct address_space *mapping =
+ bdev->bd_inode->i_mapping;
+
sync_blockdev(bdev);
- bdev->bd_block_size = size;
- bdev->bd_inode->i_blkbits = blksize_bits(size);
kill_bdev(bdev);
+ bdev->bd_block_size = size;
+ bdev->bd_inode->i_blkbits = bits;
+ set_mapping_order(mapping,
+ bits < PAGE_SHIFT ? 0 : bits - PAGE_SHIFT);
}
return 0;
}
@@ -88,8 +94,10 @@ int sb_set_blocksize(struct super_block
{
if (set_blocksize(sb->s_bdev, size))
return 0;
- /* If we get here, we know size is power of two
- * and it's value is between 512 and PAGE_SIZE */
+ /*
+ * If we get here, we know size is a power of two
+ * and its value is at least 512
+ */
sb->s_blocksize = size;
sb->s_blocksize_bits = blksize_bits(size);
return sb->s_blocksize;
Index: linux-2.6.21-rc7/fs/buffer.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/buffer.c 2007-04-22 21:47:33.000000000 -0700
+++ linux-2.6.21-rc7/fs/buffer.c 2007-04-22 22:11:44.000000000 -0700
@@ -259,7 +259,7 @@ __find_get_block_slow(struct block_devic
struct page *page;
int all_mapped = 1;
- index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits);
+ index = block >> (page_cache_shift(bd_mapping) - bd_inode->i_blkbits);
page = find_get_page(bd_mapping, index);
if (!page)
goto out;
@@ -733,7 +733,7 @@ int __set_page_dirty_buffers(struct page
if (page->mapping) { /* Race with truncate? */
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
- task_io_account_write(PAGE_CACHE_SIZE);
+ task_io_account_write(page_cache_size(mapping));
}
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
@@ -879,10 +879,13 @@ struct buffer_head *alloc_page_buffers(s
{
struct buffer_head *bh, *head;
long offset;
+ unsigned page_size = page_cache_size(page->mapping);
+
+ BUG_ON(size > page_size);
try_again:
head = NULL;
- offset = PAGE_SIZE;
+ offset = page_size;
while ((offset -= size) >= 0) {
bh = alloc_buffer_head(GFP_NOFS);
if (!bh)
@@ -1080,7 +1083,7 @@ __getblk_slow(struct block_device *bdev,
{
/* Size must be multiple of hard sectorsize */
if (unlikely(size & (bdev_hardsect_size(bdev)-1) ||
- (size < 512 || size > PAGE_SIZE))) {
+ size < 512)) {
printk(KERN_ERR "getblk(): invalid block size %d requested\n",
size);
printk(KERN_ERR "hardsect size: %d\n",
@@ -1417,7 +1420,7 @@ void set_bh_page(struct buffer_head *bh,
struct page *page, unsigned long offset)
{
bh->b_page = page;
- BUG_ON(offset >= PAGE_SIZE);
+ VM_BUG_ON(offset >= page_cache_size(page->mapping));
if (PageHighMem(page))
/*
* This catches illegal uses and preserves the offset:
@@ -1766,8 +1769,8 @@ static int __block_prepare_write(struct
struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
BUG_ON(!PageLocked(page));
- BUG_ON(from > PAGE_CACHE_SIZE);
- BUG_ON(to > PAGE_CACHE_SIZE);
+ BUG_ON(from > page_cache_size(inode->i_mapping));
+ BUG_ON(to > page_cache_size(inode->i_mapping));
BUG_ON(from > to);
blocksize = 1 << inode->i_blkbits;
@@ -1776,7 +1779,7 @@ static int __block_prepare_write(struct
head = page_buffers(page);
bbits = inode->i_blkbits;
- block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
+ block = (sector_t)page->index << (page_cache_shift(inode->i_mapping) - bbits);
for(bh = head, block_start = 0; bh != head || !block_start;
block++, block_start=block_end, bh = bh->b_this_page) {
@@ -1934,7 +1937,7 @@ int block_read_full_page(struct page *pa
create_empty_buffers(page, blocksize, 0);
head = page_buffers(page);
- iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ iblock = (sector_t)page->index << (page_cache_shift(page->mapping) - inode->i_blkbits);
lblock = (i_size_read(inode)+blocksize-1) >> inode->i_blkbits;
bh = head;
nr = 0;
@@ -1957,7 +1960,7 @@ int block_read_full_page(struct page *pa
if (!buffer_mapped(bh)) {
void *kaddr = kmap_atomic(page, KM_USER0);
memset(kaddr + i * blocksize, 0, blocksize);
- flush_dcache_page(page);
+ flush_mapping_page(page);
kunmap_atomic(kaddr, KM_USER0);
if (!err)
set_buffer_uptodate(bh);
@@ -2058,10 +2061,11 @@ out:
int generic_cont_expand(struct inode *inode, loff_t size)
{
+ struct address_space *mapping = inode->i_mapping;
pgoff_t index;
unsigned int offset;
- offset = (size & (PAGE_CACHE_SIZE - 1)); /* Within page */
+ offset = page_cache_offset(mapping, size);
/* ugh. in prepare/commit_write, if from==to==start of block, we
** skip the prepare. make sure we never send an offset for the start
@@ -2071,7 +2075,7 @@ int generic_cont_expand(struct inode *in
/* caller must handle this extra byte. */
offset++;
}
- index = size >> PAGE_CACHE_SHIFT;
+ index = page_cache_index(mapping, size);
return __generic_cont_expand(inode, size, index, offset);
}
@@ -2079,8 +2083,8 @@ int generic_cont_expand(struct inode *in
int generic_cont_expand_simple(struct inode *inode, loff_t size)
{
loff_t pos = size - 1;
- pgoff_t index = pos >> PAGE_CACHE_SHIFT;
- unsigned int offset = (pos & (PAGE_CACHE_SIZE - 1)) + 1;
+ pgoff_t index = page_cache_index(inode->i_mapping, pos);
+ unsigned int offset = page_cache_offset(inode->i_mapping, pos) + 1;
/* prepare/commit_write can handle even if from==to==start of block. */
return __generic_cont_expand(inode, size, index, offset);
@@ -2103,31 +2107,32 @@ int cont_prepare_write(struct page *page
unsigned blocksize = 1 << inode->i_blkbits;
void *kaddr;
- while(page->index > (pgpos = *bytes>>PAGE_CACHE_SHIFT)) {
+ while(page->index > (pgpos = page_cache_index(mapping, *bytes))) {
status = -ENOMEM;
new_page = grab_cache_page(mapping, pgpos);
if (!new_page)
goto out;
/* we might sleep */
- if (*bytes>>PAGE_CACHE_SHIFT != pgpos) {
+ if (page_cache_index(mapping, *bytes) != pgpos) {
unlock_page(new_page);
page_cache_release(new_page);
continue;
}
- zerofrom = *bytes & ~PAGE_CACHE_MASK;
+ zerofrom = page_cache_offset(mapping, *bytes);
if (zerofrom & (blocksize-1)) {
*bytes |= (blocksize-1);
(*bytes)++;
}
status = __block_prepare_write(inode, new_page, zerofrom,
- PAGE_CACHE_SIZE, get_block);
+ page_cache_size(mapping), get_block);
if (status)
goto out_unmap;
+ /* Need higher order kmap?? */
kaddr = kmap_atomic(new_page, KM_USER0);
- memset(kaddr+zerofrom, 0, PAGE_CACHE_SIZE-zerofrom);
+ memset(kaddr+zerofrom, 0, page_cache_size(mapping)-zerofrom);
flush_dcache_page(new_page);
kunmap_atomic(kaddr, KM_USER0);
- generic_commit_write(NULL, new_page, zerofrom, PAGE_CACHE_SIZE);
+ generic_commit_write(NULL, new_page, zerofrom, page_cache_size(mapping));
unlock_page(new_page);
page_cache_release(new_page);
}
@@ -2137,7 +2142,7 @@ int cont_prepare_write(struct page *page
zerofrom = offset;
} else {
/* page covers the boundary, find the boundary offset */
- zerofrom = *bytes & ~PAGE_CACHE_MASK;
+ zerofrom = page_cache_offset(mapping, *bytes);
/* if we will expand the thing last block will be filled */
if (to > zerofrom && (zerofrom & (blocksize-1))) {
@@ -2192,8 +2197,9 @@ int block_commit_write(struct page *page
int generic_commit_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- struct inode *inode = page->mapping->host;
- loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
+ loff_t pos = page_cache_pos(mapping, page->index, to);
__block_commit_write(inode,page,from,to);
/*
* No need to use i_size_read() here, the i_size
@@ -2235,6 +2241,7 @@ static void end_buffer_read_nobh(struct
int nobh_prepare_write(struct page *page, unsigned from, unsigned to,
get_block_t *get_block)
{
+ struct address_space *mapping = page->mapping;
struct inode *inode = page->mapping->host;
const unsigned blkbits = inode->i_blkbits;
const unsigned blocksize = 1 << blkbits;
@@ -2242,6 +2249,7 @@ int nobh_prepare_write(struct page *page
struct buffer_head *read_bh[MAX_BUF_PER_PAGE];
unsigned block_in_page;
unsigned block_start;
+ unsigned page_size = page_cache_size(mapping);
sector_t block_in_file;
char *kaddr;
int nr_reads = 0;
@@ -2252,7 +2260,7 @@ int nobh_prepare_write(struct page *page
if (PageMappedToDisk(page))
return 0;
- block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
+ block_in_file = (sector_t)page->index << (page_cache_shift(mapping) - blkbits);
map_bh.b_page = page;
/*
@@ -2261,7 +2269,7 @@ int nobh_prepare_write(struct page *page
* page is fully mapped-to-disk.
*/
for (block_start = 0, block_in_page = 0;
- block_start < PAGE_CACHE_SIZE;
+ block_start < page_size;
block_in_page++, block_start += blocksize) {
unsigned block_end = block_start + blocksize;
int create;
@@ -2288,7 +2296,7 @@ int nobh_prepare_write(struct page *page
memset(kaddr+block_start, 0, from-block_start);
if (block_end > to)
memset(kaddr + to, 0, block_end - to);
- flush_dcache_page(page);
+ flush_mapping_page(page);
kunmap_atomic(kaddr, KM_USER0);
continue;
}
@@ -2356,8 +2364,8 @@ failed:
* so we'll later zero out any blocks which _were_ allocated.
*/
kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr, 0, PAGE_CACHE_SIZE);
- flush_dcache_page(page);
+ memset(kaddr, 0, page_size);
+ flush_mapping_page(page);
kunmap_atomic(kaddr, KM_USER0);
SetPageUptodate(page);
set_page_dirty(page);
@@ -2372,8 +2380,9 @@ EXPORT_SYMBOL(nobh_prepare_write);
int nobh_commit_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- struct inode *inode = page->mapping->host;
- loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
+ loff_t pos = page_cache_pos(mapping, page->index, to);
SetPageUptodate(page);
set_page_dirty(page);
@@ -2395,7 +2404,7 @@ int nobh_writepage(struct page *page, ge
{
struct inode * const inode = page->mapping->host;
loff_t i_size = i_size_read(inode);
- const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
+ const pgoff_t end_index = page_cache_index(page->mapping, i_size);
unsigned offset;
void *kaddr;
int ret;
@@ -2405,7 +2414,7 @@ int nobh_writepage(struct page *page, ge
goto out;
/* Is the page fully outside i_size? (truncate in progress) */
- offset = i_size & (PAGE_CACHE_SIZE-1);
+ offset = page_cache_offset(page->mapping, i_size);
if (page->index >= end_index+1 || !offset) {
/*
* The page may have dirty, unmapped buffers. For example,
@@ -2429,7 +2438,7 @@ int nobh_writepage(struct page *page, ge
* writes to that region are not written out to the file."
*/
kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
+ memset(kaddr + offset, 0, page_cache_size(page->mapping) - offset);
flush_dcache_page(page);
kunmap_atomic(kaddr, KM_USER0);
out:
@@ -2447,8 +2456,8 @@ int nobh_truncate_page(struct address_sp
{
struct inode *inode = mapping->host;
unsigned blocksize = 1 << inode->i_blkbits;
- pgoff_t index = from >> PAGE_CACHE_SHIFT;
- unsigned offset = from & (PAGE_CACHE_SIZE-1);
+ pgoff_t index = page_cache_index(mapping, from);
+ unsigned offset = page_cache_offset(mapping, from);
unsigned to;
struct page *page;
const struct address_space_operations *a_ops = mapping->a_ops;
@@ -2467,8 +2476,8 @@ int nobh_truncate_page(struct address_sp
ret = a_ops->prepare_write(NULL, page, offset, to);
if (ret == 0) {
kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
- flush_dcache_page(page);
+ memset(kaddr + offset, 0, page_cache_size(mapping) - offset);
+ flush_mapping_page(page);
kunmap_atomic(kaddr, KM_USER0);
/*
* It would be more correct to call aops->commit_write()
@@ -2487,8 +2496,8 @@ EXPORT_SYMBOL(nobh_truncate_page);
int block_truncate_page(struct address_space *mapping,
loff_t from, get_block_t *get_block)
{
- pgoff_t index = from >> PAGE_CACHE_SHIFT;
- unsigned offset = from & (PAGE_CACHE_SIZE-1);
+ pgoff_t index = page_cache_index(mapping, from);
+ unsigned offset = page_cache_offset(mapping, from);
unsigned blocksize;
sector_t iblock;
unsigned length, pos;
@@ -2506,7 +2515,7 @@ int block_truncate_page(struct address_s
return 0;
length = blocksize - length;
- iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ iblock = (sector_t)index << (page_cache_shift(mapping) - inode->i_blkbits);
page = grab_cache_page(mapping, index);
err = -ENOMEM;
@@ -2551,7 +2560,7 @@ int block_truncate_page(struct address_s
kaddr = kmap_atomic(page, KM_USER0);
memset(kaddr + offset, 0, length);
- flush_dcache_page(page);
+ flush_mapping_page(page);
kunmap_atomic(kaddr, KM_USER0);
mark_buffer_dirty(bh);
@@ -2572,7 +2581,7 @@ int block_write_full_page(struct page *p
{
struct inode * const inode = page->mapping->host;
loff_t i_size = i_size_read(inode);
- const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
+ const pgoff_t end_index = page_cache_index(page->mapping, i_size);
unsigned offset;
void *kaddr;
@@ -2581,7 +2590,7 @@ int block_write_full_page(struct page *p
return __block_write_full_page(inode, page, get_block, wbc);
/* Is the page fully outside i_size? (truncate in progress) */
- offset = i_size & (PAGE_CACHE_SIZE-1);
+ offset = page_cache_offset(page->mapping, i_size);
if (page->index >= end_index+1 || !offset) {
/*
* The page may have dirty, unmapped buffers. For example,
@@ -2601,8 +2610,8 @@ int block_write_full_page(struct page *p
* writes to that region are not written out to the file."
*/
kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
- flush_dcache_page(page);
+ memset(kaddr + offset, 0, page_cache_size(page->mapping) - offset);
+ flush_mapping_page(page);
kunmap_atomic(kaddr, KM_USER0);
return __block_write_full_page(inode, page, get_block, wbc);
}
@@ -2857,7 +2866,7 @@ int try_to_free_buffers(struct page *pag
* dirty bit from being lost.
*/
if (ret)
- cancel_dirty_page(page, PAGE_CACHE_SIZE);
+ cancel_dirty_page(page, page_cache_size(mapping));
spin_unlock(&mapping->private_lock);
out:
if (buffers_to_free) {
Index: linux-2.6.21-rc7/fs/inode.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/inode.c 2007-04-22 21:52:18.000000000 -0700
+++ linux-2.6.21-rc7/fs/inode.c 2007-04-22 22:11:44.000000000 -0700
@@ -145,7 +145,10 @@ static struct inode *alloc_inode(struct
mapping->a_ops = &empty_aops;
mapping->host = inode;
mapping->flags = 0;
- mapping->order = 0;
+ if (inode->i_blkbits > PAGE_SHIFT)
+ set_mapping_order(mapping, inode->i_blkbits - PAGE_SHIFT);
+ else
+ set_mapping_order(mapping, 0);
mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
Index: linux-2.6.21-rc7/fs/mpage.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/mpage.c 2007-04-22 21:47:33.000000000 -0700
+++ linux-2.6.21-rc7/fs/mpage.c 2007-04-22 22:11:44.000000000 -0700
@@ -133,7 +133,8 @@ mpage_alloc(struct block_device *bdev,
static void
map_buffer_to_page(struct page *page, struct buffer_head *bh, int page_block)
{
- struct inode *inode = page->mapping->host;
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
struct buffer_head *page_bh, *head;
int block = 0;
@@ -142,9 +143,9 @@ map_buffer_to_page(struct page *page, st
* don't make any buffers if there is only one buffer on
* the page and the page just needs to be set up to date
*/
- if (inode->i_blkbits == PAGE_CACHE_SHIFT &&
+ if (inode->i_blkbits == page_cache_shift(mapping) &&
buffer_uptodate(bh)) {
- SetPageUptodate(page);
+ SetPageUptodate(page);
return;
}
create_empty_buffers(page, 1 << inode->i_blkbits, 0);
@@ -177,9 +178,10 @@ do_mpage_readpage(struct bio *bio, struc
sector_t *last_block_in_bio, struct buffer_head *map_bh,
unsigned long *first_logical_block, get_block_t get_block)
{
- struct inode *inode = page->mapping->host;
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
const unsigned blkbits = inode->i_blkbits;
- const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
+ const unsigned blocks_per_page = page_cache_size(mapping) >> blkbits;
const unsigned blocksize = 1 << blkbits;
sector_t block_in_file;
sector_t last_block;
@@ -196,7 +198,7 @@ do_mpage_readpage(struct bio *bio, struc
if (page_has_buffers(page))
goto confused;
- block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
+ block_in_file = (sector_t)page->index << (page_cache_shift(mapping) - blkbits);
last_block = block_in_file + nr_pages * blocks_per_page;
last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
if (last_block > last_block_in_file)
@@ -286,8 +288,8 @@ do_mpage_readpage(struct bio *bio, struc
if (first_hole != blocks_per_page) {
char *kaddr = kmap_atomic(page, KM_USER0);
memset(kaddr + (first_hole << blkbits), 0,
- PAGE_CACHE_SIZE - (first_hole << blkbits));
- flush_dcache_page(page);
+ page_cache_size(mapping) - (first_hole << blkbits));
+ flush_mapping_page(page);
kunmap_atomic(kaddr, KM_USER0);
if (first_hole == 0) {
SetPageUptodate(page);
@@ -465,7 +467,7 @@ __mpage_writepage(struct bio *bio, struc
struct inode *inode = page->mapping->host;
const unsigned blkbits = inode->i_blkbits;
unsigned long end_index;
- const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
+ const unsigned blocks_per_page = page_cache_size(mapping) >> blkbits;
sector_t last_block;
sector_t block_in_file;
sector_t blocks[MAX_BUF_PER_PAGE];
@@ -533,7 +535,7 @@ __mpage_writepage(struct bio *bio, struc
* The page has no buffers: map it to disk
*/
BUG_ON(!PageUptodate(page));
- block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
+ block_in_file = (sector_t)page->index << (page_cache_shift(mapping) - blkbits);
last_block = (i_size - 1) >> blkbits;
map_bh.b_page = page;
for (page_block = 0; page_block < blocks_per_page; ) {
@@ -565,7 +567,7 @@ __mpage_writepage(struct bio *bio, struc
first_unmapped = page_block;
page_is_mapped:
- end_index = i_size >> PAGE_CACHE_SHIFT;
+ end_index = page_cache_index(mapping, i_size);
if (page->index >= end_index) {
/*
* The page straddles i_size. It must be zeroed out on each
@@ -575,14 +577,14 @@ page_is_mapped:
* is zeroed when mapped, and writes to that region are not
* written out to the file."
*/
- unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
+ unsigned offset = page_cache_offset(mapping, i_size);
char *kaddr;
if (page->index > end_index || !offset)
goto confused;
kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
- flush_dcache_page(page);
+ memset(kaddr + offset, 0, page_cache_size(mapping) - offset);
+ flush_mapping_page(page);
kunmap_atomic(kaddr, KM_USER0);
}
@@ -727,8 +729,8 @@ mpage_writepages(struct address_space *m
index = mapping->writeback_index; /* Start from prev offset */
end = -1;
} else {
- index = wbc->range_start >> PAGE_CACHE_SHIFT;
- end = wbc->range_end >> PAGE_CACHE_SHIFT;
+ index = page_cache_index(mapping, wbc->range_start);
+ end = page_cache_index(mapping, wbc->range_end);
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
scanned = 1;
--
* [RFC 14/16] Variable Order Page Cache: Add support to ramfs
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
` (12 preceding siblings ...)
2007-04-23 6:49 ` [RFC 13/16] Variable Order Page Cache: Fixes to the block layer Christoph Lameter
@ 2007-04-23 6:49 ` Christoph Lameter
2007-04-23 6:50 ` [RFC 15/16] ext2: Add variable page size support Christoph Lameter
` (3 subsequent siblings)
17 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:49 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Christoph Lameter, Dave Hansen,
Mel Gorman, Avi Kivity
Variable Order Page Cache: Add support to ramfs
The simplest file system to use is ramfs. Add a mount parameter that
specifies the order of the pages that ramfs should use. If the
order is greater than zero then disable mmap functionality.
This could be removed if the VM were changed to support faulting in
higher order pages, but for now we are content with buffered I/O on higher
order pages.
Note that ramfs does not use the lower layers (buffer I/O etc.) so it is
the safest filesystem to use right now.
If you apply this patch then you can, for example, try this:
mount -tramfs -o10 none /media
Mounts a ramfs filesystem with order 10 pages (4 MB)
cp linux-2.6.21-rc7.tar.gz /media
Populates the ramfs. Note that we allocate 14 pages of 4 MB each
instead of 13508 4k pages.
umount /media
Gets rid of the large pages again.
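Rough arithmetic behind those numbers, assuming 4k base pages:
	page size at order 10:  4k << 10 = 4MB
	~53MB tarball:          14 order-10 pages (56MB allocated)
	                        vs. roughly 13508 order-0 (4k) pages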
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/ramfs/file-mmu.c | 11 +++++++++++
fs/ramfs/inode.c | 15 ++++++++++++---
include/linux/ramfs.h | 1 +
3 files changed, 24 insertions(+), 3 deletions(-)
Index: linux-2.6.21-rc7/fs/ramfs/file-mmu.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/ramfs/file-mmu.c 2007-04-18 21:46:38.000000000 -0700
+++ linux-2.6.21-rc7/fs/ramfs/file-mmu.c 2007-04-18 22:02:03.000000000 -0700
@@ -45,6 +45,17 @@ const struct file_operations ramfs_file_
.llseek = generic_file_llseek,
};
+/* Higher order mappings do not support mmap */
+const struct file_operations ramfs_file_higher_order_operations = {
+ .read = do_sync_read,
+ .aio_read = generic_file_aio_read,
+ .write = do_sync_write,
+ .aio_write = generic_file_aio_write,
+ .fsync = simple_sync_file,
+ .sendfile = generic_file_sendfile,
+ .llseek = generic_file_llseek,
+};
+
const struct inode_operations ramfs_file_inode_operations = {
.getattr = simple_getattr,
};
Index: linux-2.6.21-rc7/fs/ramfs/inode.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/ramfs/inode.c 2007-04-18 21:46:38.000000000 -0700
+++ linux-2.6.21-rc7/fs/ramfs/inode.c 2007-04-18 22:02:03.000000000 -0700
@@ -61,6 +61,7 @@ struct inode *ramfs_get_inode(struct sup
inode->i_blocks = 0;
inode->i_mapping->a_ops = &ramfs_aops;
inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
+ inode->i_mapping->order = sb->s_blocksize_bits - PAGE_CACHE_SHIFT;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
default:
@@ -68,7 +69,10 @@ struct inode *ramfs_get_inode(struct sup
break;
case S_IFREG:
inode->i_op = &ramfs_file_inode_operations;
- inode->i_fop = &ramfs_file_operations;
+ if (inode->i_mapping->order)
+ inode->i_fop = &ramfs_file_higher_order_operations;
+ else
+ inode->i_fop = &ramfs_file_operations;
break;
case S_IFDIR:
inode->i_op = &ramfs_dir_inode_operations;
@@ -164,10 +168,15 @@ static int ramfs_fill_super(struct super
{
struct inode * inode;
struct dentry * root;
+ int order = 0;
+ char *options = data;
+
+ if (options && *options)
+ order = simple_strtoul(options, NULL, 10);
sb->s_maxbytes = MAX_LFS_FILESIZE;
- sb->s_blocksize = PAGE_CACHE_SIZE;
- sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
+ sb->s_blocksize = PAGE_CACHE_SIZE << order;
+ sb->s_blocksize_bits = order + PAGE_CACHE_SHIFT;
sb->s_magic = RAMFS_MAGIC;
sb->s_op = &ramfs_ops;
sb->s_time_gran = 1;
Index: linux-2.6.21-rc7/include/linux/ramfs.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/ramfs.h 2007-04-18 21:46:38.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/ramfs.h 2007-04-18 22:02:03.000000000 -0700
@@ -16,6 +16,7 @@ extern int ramfs_nommu_mmap(struct file
#endif
extern const struct file_operations ramfs_file_operations;
+extern const struct file_operations ramfs_file_higher_order_operations;
extern struct vm_operations_struct generic_file_vm_ops;
extern int __init init_rootfs(void);
--
* [RFC 15/16] ext2: Add variable page size support
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
` (13 preceding siblings ...)
2007-04-23 6:49 ` [RFC 14/16] Variable Order Page Cache: Add support to ramfs Christoph Lameter
@ 2007-04-23 6:50 ` Christoph Lameter
2007-04-23 16:30 ` Badari Pulavarty
2007-04-23 6:50 ` [RFC 16/16] Variable Order Page Cache: Alternate implementation of page cache macros Christoph Lameter
` (2 subsequent siblings)
17 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:50 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Jens Axboe, David Chinner,
Badari Pulavarty, Adam Litke, Christoph Lameter, Avi Kivity,
Mel Gorman, Dave Hansen
This adds variable page size support. It is then possible to mount filesystems
that have a larger blocksize than the page size.
For example, the following is possible on x86_64 and i386, which have only a 4k page
size.
mke2fs -b 16384 /dev/hdd2 <Ignore warning about too large block size>
mount /dev/hdd2 /media
ls -l /media
.... Do more things with the volume, which uses a 16k page cache size on
a 4k page sized platform.
Note that there are issues with ext2 support:
1. Data is not written back correctly (block layer?)
2. Reclaim does not work right.
We also disable mmap for higher order pages, as is done for ramfs. This
is temporary until we get support for mmapping higher order pages.
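To spell out the effect of the example above, assuming a 4k PAGE_SIZE
architecture: a 16k blocksize means i_blkbits is 14, so the inode's mapping ends
up with order 2 (via the fs/inode.c hunk in the previous patch) and the page
cache helpers then operate on 16k units. A sketch, not the exact code path:
	/* sketch: how the order is derived for this filesystem */
	set_mapping_order(inode->i_mapping, inode->i_blkbits - PAGE_SHIFT);	/* 14 - 12 = 2 */

	page_cache_size(inode->i_mapping);		/* 16384 */
	page_cache_shift(inode->i_mapping);		/* 14 */
	page_cache_offset(inode->i_mapping, pos);	/* pos & 16383 */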
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/ext2/dir.c | 40 +++++++++++++++++++++++-----------------
fs/ext2/ext2.h | 1 +
fs/ext2/file.c | 18 ++++++++++++++++++
fs/ext2/inode.c | 10 ++++++++--
fs/ext2/namei.c | 10 ++++++++--
5 files changed, 58 insertions(+), 21 deletions(-)
Index: linux-2.6.21-rc7/fs/ext2/dir.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/ext2/dir.c 2007-04-22 19:43:05.000000000 -0700
+++ linux-2.6.21-rc7/fs/ext2/dir.c 2007-04-22 20:09:57.000000000 -0700
@@ -44,7 +44,8 @@ static inline void ext2_put_page(struct
static inline unsigned long dir_pages(struct inode *inode)
{
- return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT;
+ return (inode->i_size+page_cache_size(inode->i_mapping)-1)>>
+ page_cache_shift(inode->i_mapping);
}
/*
@@ -55,10 +56,11 @@ static unsigned
ext2_last_byte(struct inode *inode, unsigned long page_nr)
{
unsigned last_byte = inode->i_size;
+ struct address_space *mapping = inode->i_mapping;
- last_byte -= page_nr << PAGE_CACHE_SHIFT;
- if (last_byte > PAGE_CACHE_SIZE)
- last_byte = PAGE_CACHE_SIZE;
+ last_byte -= page_nr << page_cache_shift(mapping);
+ if (last_byte > page_cache_size(mapping))
+ last_byte = page_cache_size(mapping);
return last_byte;
}
@@ -77,18 +79,19 @@ static int ext2_commit_chunk(struct page
static void ext2_check_page(struct page *page)
{
- struct inode *dir = page->mapping->host;
+ struct address_space *mapping = page->mapping;
+ struct inode *dir = mapping->host;
struct super_block *sb = dir->i_sb;
unsigned chunk_size = ext2_chunk_size(dir);
char *kaddr = page_address(page);
u32 max_inumber = le32_to_cpu(EXT2_SB(sb)->s_es->s_inodes_count);
unsigned offs, rec_len;
- unsigned limit = PAGE_CACHE_SIZE;
+ unsigned limit = page_cache_size(mapping);
ext2_dirent *p;
char *error;
- if ((dir->i_size >> PAGE_CACHE_SHIFT) == page->index) {
- limit = dir->i_size & ~PAGE_CACHE_MASK;
+ if (page_cache_index(mapping, dir->i_size) == page->index) {
+ limit = page_cache_offset(mapping, dir->i_size);
if (limit & (chunk_size - 1))
goto Ebadsize;
if (!limit)
@@ -140,7 +143,7 @@ Einumber:
bad_entry:
ext2_error (sb, "ext2_check_page", "bad entry in directory #%lu: %s - "
"offset=%lu, inode=%lu, rec_len=%d, name_len=%d",
- dir->i_ino, error, (page->index<<PAGE_CACHE_SHIFT)+offs,
+ dir->i_ino, error, page_cache_pos(mapping, page->index, offs),
(unsigned long) le32_to_cpu(p->inode),
rec_len, p->name_len);
goto fail;
@@ -149,7 +152,7 @@ Eend:
ext2_error (sb, "ext2_check_page",
"entry in directory #%lu spans the page boundary"
"offset=%lu, inode=%lu",
- dir->i_ino, (page->index<<PAGE_CACHE_SHIFT)+offs,
+ dir->i_ino, page_cache_pos(mapping, page->index, offs),
(unsigned long) le32_to_cpu(p->inode));
fail:
SetPageChecked(page);
@@ -250,8 +253,9 @@ ext2_readdir (struct file * filp, void *
loff_t pos = filp->f_pos;
struct inode *inode = filp->f_path.dentry->d_inode;
struct super_block *sb = inode->i_sb;
- unsigned int offset = pos & ~PAGE_CACHE_MASK;
- unsigned long n = pos >> PAGE_CACHE_SHIFT;
+ struct address_space *mapping = inode->i_mapping;
+ unsigned int offset = page_cache_offset(mapping, pos);
+ unsigned long n = page_cache_index(mapping, pos);
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(ext2_chunk_size(inode)-1);
unsigned char *types = NULL;
@@ -272,14 +276,14 @@ ext2_readdir (struct file * filp, void *
ext2_error(sb, __FUNCTION__,
"bad page in #%lu",
inode->i_ino);
- filp->f_pos += PAGE_CACHE_SIZE - offset;
+ filp->f_pos += page_cache_size(mapping) - offset;
return -EIO;
}
kaddr = page_address(page);
if (unlikely(need_revalidate)) {
if (offset) {
offset = ext2_validate_entry(kaddr, offset, chunk_mask);
- filp->f_pos = (n<<PAGE_CACHE_SHIFT) + offset;
+ filp->f_pos = page_cache_pos(mapping, n, offset);
}
filp->f_version = inode->i_version;
need_revalidate = 0;
@@ -302,7 +306,7 @@ ext2_readdir (struct file * filp, void *
offset = (char *)de - kaddr;
over = filldir(dirent, de->name, de->name_len,
- (n<<PAGE_CACHE_SHIFT) | offset,
+ page_cache_pos(mapping, n, offset),
le32_to_cpu(de->inode), d_type);
if (over) {
ext2_put_page(page);
@@ -328,6 +332,7 @@ struct ext2_dir_entry_2 * ext2_find_entr
struct dentry *dentry, struct page ** res_page)
{
const char *name = dentry->d_name.name;
+ struct address_space *mapping = dir->i_mapping;
int namelen = dentry->d_name.len;
unsigned reclen = EXT2_DIR_REC_LEN(namelen);
unsigned long start, n;
@@ -369,7 +374,7 @@ struct ext2_dir_entry_2 * ext2_find_entr
if (++n >= npages)
n = 0;
/* next page is past the blocks we've got */
- if (unlikely(n > (dir->i_blocks >> (PAGE_CACHE_SHIFT - 9)))) {
+ if (unlikely(n > (dir->i_blocks >> (page_cache_shift(mapping) - 9)))) {
ext2_error(dir->i_sb, __FUNCTION__,
"dir %lu size %lld exceeds block count %llu",
dir->i_ino, dir->i_size,
@@ -438,6 +443,7 @@ void ext2_set_link(struct inode *dir, st
int ext2_add_link (struct dentry *dentry, struct inode *inode)
{
struct inode *dir = dentry->d_parent->d_inode;
+ struct address_space *mapping = inode->i_mapping;
const char *name = dentry->d_name.name;
int namelen = dentry->d_name.len;
unsigned chunk_size = ext2_chunk_size(dir);
@@ -467,7 +473,7 @@ int ext2_add_link (struct dentry *dentry
kaddr = page_address(page);
dir_end = kaddr + ext2_last_byte(dir, n);
de = (ext2_dirent *)kaddr;
- kaddr += PAGE_CACHE_SIZE - reclen;
+ kaddr += page_cache_size(mapping) - reclen;
while ((char *)de <= kaddr) {
if ((char *)de == dir_end) {
/* We hit i_size */
Index: linux-2.6.21-rc7/fs/ext2/ext2.h
===================================================================
--- linux-2.6.21-rc7.orig/fs/ext2/ext2.h 2007-04-22 19:43:05.000000000 -0700
+++ linux-2.6.21-rc7/fs/ext2/ext2.h 2007-04-22 19:44:22.000000000 -0700
@@ -160,6 +160,7 @@ extern const struct file_operations ext2
/* file.c */
extern const struct inode_operations ext2_file_inode_operations;
extern const struct file_operations ext2_file_operations;
+extern const struct file_operations ext2_no_mmap_file_operations;
extern const struct file_operations ext2_xip_file_operations;
/* inode.c */
Index: linux-2.6.21-rc7/fs/ext2/file.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/ext2/file.c 2007-04-22 19:43:05.000000000 -0700
+++ linux-2.6.21-rc7/fs/ext2/file.c 2007-04-22 19:44:22.000000000 -0700
@@ -58,6 +58,24 @@ const struct file_operations ext2_file_o
.splice_write = generic_file_splice_write,
};
+const struct file_operations ext2_no_mmap_file_operations = {
+ .llseek = generic_file_llseek,
+ .read = do_sync_read,
+ .write = do_sync_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
+ .ioctl = ext2_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = ext2_compat_ioctl,
+#endif
+ .open = generic_file_open,
+ .release = ext2_release_file,
+ .fsync = ext2_sync_file,
+ .sendfile = generic_file_sendfile,
+ .splice_read = generic_file_splice_read,
+ .splice_write = generic_file_splice_write,
+};
+
#ifdef CONFIG_EXT2_FS_XIP
const struct file_operations ext2_xip_file_operations = {
.llseek = generic_file_llseek,
Index: linux-2.6.21-rc7/fs/ext2/inode.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/ext2/inode.c 2007-04-22 19:43:05.000000000 -0700
+++ linux-2.6.21-rc7/fs/ext2/inode.c 2007-04-22 19:44:22.000000000 -0700
@@ -1128,10 +1128,16 @@ void ext2_read_inode (struct inode * ino
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
- inode->i_fop = &ext2_file_operations;
+ if (inode->i_mapping->order)
+ inode->i_fop = &ext2_no_mmap_file_operations;
+ else
+ inode->i_fop = &ext2_file_operations;
} else {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_file_operations;
+ if (inode->i_mapping->order)
+ inode->i_fop = &ext2_no_mmap_file_operations;
+ else
+ inode->i_fop = &ext2_file_operations;
}
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &ext2_dir_inode_operations;
Index: linux-2.6.21-rc7/fs/ext2/namei.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/ext2/namei.c 2007-04-22 19:43:05.000000000 -0700
+++ linux-2.6.21-rc7/fs/ext2/namei.c 2007-04-22 19:44:22.000000000 -0700
@@ -114,10 +114,16 @@ static int ext2_create (struct inode * d
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
- inode->i_fop = &ext2_file_operations;
+ if (inode->i_mapping->order)
+ inode->i_fop = &ext2_no_mmap_file_operations;
+ else
+ inode->i_fop = &ext2_file_operations;
} else {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_file_operations;
+ if (inode->i_mapping->order)
+ inode->i_fop = &ext2_no_mmap_file_operations;
+ else
+ inode->i_fop = &ext2_file_operations;
}
mark_inode_dirty(inode);
err = ext2_add_nondir(dentry, inode);
--
* [RFC 16/16] Variable Order Page Cache: Alternate implementation of page cache macros
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
` (14 preceding siblings ...)
2007-04-23 6:50 ` [RFC 15/16] ext2: Add variable page size support Christoph Lameter
@ 2007-04-23 6:50 ` Christoph Lameter
2007-04-25 13:16 ` Mel Gorman
2007-04-23 9:23 ` [RFC 00/16] Variable Order Page Cache Patchset V2 David Chinner
2007-04-23 9:31 ` David Chinner
17 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2007-04-23 6:50 UTC (permalink / raw)
To: linux-mm
Cc: William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Christoph Lameter, Dave Hansen,
Mel Gorman, Avi Kivity
Variable Order Page Cache: Alternate implementation of page cache macros
Implement the page cache macros in a more efficient way by storing key
values in the mapping. This reduces code size but increases inode size.
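The cached fields stay consistent by construction; for a mapping of order N
(sketch of the invariants maintained by the set_mapping_order() hunk below):
	shift       == PAGE_SHIFT + N
	offset_mask == (1 << shift) - 1

	page_cache_size(a)        == offset_mask + 1   == PAGE_SIZE << N
	page_cache_mask(a)        == ~offset_mask
	page_cache_offset(a, pos) == pos & offset_mask
	page_cache_next(a, pos)   == page_cache_index(a, pos + offset_mask)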
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/fs.h | 4 +++-
include/linux/pagemap.h | 13 +++++++------
2 files changed, 10 insertions(+), 7 deletions(-)
Index: linux-2.6.21-rc7/include/linux/fs.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/fs.h 2007-04-22 19:43:01.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/fs.h 2007-04-22 19:44:29.000000000 -0700
@@ -435,7 +435,9 @@ struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
rwlock_t tree_lock; /* and rwlock protecting it */
- unsigned int order; /* Page order in this space */
+ unsigned int shift; /* Shift to get to the page number */
+ unsigned int order; /* Page order for allocations */
+ loff_t offset_mask; /* To mask out offset in page */
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
Index: linux-2.6.21-rc7/include/linux/pagemap.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-22 19:44:16.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-22 19:46:23.000000000 -0700
@@ -42,7 +42,8 @@ static inline void mapping_set_gfp_mask(
static inline void set_mapping_order(struct address_space *m, int order)
{
m->order = order;
-
+ m->shift = order + PAGE_SHIFT;
+ m->offset_mask = (1UL << m->shift) -1;
if (order)
m->flags |= __GFP_COMP;
else
@@ -64,23 +65,23 @@ static inline void set_mapping_order(str
static inline int page_cache_shift(struct address_space *a)
{
- return a->order + PAGE_SHIFT;
+ return a->shift;
}
static inline unsigned int page_cache_size(struct address_space *a)
{
- return PAGE_SIZE << a->order;
+ return a->offset_mask + 1;
}
static inline loff_t page_cache_mask(struct address_space *a)
{
- return (loff_t)PAGE_MASK << a->order;
+ return ~(loff_t)a->offset_mask;
}
static inline unsigned int page_cache_offset(struct address_space *a,
loff_t pos)
{
- return pos & ~(PAGE_MASK << a->order);
+ return pos & a->offset_mask;
}
static inline pgoff_t page_cache_index(struct address_space *a,
@@ -95,7 +96,7 @@ static inline pgoff_t page_cache_index(s
static inline pgoff_t page_cache_next(struct address_space *a,
loff_t pos)
{
- return page_cache_index(a, pos + page_cache_size(a) - 1);
+ return page_cache_index(a, pos + a->offset_mask);
}
static inline loff_t page_cache_pos(struct address_space *a,
--
* Re: [RFC 00/16] Variable Order Page Cache Patchset V2
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
` (15 preceding siblings ...)
2007-04-23 6:50 ` [RFC 16/16] Variable Order Page Cache: Alternate implementation of page cache macros Christoph Lameter
@ 2007-04-23 9:23 ` David Chinner
2007-04-23 9:31 ` David Chinner
17 siblings, 0 replies; 40+ messages in thread
From: David Chinner @ 2007-04-23 9:23 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Dave Hansen, Mel Gorman, Avi Kivity
On Sun, Apr 22, 2007 at 11:48:45PM -0700, Christoph Lameter wrote:
> Sorry for the earlier mail. quilt and exim not cooperating.
>
> RFC V1->V2
> - Some ext2 support
> - Some block layer, fs layer support etc.
> - Better page cache macros
> - Use macros to clean up code.
I have this running on x86_64 UML with XFS. I've tested 16k and 64k
block size using fsx with mmap operations turned off. It survives
at least 100,000 operations without problems now.
You need to apply a fix to memclear_highpage_flush(), otherwise
it bugs out on the first partial page truncate. I've attached
my hack below. Christoph, there are header file inclusion order
problems with using your new wrappers here, which is why I
open coded it. I'll leave it for you to solve ;)
I'll attach the XFS patch in another email.....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
---
include/linux/highmem.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6.21-rc7/include/linux/highmem.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/highmem.h 2007-04-23 18:46:20.917655632 +1000
+++ linux-2.6.21-rc7/include/linux/highmem.h 2007-04-23 18:48:20.047323146 +1000
@@ -88,7 +88,7 @@ static inline void memclear_highpage_flu
{
void *kaddr;
- BUG_ON(offset + size > PAGE_SIZE);
+ BUG_ON(offset + size > (PAGE_SIZE << page->mapping->order));
kaddr = kmap_atomic(page, KM_USER0);
memset((char *)kaddr + offset, 0, size);
--
* Re: [RFC 00/16] Variable Order Page Cache Patchset V2
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
` (16 preceding siblings ...)
2007-04-23 9:23 ` [RFC 00/16] Variable Order Page Cache Patchset V2 David Chinner
@ 2007-04-23 9:31 ` David Chinner
17 siblings, 0 replies; 40+ messages in thread
From: David Chinner @ 2007-04-23 9:31 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Dave Hansen, Mel Gorman, Avi Kivity
On Sun, Apr 22, 2007 at 11:48:45PM -0700, Christoph Lameter wrote:
> Sorry for the earlier mail. quilt and exim not cooperating.
>
> RFC V1->V2
> - Some ext2 support
> - Some block layer, fs layer support etc.
> - Better page cache macros
> - Use macros to clean up code.
....
> Future:
> - Expect several more RFCs
> - We hope for XFS support soon
Patch is attached that converts the XFS data path to use large order
page cache pages.
I haven't tested this on a real system yet but it works on UML. I've
tested it with fsx and it seems to do everything it is supposed to.
Data is actually written to the block device as it persists across
mount and unmount, so that appears to be working as well.
> - Lets try to keep scope as small as possible.
Hence I haven't tried to convert anything on the metadata side
of XFS to use the high order page cache - the XFS buffer cache
takes care of that for us right now and it's not a simple
change like the data path is.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
---
fs/xfs/linux-2.6/xfs_aops.c | 53 ++++++++++++++++++++++---------------------
fs/xfs/linux-2.6/xfs_file.c | 22 +++++++++++++++++
fs/xfs/linux-2.6/xfs_iops.h | 1
fs/xfs/linux-2.6/xfs_lrw.c | 6 ++--
fs/xfs/linux-2.6/xfs_super.c | 5 +++-
fs/xfs/xfs_mount.c | 13 ----------
6 files changed, 58 insertions(+), 42 deletions(-)
Index: linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-04-23 17:09:54.719098744 +1000
+++ linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_aops.c 2007-04-23 19:03:32.300063145 +1000
@@ -74,7 +74,7 @@ xfs_page_trace(
xfs_inode_t *ip;
bhv_vnode_t *vp = vn_from_inode(inode);
loff_t isize = i_size_read(inode);
- loff_t offset = page_offset(page);
+ loff_t offset = page_cache_pos(page->mapping, page->index, 0);
int delalloc = -1, unmapped = -1, unwritten = -1;
if (page_has_buffers(page))
@@ -547,7 +547,7 @@ xfs_probe_page(
break;
} while ((bh = bh->b_this_page) != head);
} else
- ret = mapped ? 0 : PAGE_CACHE_SIZE;
+ ret = mapped ? 0 : page_cache_size(page->mapping);
}
return ret;
@@ -574,7 +574,7 @@ xfs_probe_cluster(
} while ((bh = bh->b_this_page) != head);
/* if we reached the end of the page, sum forwards in following pages */
- tlast = i_size_read(inode) >> PAGE_CACHE_SHIFT;
+ tlast = page_cache_index(inode->i_mapping, i_size_read(inode));
tindex = startpage->index + 1;
/* Prune this back to avoid pathological behavior */
@@ -592,14 +592,14 @@ xfs_probe_cluster(
size_t pg_offset, len = 0;
if (tindex == tlast) {
- pg_offset =
- i_size_read(inode) & (PAGE_CACHE_SIZE - 1);
+ pg_offset = page_cache_offset(inode->i_mapping,
+ i_size_read(inode));
if (!pg_offset) {
done = 1;
break;
}
} else
- pg_offset = PAGE_CACHE_SIZE;
+ pg_offset = page_cache_size(inode->i_mapping);
if (page->index == tindex && !TestSetPageLocked(page)) {
len = xfs_probe_page(page, pg_offset, mapped);
@@ -681,7 +681,8 @@ xfs_convert_page(
int bbits = inode->i_blkbits;
int len, page_dirty;
int count = 0, done = 0, uptodate = 1;
- xfs_off_t offset = page_offset(page);
+ struct address_space *map = inode->i_mapping;
+ xfs_off_t offset = page_cache_pos(map, page->index, 0);
if (page->index != tindex)
goto fail;
@@ -689,7 +690,7 @@ xfs_convert_page(
goto fail;
if (PageWriteback(page))
goto fail_unlock_page;
- if (page->mapping != inode->i_mapping)
+ if (page->mapping != map)
goto fail_unlock_page;
if (!xfs_is_delayed_page(page, (*ioendp)->io_type))
goto fail_unlock_page;
@@ -701,20 +702,20 @@ xfs_convert_page(
* Derivation:
*
* End offset is the highest offset that this page should represent.
- * If we are on the last page, (end_offset & (PAGE_CACHE_SIZE - 1))
- * will evaluate non-zero and be less than PAGE_CACHE_SIZE and
+ * If we are on the last page, (end_offset & page_cache_mask())
+ * will evaluate non-zero and be less than page_cache_size() and
* hence give us the correct page_dirty count. On any other page,
* it will be zero and in that case we need page_dirty to be the
* count of buffers on the page.
*/
end_offset = min_t(unsigned long long,
- (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT,
+ (xfs_off_t)(page->index + 1) << page_cache_shift(map),
i_size_read(inode));
len = 1 << inode->i_blkbits;
- p_offset = min_t(unsigned long, end_offset & (PAGE_CACHE_SIZE - 1),
- PAGE_CACHE_SIZE);
- p_offset = p_offset ? roundup(p_offset, len) : PAGE_CACHE_SIZE;
+ p_offset = min_t(unsigned long, page_cache_offset(map, end_offset),
+ page_cache_size(map));
+ p_offset = p_offset ? roundup(p_offset, len) : page_cache_size(map);
page_dirty = p_offset / len;
bh = head = page_buffers(page);
@@ -870,6 +871,7 @@ xfs_page_state_convert(
int page_dirty, count = 0;
int trylock = 0;
int all_bh = unmapped;
+ struct address_space *map = inode->i_mapping;
if (startio) {
if (wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking)
@@ -878,11 +880,11 @@ xfs_page_state_convert(
/* Is this page beyond the end of the file? */
offset = i_size_read(inode);
- end_index = offset >> PAGE_CACHE_SHIFT;
- last_index = (offset - 1) >> PAGE_CACHE_SHIFT;
+ end_index = page_cache_index(map, offset);
+ last_index = page_cache_index(map, (offset - 1));
if (page->index >= end_index) {
if ((page->index >= end_index + 1) ||
- !(i_size_read(inode) & (PAGE_CACHE_SIZE - 1))) {
+ !(page_cache_offset(map, i_size_read(inode)))) {
if (startio)
unlock_page(page);
return 0;
@@ -896,22 +898,23 @@ xfs_page_state_convert(
* Derivation:
*
* End offset is the highest offset that this page should represent.
- * If we are on the last page, (end_offset & (PAGE_CACHE_SIZE - 1))
- * will evaluate non-zero and be less than PAGE_CACHE_SIZE and
+ * If we are on the last page, page_cache_offset(map, end_offset)
+ * will evaluate non-zero and be less than page_cache_size(map) and
* hence give us the correct page_dirty count. On any other page,
* it will be zero and in that case we need page_dirty to be the
* count of buffers on the page.
*/
end_offset = min_t(unsigned long long,
- (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT, offset);
+ (xfs_off_t)(page->index + 1) << page_cache_shift(map),
+ offset);
len = 1 << inode->i_blkbits;
- p_offset = min_t(unsigned long, end_offset & (PAGE_CACHE_SIZE - 1),
- PAGE_CACHE_SIZE);
- p_offset = p_offset ? roundup(p_offset, len) : PAGE_CACHE_SIZE;
+ p_offset = min_t(unsigned long, page_cache_offset(map, end_offset),
+ page_cache_size(map));
+ p_offset = p_offset ? roundup(p_offset, len) : page_cache_size(map);
page_dirty = p_offset / len;
bh = head = page_buffers(page);
- offset = page_offset(page);
+ offset = page_cache_pos(map, page->index, 0);
flags = -1;
type = 0;
@@ -1040,7 +1043,7 @@ xfs_page_state_convert(
if (ioend && iomap_valid) {
offset = (iomap.iomap_offset + iomap.iomap_bsize - 1) >>
- PAGE_CACHE_SHIFT;
+ page_cache_shift(map);
tlast = min_t(pgoff_t, offset, last_index);
xfs_cluster_write(inode, page->index + 1, &iomap, &ioend,
wbc, startio, all_bh, tlast);
Index: linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_lrw.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/xfs/linux-2.6/xfs_lrw.c 2007-04-23 17:18:45.757201913 +1000
+++ linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_lrw.c 2007-04-23 19:03:40.780181351 +1000
@@ -143,9 +143,9 @@ xfs_iozero(
do {
unsigned long index, offset;
- offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
- index = pos >> PAGE_CACHE_SHIFT;
- bytes = PAGE_CACHE_SIZE - offset;
+ offset = page_cache_offset(mapping, pos); /* Within page */
+ index = page_cache_index(mapping, pos);
+ bytes = page_cache_size(mapping) - offset;
if (bytes > count)
bytes = count;
Index: linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/xfs/linux-2.6/xfs_file.c 2007-04-23 19:02:50.231476689 +1000
+++ linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_file.c 2007-04-23 19:03:16.283839882 +1000
@@ -469,6 +469,28 @@ const struct file_operations xfs_file_op
#endif
};
+const struct file_operations xfs_no_mmap_file_operations = {
+ .llseek = generic_file_llseek,
+ .read = do_sync_read,
+ .write = do_sync_write,
+ .aio_read = xfs_file_aio_read,
+ .aio_write = xfs_file_aio_write,
+ .sendfile = xfs_file_sendfile,
+ .splice_read = xfs_file_splice_read,
+ .splice_write = xfs_file_splice_write,
+ .unlocked_ioctl = xfs_file_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = xfs_file_compat_ioctl,
+#endif
+ .open = xfs_file_open,
+ .flush = xfs_file_close,
+ .release = xfs_file_release,
+ .fsync = xfs_file_fsync,
+#ifdef HAVE_FOP_OPEN_EXEC
+ .open_exec = xfs_file_open_exec,
+#endif
+};
+
const struct file_operations xfs_invis_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
Index: linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_iops.h
===================================================================
--- linux-2.6.21-rc7.orig/fs/xfs/linux-2.6/xfs_iops.h 2007-04-23 19:02:50.247476912 +1000
+++ linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_iops.h 2007-04-23 19:03:16.335840607 +1000
@@ -23,6 +23,7 @@ extern const struct inode_operations xfs
extern const struct inode_operations xfs_symlink_inode_operations;
extern const struct file_operations xfs_file_operations;
+extern const struct file_operations xfs_no_mmap_file_operations;
extern const struct file_operations xfs_dir_file_operations;
extern const struct file_operations xfs_invis_file_operations;
Index: linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/xfs/linux-2.6/xfs_super.c 2007-04-23 19:02:50.223476578 +1000
+++ linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_super.c 2007-04-23 19:03:16.315840329 +1000
@@ -125,8 +125,11 @@ xfs_set_inodeops(
{
switch (inode->i_mode & S_IFMT) {
case S_IFREG:
+ if (inode->i_mapping->order)
+ inode->i_fop = &xfs_no_mmap_file_operations;
+ else
+ inode->i_fop = &xfs_file_operations;
inode->i_op = &xfs_inode_operations;
- inode->i_fop = &xfs_file_operations;
inode->i_mapping->a_ops = &xfs_address_space_operations;
break;
case S_IFDIR:
Index: linux-2.6.21-rc7/fs/xfs/xfs_mount.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/xfs/xfs_mount.c 2007-04-23 19:02:50.215476466 +1000
+++ linux-2.6.21-rc7/fs/xfs/xfs_mount.c 2007-04-23 19:03:16.323840440 +1000
@@ -315,19 +315,6 @@ xfs_mount_validate_sb(
return XFS_ERROR(ENOSYS);
}
- /*
- * Until this is fixed only page-sized or smaller data blocks work.
- */
- if (unlikely(sbp->sb_blocksize > PAGE_SIZE)) {
- xfs_fs_mount_cmn_err(flags,
- "file system with blocksize %d bytes",
- sbp->sb_blocksize);
- xfs_fs_mount_cmn_err(flags,
- "only pagesize (%ld) or less will currently work.",
- PAGE_SIZE);
- return XFS_ERROR(ENOSYS);
- }
-
return 0;
}
--
* Re: [RFC 15/16] ext2: Add variable page size support
2007-04-23 6:50 ` [RFC 15/16] ext2: Add variable page size support Christoph Lameter
@ 2007-04-23 16:30 ` Badari Pulavarty
2007-04-24 1:11 ` Christoph Lameter
0 siblings, 1 reply; 40+ messages in thread
From: Badari Pulavarty @ 2007-04-23 16:30 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, William Lee Irwin III, Jens Axboe, David Chinner,
Adam Litke, Avi Kivity, Mel Gorman, Dave Hansen
On Sun, 2007-04-22 at 23:50 -0700, Christoph Lameter wrote:
> ext2: Add variable page size support
>
> This adds variable page size support. It is then possible to mount filesystems
> that have a larger blocksize than the page size.
>
> F.e. the following is possible on x86_64 and i386 that have only a 4k page
> size.
>
> mke2fs -b 16384 /dev/hdd2 <Ignore warning about too large block size>
>
> mount /dev/hdd2 /media
> ls -l /media
>
> .... Do more things with the volume that uses a 16k page cache size on
> a 4k page sized platform..
>
> Note that there are issues with ext2 support:
>
> 1. Data is not written back correctly (block layer?)
> 2. Reclaim does not work right.
Here is the fix you need to get ext2 writeback working properly :)
I am able to run fsx with this fix (without mapped IO).
Thanks,
Badari
fs/buffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6.21-rc7/fs/buffer.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/buffer.c 2007-04-23 09:44:19.000000000 -0700
+++ linux-2.6.21-rc7/fs/buffer.c 2007-04-23 10:28:45.000000000 -0700
@@ -1619,7 +1619,7 @@ static int __block_write_full_page(struc
* handle that here by just cleaning them.
*/
- block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ block = (sector_t)page->index << (page_cache_shift(page->mapping) - inode->i_blkbits);
head = page_buffers(page);
bh = head;
--
* Re: [RFC 15/16] ext2: Add variable page size support
2007-04-23 16:30 ` Badari Pulavarty
@ 2007-04-24 1:11 ` Christoph Lameter
0 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2007-04-24 1:11 UTC (permalink / raw)
To: Badari Pulavarty
Cc: linux-mm, William Lee Irwin III, Jens Axboe, David Chinner,
Adam Litke, Avi Kivity, Mel Gorman, Dave Hansen
On Mon, 23 Apr 2007, Badari Pulavarty wrote:
> Here is the fix you need to get ext2 writeback working properly :)
> I am able to run fsx with this fix (without mapped IO).
Yes it works! Great. Now if I just had an idea why reclaim does not work
and why the active pages vanish....
--
* Re: [RFC 01/16] Free up page->private for compound pages
2007-04-23 6:48 ` [RFC 01/16] Free up page->private for compound pages Christoph Lameter
@ 2007-04-24 2:12 ` Dave Hansen
2007-04-24 2:23 ` Christoph Lameter
2007-04-25 10:55 ` Mel Gorman
1 sibling, 1 reply; 40+ messages in thread
From: Dave Hansen @ 2007-04-24 2:12 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, William Lee Irwin III, Jens Axboe, David Chinner,
Badari Pulavarty, Adam Litke, Avi Kivity, Mel Gorman
On Sun, 2007-04-22 at 23:48 -0700, Christoph Lameter wrote:
> If we add a new flag so that we can distinguish between the
> first page and the tail pages then we can avoid to use page->private
> in the first page. page->private == page for the first page, so there
> is no real information in there.
But, there _is_ real information there. Without it, you need something
different to tell the head page from a tail page.
You're adding that something different, but you make it sound like that
information was completely superfluous before. Thus, this patch really
does have a cost: the dedication of one more page flag.
OK, so the end result is that we're freeing up page->private for the
head page of compound pages, but not _all_ of them, right? You might
want to make that a bit clearer in the patch description.
Can we be more clever about this, and not have to eat yet another page
flag?
static inline int page_is_tail(struct page *page)
{
struct page *possible_head_page;
if (!PageCompound(page))
return 0;
possible_head_page = (struct page *)page->private;
/* need to make sure this comes out unsigned: */
if ((page - possible_head_page) < MAX_ORDER_NR_PAGES)
return 1;
return 0;
}
The only thing we'd have to restrict was that pages couldn't be allowed
to have their ->private point to other things in the same max_order.
This could even be enforced in set_page_private().
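A minimal sketch of what that enforcement might look like (not part of the
patch; it relies on compound page setup assigning the tail ->private links
directly, as the prep_compound_page() hunk further down already does, so
that the legitimate tail-to-head pointers bypass the check):
static inline void set_page_private(struct page *page, unsigned long private)
{
        /*
         * Refuse values that alias a struct page shortly before this one
         * in mem_map; page_is_tail() above would otherwise misread the
         * page as a tail page.
         */
        VM_BUG_ON((unsigned long)(page - (struct page *)private) <
                                                MAX_ORDER_NR_PAGES);
        page->private = private;
}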
> static inline void get_page(struct page *page)
> {
> - if (unlikely(PageCompound(page)))
> - page = (struct page *)page_private(page);
> + page = compound_head(page);
> VM_BUG_ON(atomic_read(&page->_count) == 0);
> atomic_inc(&page->_count);
> }
> @@ -314,6 +317,23 @@ static inline compound_page_dtor *get_co
> return (compound_page_dtor *)page[1].lru.next;
> }
>
> +static inline int compound_order(struct page *page)
> +{
> + if (!PageCompound(page) || PageTail(page))
> + return 0;
> + return (unsigned long)page[1].lru.prev;
> +}
> +
> +static inline void set_compound_order(struct page *page, unsigned long order)
> +{
> + page[1].lru.prev = (void *)order;
> +}
> +
> +static inline int base_pages(struct page *page)
> +{
> + return 1 << compound_order(page);
> +}
Perhaps base_pages_in_compound(), instead?
> /*
> * Multiple processes may "see" the same page. E.g. for untouched
> * mappings of /dev/null, all processes see the same page full of
> Index: linux-2.6.21-rc7/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.21-rc7.orig/include/linux/page-flags.h 2007-04-21 20:52:07.000000000 -0700
> +++ linux-2.6.21-rc7/include/linux/page-flags.h 2007-04-21 20:52:15.000000000 -0700
> @@ -91,6 +91,8 @@
> #define PG_nosave_free 18 /* Used for system suspend/resume */
> #define PG_buddy 19 /* Page is free, on buddy lists */
>
> +#define PG_tail 20 /* Page is tail of a compound page */
> +
> /* PG_owner_priv_1 users should have descriptive aliases */
> #define PG_checked PG_owner_priv_1 /* Used by some filesystems */
>
> @@ -241,6 +243,10 @@ static inline void SetPageUptodate(struc
> #define __SetPageCompound(page) __set_bit(PG_compound, &(page)->flags)
> #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
>
> +#define PageTail(page) test_bit(PG_tail, &(page)->flags)
> +#define __SetPageTail(page) __set_bit(PG_tail, &(page)->flags)
> +#define __ClearPageTail(page) __clear_bit(PG_tail, &(page)->flags)
>
> #ifdef CONFIG_SWAP
> #define PageSwapCache(page) test_bit(PG_swapcache, &(page)->flags)
> #define SetPageSwapCache(page) set_bit(PG_swapcache, &(page)->flags)
> Index: linux-2.6.21-rc7/mm/internal.h
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/internal.h 2007-04-21 20:52:07.000000000 -0700
> +++ linux-2.6.21-rc7/mm/internal.h 2007-04-21 20:52:15.000000000 -0700
> @@ -24,7 +24,7 @@ static inline void set_page_count(struct
> */
> static inline void set_page_refcounted(struct page *page)
> {
> - VM_BUG_ON(PageCompound(page) && page_private(page) != (unsigned long)page);
> + VM_BUG_ON(PageTail(page));
> VM_BUG_ON(atomic_read(&page->_count));
> set_page_count(page, 1);
> }
> Index: linux-2.6.21-rc7/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/page_alloc.c 2007-04-21 20:52:07.000000000 -0700
> +++ linux-2.6.21-rc7/mm/page_alloc.c 2007-04-21 20:58:32.000000000 -0700
> @@ -227,7 +227,7 @@ static void bad_page(struct page *page)
>
> static void free_compound_page(struct page *page)
> {
> - __free_pages_ok(page, (unsigned long)page[1].lru.prev);
> + __free_pages_ok(page, compound_order(page));
> }
These substitutions are great, even outside of this patch set. Nice.
> static void prep_compound_page(struct page *page, unsigned long order)
> @@ -236,12 +236,14 @@ static void prep_compound_page(struct pa
> int nr_pages = 1 << order;
>
> set_compound_page_dtor(page, free_compound_page);
> - page[1].lru.prev = (void *)order;
> - for (i = 0; i < nr_pages; i++) {
> + set_compound_order(page, order);
> + __SetPageCompound(page);
> + for (i = 1; i < nr_pages; i++) {
> struct page *p = page + i;
>
> + __SetPageTail(p);
> __SetPageCompound(p);
> - set_page_private(p, (unsigned long)page);
> + p->private = (unsigned long)page;
> }
> }
>
> @@ -250,15 +252,19 @@ static void destroy_compound_page(struct
> int i;
> int nr_pages = 1 << order;
>
> - if (unlikely((unsigned long)page[1].lru.prev != order))
> + if (unlikely(compound_order(page) != order))
> bad_page(page);
>
> - for (i = 0; i < nr_pages; i++) {
> + if (unlikely(!PageCompound(page)))
> + bad_page(page);
> + __ClearPageCompound(page);
> + for (i = 1; i < nr_pages; i++) {
> struct page *p = page + i;
>
> - if (unlikely(!PageCompound(p) |
> - (page_private(p) != (unsigned long)page)))
> + if (unlikely(!PageCompound(p) | !PageTail(p) |
> + ((struct page *)p->private != page)))
Should there be a compound_page_head() function to get rid of these
open-coded references?
I guess it doesn't matter, but it might be nice to turn those binary |'s
into logical ||'s.
> bad_page(page);
> + __ClearPageTail(p);
> __ClearPageCompound(p);
> }
> }
> @@ -1438,8 +1444,17 @@ void __pagevec_free(struct pagevec *pvec
> {
> int i = pagevec_count(pvec);
>
> - while (--i >= 0)
> - free_hot_cold_page(pvec->pages[i], pvec->cold);
> + while (--i >= 0) {
> + struct page *page = pvec->pages[i];
> +
> + if (PageCompound(page)) {
> + compound_page_dtor *dtor;
> +
> + dtor = get_compound_page_dtor(page);
> + (*dtor)(page);
> + } else
> + free_hot_cold_page(page, pvec->cold);
> + }
> }
>
> fastcall void __free_pages(struct page *page, unsigned int order)
> Index: linux-2.6.21-rc7/mm/slab.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/slab.c 2007-04-21 20:52:07.000000000 -0700
> +++ linux-2.6.21-rc7/mm/slab.c 2007-04-21 20:52:15.000000000 -0700
> @@ -592,8 +592,7 @@ static inline void page_set_cache(struct
>
> static inline struct kmem_cache *page_get_cache(struct page *page)
> {
> - if (unlikely(PageCompound(page)))
> - page = (struct page *)page_private(page);
> + page = compound_head(page);
> BUG_ON(!PageSlab(page));
> return (struct kmem_cache *)page->lru.next;
> }
> @@ -605,8 +604,7 @@ static inline void page_set_slab(struct
>
> static inline struct slab *page_get_slab(struct page *page)
> {
> - if (unlikely(PageCompound(page)))
> - page = (struct page *)page_private(page);
> + page = compound_head(page);
> BUG_ON(!PageSlab(page));
> return (struct slab *)page->lru.prev;
> }
> Index: linux-2.6.21-rc7/mm/swap.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/swap.c 2007-04-21 20:52:07.000000000 -0700
> +++ linux-2.6.21-rc7/mm/swap.c 2007-04-21 21:02:59.000000000 -0700
> @@ -55,7 +55,7 @@ static void fastcall __page_cache_releas
>
> static void put_compound_page(struct page *page)
> {
> - page = (struct page *)page_private(page);
> + page = compound_head(page);
> if (put_page_testzero(page)) {
> compound_page_dtor *dtor;
>
> @@ -263,7 +263,23 @@ void release_pages(struct page **pages,
> for (i = 0; i < nr; i++) {
> struct page *page = pages[i];
>
> - if (unlikely(PageCompound(page))) {
> + /*
> + * There is a conflict here between handling a compound
> + * page as a single big page or a set of smaller pages.
> + *
> + * Direct I/O wants us to treat them separately. Variable
> + * Page Size support means we need to treat them as
> + * a single unit.
> + *
> + * So we compromise here. Tail pages are handled as a
> + * single page (for direct I/O) but head pages are
> + * handled as full pages (for Variable Page Size
> + * Support).
> + *
> + * FIXME: That breaks direct I/O for the head page.
> + */
> + if (unlikely(PageTail(page))) {
> + /* Must treat as a single page */
> if (zone) {
> spin_unlock_irq(&zone->lru_lock);
> zone = NULL;
> Index: linux-2.6.21-rc7/arch/ia64/mm/init.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/arch/ia64/mm/init.c 2007-04-21 20:52:07.000000000 -0700
> +++ linux-2.6.21-rc7/arch/ia64/mm/init.c 2007-04-21 20:52:15.000000000 -0700
> @@ -121,7 +121,7 @@ lazy_mmu_prot_update (pte_t pte)
> return; /* i-cache is already coherent with d-cache */
>
> if (PageCompound(page)) {
> - order = (unsigned long) (page[1].lru.prev);
> + order = compound_order(page);
> flush_icache_range(addr, addr + (1UL << order << PAGE_SHIFT));
> }
> else
-- Dave
--
* Re: [RFC 01/16] Free up page->private for compound pages
2007-04-24 2:12 ` Dave Hansen
@ 2007-04-24 2:23 ` Christoph Lameter
0 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2007-04-24 2:23 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-mm, William Lee Irwin III, Jens Axboe, David Chinner,
Badari Pulavarty, Adam Litke, Avi Kivity, Mel Gorman
On Mon, 23 Apr 2007, Dave Hansen wrote:
> OK, so the end result is that we're freeing up page->private for the
> head page of compound pages, but not _all_ of them, right? You might
> want to make that a bit clearer in the patch description.
Correct.
> Can we be more clever about this, and not have to eat yet another page
> flag?
Look at the recent compound changes in mm. That one does not eat a
page flag.
> > +static inline int base_pages(struct page *page)
> > +{
> > + return 1 << compound_order(page);
> > +}
>
> Perhaps base_pages_in_compound(), instead?
I renamed it to compound_page() for V3... But base_pages_in_compound is a
bit long.
> > static void free_compound_page(struct page *page)
> > {
> > - __free_pages_ok(page, (unsigned long)page[1].lru.prev);
> > + __free_pages_ok(page, compound_order(page));
> > }
>
> These substitutions are great, even outside of this patch set. Nice.
They are already in mm.
> > + for (i = 1; i < nr_pages; i++) {
> > struct page *p = page + i;
> >
> > - if (unlikely(!PageCompound(p) |
> > - (page_private(p) != (unsigned long)page)))
> > + if (unlikely(!PageCompound(p) | !PageTail(p) |
> > + ((struct page *)p->private != page)))
>
> Should there be a compound_page_head() function to get rid of these
> open-coded references?
There is in mm. This one is a fixup patch to get the patch to work against
upstream.
> I guess it doesn't matter, but it might be nice to turn those binary |'s
> into logical ||'s.
That would generate more branches. But then mm is different again.
--
* Re: [RFC 01/16] Free up page->private for compound pages
2007-04-23 6:48 ` [RFC 01/16] Free up page->private for compound pages Christoph Lameter
2007-04-24 2:12 ` Dave Hansen
@ 2007-04-25 10:55 ` Mel Gorman
1 sibling, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2007-04-25 10:55 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, William Lee Irwin III, Jens Axboe, David Chinner,
Badari Pulavarty, Adam Litke, Avi Kivity, Dave Hansen
On (22/04/07 23:48), Christoph Lameter didst pronounce:
> [PATCH] Free up page->private for compound pages
>
> If we add a new flag so that we can distinguish between the
> first page and the tail pages then we can avoid to use page->private
> in the first page. page->private == page for the first page, so there
> is no real information in there.
>
> Freeing up page->private makes the use of compound pages more transparent.
> They become more usable like real pages. Right now we have to be careful f.e.
> if we are going beyond PAGE_SIZE allocations in the slab on i386 because we
> can then no longer use the private field. This is one of the issues that
> cause us not to support debugging for page size slabs in SLAB.
>
> Also if page->private is available then a compound page may be equipped
> with buffer heads. This may free up the way for filesystems to support
> larger blocks than page size.
>
> Note that this patch is different from the one in mm. The one in mm
> reuses PG_reclaim as PG_tail. We cannot reuse PG_reclaim here since
> these pages can now be reclaimed. So use a separate page flag.
>
> We allow compound page heads on pagevecs. That will break
> Direct I/O because direct I/O needs pagevecs to handle the components
> but not the whole. Ideas for a solution welcome. Maybe we should
> modify the Direct I/O layer to not operate on the individual pages
> but on the compound page as a whole.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
> arch/ia64/mm/init.c | 2 +-
> include/linux/mm.h | 32 ++++++++++++++++++++++++++------
> include/linux/page-flags.h | 6 ++++++
> mm/internal.h | 2 +-
> mm/page_alloc.c | 35 +++++++++++++++++++++++++----------
> mm/slab.c | 6 ++----
> mm/swap.c | 20 ++++++++++++++++++--
> 7 files changed, 79 insertions(+), 24 deletions(-)
>
> Index: linux-2.6.21-rc7/include/linux/mm.h
> ===================================================================
> --- linux-2.6.21-rc7.orig/include/linux/mm.h 2007-04-21 20:52:07.000000000 -0700
> +++ linux-2.6.21-rc7/include/linux/mm.h 2007-04-21 20:58:32.000000000 -0700
> @@ -263,21 +263,24 @@ static inline int put_page_testzero(stru
> */
> static inline int get_page_unless_zero(struct page *page)
> {
> - VM_BUG_ON(PageCompound(page));
> return atomic_inc_not_zero(&page->_count);
> }
>
> +static inline struct page *compound_head(struct page *page)
> +{
> + if (unlikely(PageTail(page)))
> + return (struct page *)page->private;
> + return page;
> +}
> +
> static inline int page_count(struct page *page)
> {
> - if (unlikely(PageCompound(page)))
> - page = (struct page *)page_private(page);
> - return atomic_read(&page->_count);
> + return atomic_read(&compound_head(page)->_count);
> }
>
> static inline void get_page(struct page *page)
> {
> - if (unlikely(PageCompound(page)))
> - page = (struct page *)page_private(page);
> + page = compound_head(page);
> VM_BUG_ON(atomic_read(&page->_count) == 0);
> atomic_inc(&page->_count);
> }
> @@ -314,6 +317,23 @@ static inline compound_page_dtor *get_co
> return (compound_page_dtor *)page[1].lru.next;
> }
>
> +static inline int compound_order(struct page *page)
> +{
> + if (!PageCompound(page) || PageTail(page))
> + return 0;
> + return (unsigned long)page[1].lru.prev;
> +}
If it is a PageTail(page), should it not be something like
if (PageTail(page))
return (unsigned long)compound_head(page)[1].lru.prev;
(probably missing something stupid)
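A fuller version of that, for comparison (just a sketch reusing the
compound_head() helper added earlier in this patch; whether callers should
ever ask a tail page for its order is a separate question):
static inline int compound_order(struct page *page)
{
        if (!PageCompound(page))
                return 0;
        /* For a tail page, report the order of the compound page it
         * belongs to rather than 0. */
        return (unsigned long)compound_head(page)[1].lru.prev;
}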
> +
> +static inline void set_compound_order(struct page *page, unsigned long order)
> +{
> + page[1].lru.prev = (void *)order;
> +}
> +
> +static inline int base_pages(struct page *page)
> +{
> + return 1 << compound_order(page);
> +}
> +
> /*
> * Multiple processes may "see" the same page. E.g. for untouched
> * mappings of /dev/null, all processes see the same page full of
> Index: linux-2.6.21-rc7/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.21-rc7.orig/include/linux/page-flags.h 2007-04-21 20:52:07.000000000 -0700
> +++ linux-2.6.21-rc7/include/linux/page-flags.h 2007-04-21 20:52:15.000000000 -0700
> @@ -91,6 +91,8 @@
> #define PG_nosave_free 18 /* Used for system suspend/resume */
> #define PG_buddy 19 /* Page is free, on buddy lists */
>
> +#define PG_tail 20 /* Page is tail of a compound page */
> +
> /* PG_owner_priv_1 users should have descriptive aliases */
> #define PG_checked PG_owner_priv_1 /* Used by some filesystems */
>
> @@ -241,6 +243,10 @@ static inline void SetPageUptodate(struc
> #define __SetPageCompound(page) __set_bit(PG_compound, &(page)->flags)
> #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
>
> +#define PageTail(page) test_bit(PG_tail, &(page)->flags)
> +#define __SetPageTail(page) __set_bit(PG_tail, &(page)->flags)
> +#define __ClearPageTail(page) __clear_bit(PG_tail, &(page)->flags)
> +
> #ifdef CONFIG_SWAP
> #define PageSwapCache(page) test_bit(PG_swapcache, &(page)->flags)
> #define SetPageSwapCache(page) set_bit(PG_swapcache, &(page)->flags)
> Index: linux-2.6.21-rc7/mm/internal.h
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/internal.h 2007-04-21 20:52:07.000000000 -0700
> +++ linux-2.6.21-rc7/mm/internal.h 2007-04-21 20:52:15.000000000 -0700
> @@ -24,7 +24,7 @@ static inline void set_page_count(struct
> */
> static inline void set_page_refcounted(struct page *page)
> {
> - VM_BUG_ON(PageCompound(page) && page_private(page) != (unsigned long)page);
> + VM_BUG_ON(PageTail(page));
> VM_BUG_ON(atomic_read(&page->_count));
> set_page_count(page, 1);
> }
> Index: linux-2.6.21-rc7/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/page_alloc.c 2007-04-21 20:52:07.000000000 -0700
> +++ linux-2.6.21-rc7/mm/page_alloc.c 2007-04-21 20:58:32.000000000 -0700
> @@ -227,7 +227,7 @@ static void bad_page(struct page *page)
>
> static void free_compound_page(struct page *page)
> {
> - __free_pages_ok(page, (unsigned long)page[1].lru.prev);
> + __free_pages_ok(page, compound_order(page));
> }
>
> static void prep_compound_page(struct page *page, unsigned long order)
> @@ -236,12 +236,14 @@ static void prep_compound_page(struct pa
> int nr_pages = 1 << order;
>
> set_compound_page_dtor(page, free_compound_page);
> - page[1].lru.prev = (void *)order;
> - for (i = 0; i < nr_pages; i++) {
> + set_compound_order(page, order);
> + __SetPageCompound(page);
> + for (i = 1; i < nr_pages; i++) {
> struct page *p = page + i;
>
> + __SetPageTail(p);
> __SetPageCompound(p);
> - set_page_private(p, (unsigned long)page);
> + p->private = (unsigned long)page;
> }
> }
>
> @@ -250,15 +252,19 @@ static void destroy_compound_page(struct
> int i;
> int nr_pages = 1 << order;
>
> - if (unlikely((unsigned long)page[1].lru.prev != order))
> + if (unlikely(compound_order(page) != order))
> bad_page(page);
>
> - for (i = 0; i < nr_pages; i++) {
> + if (unlikely(!PageCompound(page)))
> + bad_page(page);
> + __ClearPageCompound(page);
> + for (i = 1; i < nr_pages; i++) {
> struct page *p = page + i;
>
> - if (unlikely(!PageCompound(p) |
> - (page_private(p) != (unsigned long)page)))
> + if (unlikely(!PageCompound(p) | !PageTail(p) |
> + ((struct page *)p->private != page)))
> bad_page(page);
> + __ClearPageTail(p);
> __ClearPageCompound(p);
> }
> }
> @@ -1438,8 +1444,17 @@ void __pagevec_free(struct pagevec *pvec
> {
> int i = pagevec_count(pvec);
>
> - while (--i >= 0)
> - free_hot_cold_page(pvec->pages[i], pvec->cold);
> + while (--i >= 0) {
> + struct page *page = pvec->pages[i];
> +
> + if (PageCompound(page)) {
> + compound_page_dtor *dtor;
> +
> + dtor = get_compound_page_dtor(page);
> + (*dtor)(page);
> + } else
> + free_hot_cold_page(page, pvec->cold);
> + }
> }
>
> fastcall void __free_pages(struct page *page, unsigned int order)
> Index: linux-2.6.21-rc7/mm/slab.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/slab.c 2007-04-21 20:52:07.000000000 -0700
> +++ linux-2.6.21-rc7/mm/slab.c 2007-04-21 20:52:15.000000000 -0700
> @@ -592,8 +592,7 @@ static inline void page_set_cache(struct
>
> static inline struct kmem_cache *page_get_cache(struct page *page)
> {
> - if (unlikely(PageCompound(page)))
> - page = (struct page *)page_private(page);
> + page = compound_head(page);
> BUG_ON(!PageSlab(page));
> return (struct kmem_cache *)page->lru.next;
> }
> @@ -605,8 +604,7 @@ static inline void page_set_slab(struct
>
> static inline struct slab *page_get_slab(struct page *page)
> {
> - if (unlikely(PageCompound(page)))
> - page = (struct page *)page_private(page);
> + page = compound_head(page);
> BUG_ON(!PageSlab(page));
> return (struct slab *)page->lru.prev;
> }
> Index: linux-2.6.21-rc7/mm/swap.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/swap.c 2007-04-21 20:52:07.000000000 -0700
> +++ linux-2.6.21-rc7/mm/swap.c 2007-04-21 21:02:59.000000000 -0700
> @@ -55,7 +55,7 @@ static void fastcall __page_cache_releas
>
> static void put_compound_page(struct page *page)
> {
> - page = (struct page *)page_private(page);
> + page = compound_head(page);
> if (put_page_testzero(page)) {
> compound_page_dtor *dtor;
>
> @@ -263,7 +263,23 @@ void release_pages(struct page **pages,
> for (i = 0; i < nr; i++) {
> struct page *page = pages[i];
>
> - if (unlikely(PageCompound(page))) {
> + /*
> + * There is a conflict here between handling a compound
> + * page as a single big page or a set of smaller pages.
> + *
> + * Direct I/O wants us to treat them separately. Variable
> + * Page Size support means we need to treat them as
> + * a single unit.
> + *
> + * So we compromise here. Tail pages are handled as a
> + * single page (for direct I/O) but head pages are
> + * handled as full pages (for Variable Page Size
> + * Support).
> + *
> + * FIXME: That breaks direct I/O for the head page.
> + */
> + if (unlikely(PageTail(page))) {
> + /* Must treat as a single page */
> if (zone) {
> spin_unlock_irq(&zone->lru_lock);
> zone = NULL;
> Index: linux-2.6.21-rc7/arch/ia64/mm/init.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/arch/ia64/mm/init.c 2007-04-21 20:52:07.000000000 -0700
> +++ linux-2.6.21-rc7/arch/ia64/mm/init.c 2007-04-21 20:52:15.000000000 -0700
> @@ -121,7 +121,7 @@ lazy_mmu_prot_update (pte_t pte)
> return; /* i-cache is already coherent with d-cache */
>
> if (PageCompound(page)) {
> - order = (unsigned long) (page[1].lru.prev);
> + order = compound_order(page);
> flush_icache_range(addr, addr + (1UL << order << PAGE_SHIFT));
> }
> else
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
* Re: [RFC 02/16] vmstat.c: Support accounting for compound pages
2007-04-23 6:48 ` [RFC 02/16] vmstat.c: Support accounting " Christoph Lameter
@ 2007-04-25 10:59 ` Mel Gorman
2007-04-25 15:43 ` Christoph Lameter
0 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2007-04-25 10:59 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Dave Hansen, Avi Kivity
On (22/04/07 23:48), Christoph Lameter didst pronounce:
> vmstat.c: Support accounting for compound pages
>
> Compound pages must increment the counters in terms of base pages.
> If we detect a compound page then add the number of base pages that
> a compound page has to the counter.
>
> This will avoid numerous changes in the VM to fix up page accounting
> as we add more support for compound pages.
>
> Also fix up the accounting for active / inactive pages.
>
Should this patch be split in two then? The active/inactive looks like
it's worth doing anyway
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
> include/linux/mm_inline.h | 12 ++++++------
> mm/vmstat.c | 8 +++-----
> 2 files changed, 9 insertions(+), 11 deletions(-)
>
> Index: linux-2.6.21-rc7/mm/vmstat.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/vmstat.c 2007-04-21 23:35:49.000000000 -0700
> +++ linux-2.6.21-rc7/mm/vmstat.c 2007-04-21 23:35:59.000000000 -0700
> @@ -223,7 +223,7 @@ void __inc_zone_state(struct zone *zone,
>
> void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
> {
> - __inc_zone_state(page_zone(page), item);
> + __mod_zone_page_state(page_zone(page), item, base_pages(page));
> }
> EXPORT_SYMBOL(__inc_zone_page_state);
>
> @@ -244,7 +244,7 @@ void __dec_zone_state(struct zone *zone,
>
> void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
> {
> - __dec_zone_state(page_zone(page), item);
> + __mod_zone_page_state(page_zone(page), item, -base_pages(page));
> }
> EXPORT_SYMBOL(__dec_zone_page_state);
>
> @@ -260,11 +260,9 @@ void inc_zone_state(struct zone *zone, e
> void inc_zone_page_state(struct page *page, enum zone_stat_item item)
> {
> unsigned long flags;
> - struct zone *zone;
>
> - zone = page_zone(page);
> local_irq_save(flags);
> - __inc_zone_state(zone, item);
> + __inc_zone_page_state(page, item);
> local_irq_restore(flags);
> }
> EXPORT_SYMBOL(inc_zone_page_state);
Everything after here looks like a standalone cleanup.
> Index: linux-2.6.21-rc7/include/linux/mm_inline.h
> ===================================================================
> --- linux-2.6.21-rc7.orig/include/linux/mm_inline.h 2007-04-22 00:20:15.000000000 -0700
> +++ linux-2.6.21-rc7/include/linux/mm_inline.h 2007-04-22 00:21:12.000000000 -0700
> @@ -2,28 +2,28 @@ static inline void
> add_page_to_active_list(struct zone *zone, struct page *page)
> {
> list_add(&page->lru, &zone->active_list);
> - __inc_zone_state(zone, NR_ACTIVE);
> + __inc_zone_page_state(page, NR_ACTIVE);
> }
>
> static inline void
> add_page_to_inactive_list(struct zone *zone, struct page *page)
> {
> list_add(&page->lru, &zone->inactive_list);
> - __inc_zone_state(zone, NR_INACTIVE);
> + __inc_zone_page_state(page, NR_INACTIVE);
> }
>
> static inline void
> del_page_from_active_list(struct zone *zone, struct page *page)
> {
> list_del(&page->lru);
> - __dec_zone_state(zone, NR_ACTIVE);
> + __dec_zone_page_state(page, NR_ACTIVE);
> }
>
> static inline void
> del_page_from_inactive_list(struct zone *zone, struct page *page)
> {
> list_del(&page->lru);
> - __dec_zone_state(zone, NR_INACTIVE);
> + __dec_zone_page_state(page, NR_INACTIVE);
> }
>
> static inline void
> @@ -32,9 +32,9 @@ del_page_from_lru(struct zone *zone, str
> list_del(&page->lru);
> if (PageActive(page)) {
> __ClearPageActive(page);
> - __dec_zone_state(zone, NR_ACTIVE);
> + __dec_zone_page_state(page, NR_ACTIVE);
> } else {
> - __dec_zone_state(zone, NR_INACTIVE);
> + __dec_zone_page_state(page, NR_INACTIVE);
> }
> }
>
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
* Re: [RFC 03/16] Variable Order Page Cache: Add order field in mapping
2007-04-23 6:49 ` [RFC 03/16] Variable Order Page Cache: Add order field in mapping Christoph Lameter
@ 2007-04-25 11:05 ` Mel Gorman
0 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2007-04-25 11:05 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, William Lee Irwin III, Jens Axboe, David Chinner,
Badari Pulavarty, Adam Litke, Avi Kivity, Dave Hansen
On (22/04/07 23:49), Christoph Lameter didst pronounce:
> Variable Order Page Cache: Add order field in mapping
>
> Add an "order" field in the address space structure that
> specifies the page order of pages in an address space.
>
> Set the field to zero by default so that filesystems not prepared to
> deal with higher pages can be left as is.
>
> Putting page order in the address space structure means that the order of the
> pages in the page cache can be varied per file that a filesystem creates.
> This means we can keep small 4k pages for small files. Larger files can
> be configured by the file system to use a higher order.
It may be desirable later to record when a filesystem does that so that bugs
related to compound-page-in-page-cache stand out a bit more.
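For illustration, the per-file choice itself could be as simple as the
following (a made-up sketch, not from the patchset; the helper name and the
size threshold are arbitrary):
/* Keep 4k pages for small files, switch big files to 64k units. */
static void example_pick_mapping_order(struct inode *inode, loff_t expected_size)
{
        unsigned int order = 0;

        if (expected_size >= 16 * 1024 * 1024)
                order = 4;              /* 64k units on a 4k base page */

        inode->i_mapping->order = order;
}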
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
> fs/inode.c | 1 +
> include/linux/fs.h | 1 +
> 2 files changed, 2 insertions(+)
>
> Index: linux-2.6.21-rc7/fs/inode.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/fs/inode.c 2007-04-18 21:21:56.000000000 -0700
> +++ linux-2.6.21-rc7/fs/inode.c 2007-04-18 21:26:31.000000000 -0700
> @@ -145,6 +145,7 @@ static struct inode *alloc_inode(struct
> mapping->a_ops = &empty_aops;
> mapping->host = inode;
> mapping->flags = 0;
> + mapping->order = 0;
> mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
Just as a heads-up, grouping pages by mobility changes the
mapping_set_gfp_mask() flag so you may run into merge conflicts there. It
might make life easier if you set the order earlier so that it merges with
fuzz instead of going blamo. It's functionally identical.
> mapping->assoc_mapping = NULL;
> mapping->backing_dev_info = &default_backing_dev_info;
> Index: linux-2.6.21-rc7/include/linux/fs.h
> ===================================================================
> --- linux-2.6.21-rc7.orig/include/linux/fs.h 2007-04-18 21:21:56.000000000 -0700
> +++ linux-2.6.21-rc7/include/linux/fs.h 2007-04-18 21:26:31.000000000 -0700
> @@ -435,6 +435,7 @@ struct address_space {
> struct inode *host; /* owner: inode, block_device */
> struct radix_tree_root page_tree; /* radix tree of all pages */
> rwlock_t tree_lock; /* and rwlock protecting it */
> + unsigned int order; /* Page order in this space */
> unsigned int i_mmap_writable;/* count VM_SHARED mappings */
> struct prio_tree_root i_mmap; /* tree of private and shared mappings */
> struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
* Re: [RFC 05/16] Variable Order Page Cache: Add functions to establish sizes
2007-04-23 6:49 ` [RFC 05/16] Variable Order Page Cache: Add functions to establish sizes Christoph Lameter
@ 2007-04-25 11:20 ` Mel Gorman
2007-04-25 15:54 ` Christoph Lameter
0 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2007-04-25 11:20 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, William Lee Irwin III, Jens Axboe, David Chinner,
Badari Pulavarty, Adam Litke, Avi Kivity, Dave Hansen
On (22/04/07 23:49), Christoph Lameter didst pronounce:
> Variable Order Page Cache: Add functions to establish sizes
>
> We use the macros PAGE_CACHE_SIZE PAGE_CACHE_SHIFT PAGE_CACHE_MASK
> and PAGE_CACHE_ALIGN in various places in the kernel. These are now
> the base page size but we do not have a means to calculating these
> values for higher order pages.
>
> Provide these functions. An address_space pointer must be passed
> to them. Also add a set of extended functions that will be used
> to consolidate the hand crafted shifts and adds in use right
> now for the page cache.
>
> New function Related base page constant
> ---------------------------------------------------
> page_cache_shift(a) PAGE_CACHE_SHIFT
> page_cache_size(a) PAGE_CACHE_SIZE
> page_cache_mask(a) PAGE_CACHE_MASK
> page_cache_index(a, pos) Calculate page number from position
> page_cache_next(a, pos) Page number of next page
> page_cache_offset(a, pos) Calculate offset into a page
> page_cache_pos(a, index, offset)
> Form position based on page number
> and an offset.
These all need comments in the source, particularly page_cache_index() so
that it is clear that the index is "number of compound pages", not number
of base pages. With the name as-is, it could be either. page_cache_offset()
requires similar mental gymnastics to understand without some sort of comment.
The comments will help break people away from page == base page mental
models.
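For instance, the intended semantics (assuming a 4k base page and a mapping
with order == 2, i.e. 16k page cache units) work out as:

        /* pos 40960 lands in the third 16k unit, 8k into it */
        page_cache_size(mapping)          == 16384
        page_cache_index(mapping, 40960)  == 2      /* counts 16k units */
        page_cache_offset(mapping, 40960) == 8192   /* offset inside that unit */
        page_cache_pos(mapping, 2, 8192)  == 40960

Spelling that out once next to the helpers would save a lot of head
scratching later.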
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
> include/linux/pagemap.h | 42 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 42 insertions(+)
>
> Index: linux-2.6.21-rc7/include/linux/pagemap.h
> ===================================================================
> --- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-22 17:30:50.000000000 -0700
> +++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-22 19:44:12.000000000 -0700
> @@ -62,6 +62,48 @@ static inline void set_mapping_order(str
> #define PAGE_CACHE_MASK PAGE_MASK
> #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
>
> +static inline int page_cache_shift(struct address_space *a)
> +{
> + return a->order + PAGE_SHIFT;
> +}
> +
> +static inline unsigned int page_cache_size(struct address_space *a)
> +{
> + return PAGE_SIZE << a->order;
> +}
> +
> +static inline loff_t page_cache_mask(struct address_space *a)
> +{
> + return (loff_t)PAGE_MASK << a->order;
> +}
> +
> +static inline unsigned int page_cache_offset(struct address_space *a,
> + loff_t pos)
> +{
> + return pos & ~(PAGE_MASK << a->order);
> +}
> +
> +static inline pgoff_t page_cache_index(struct address_space *a,
> + loff_t pos)
> +{
> + return pos >> page_cache_shift(a);
> +}
Like that needs peering at without a comment.
> +
> +/*
> + * Index of the page starting on or after the given position.
> + */
> +static inline pgoff_t page_cache_next(struct address_space *a,
> + loff_t pos)
> +{
> + return page_cache_index(a, pos + page_cache_size(a) - 1);
> +}
> +
Would help if "Index of the page" read as "Index of the compound page" with
an additional note saying that the compound page size will be a base page
in the majority of cases. Otherwise, someone unfamiliar with this idea will
wonder what's wrong with page++.
> +static inline loff_t page_cache_pos(struct address_space *a,
> + pgoff_t index, unsigned long offset)
> +{
> + return ((loff_t)index << page_cache_shift(a)) + offset;
> +}
> +
> #define page_cache_get(page) get_page(page)
> #define page_cache_release(page) put_page(page)
> void release_pages(struct page **pages, int nr, int cold);
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
* Re: [RFC 06/16] Variable Page Cache: Add VM_BUG_ONs to check for correct page order
2007-04-23 6:49 ` [RFC 06/16] Variable Page Cache: Add VM_BUG_ONs to check for correct page order Christoph Lameter
@ 2007-04-25 11:22 ` Mel Gorman
0 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2007-04-25 11:22 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Dave Hansen, Avi Kivity
On (22/04/07 23:49), Christoph Lameter didst pronounce:
> Variable Page Cache: Add VM_BUG_ONs to check for correct page order
>
> Before we start changing the page order we better get some debugging
> in there that trips us up whenever a wrong order page shows up in a
> mapping. This will be helpful for converting new filesystems to
> utilize higher orders.
>
Oops, ignore earlier comments about flagging bugs related to compound
pages differently. This patch looks like it'll catch many of the
mistakes.
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
> mm/filemap.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> Index: linux-2.6.21-rc7/mm/filemap.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-22 21:54:00.000000000 -0700
> +++ linux-2.6.21-rc7/mm/filemap.c 2007-04-22 21:59:15.000000000 -0700
> @@ -127,6 +127,7 @@ void remove_from_page_cache(struct page
> struct address_space *mapping = page->mapping;
>
> BUG_ON(!PageLocked(page));
> + VM_BUG_ON(mapping->order != compound_order(page));
>
> write_lock_irq(&mapping->tree_lock);
> __remove_from_page_cache(page);
> @@ -268,6 +269,7 @@ int wait_on_page_writeback_range(struct
> if (page->index > end)
> continue;
>
> + VM_BUG_ON(mapping->order != compound_order(page));
> wait_on_page_writeback(page);
> if (PageError(page))
> ret = -EIO;
> @@ -439,6 +441,7 @@ int add_to_page_cache(struct page *page,
> {
> int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
>
> + VM_BUG_ON(mapping->order != compound_order(page));
> if (error == 0) {
> write_lock_irq(&mapping->tree_lock);
> error = radix_tree_insert(&mapping->page_tree, offset, page);
> @@ -598,8 +601,10 @@ struct page * find_get_page(struct addre
>
> read_lock_irq(&mapping->tree_lock);
> page = radix_tree_lookup(&mapping->page_tree, offset);
> - if (page)
> + if (page) {
> + VM_BUG_ON(mapping->order != compound_order(page));
> page_cache_get(page);
> + }
> read_unlock_irq(&mapping->tree_lock);
> return page;
> }
> @@ -624,6 +629,7 @@ struct page *find_lock_page(struct addre
> repeat:
> page = radix_tree_lookup(&mapping->page_tree, offset);
> if (page) {
> + VM_BUG_ON(mapping->order != compound_order(page));
> page_cache_get(page);
> if (TestSetPageLocked(page)) {
> read_unlock_irq(&mapping->tree_lock);
> @@ -683,6 +689,7 @@ repeat:
> } else if (err == -EEXIST)
> goto repeat;
> }
> + VM_BUG_ON(mapping->order != compound_order(page));
> if (cached_page)
> page_cache_release(cached_page);
> return page;
> @@ -714,8 +721,10 @@ unsigned find_get_pages(struct address_s
> read_lock_irq(&mapping->tree_lock);
> ret = radix_tree_gang_lookup(&mapping->page_tree,
> (void **)pages, start, nr_pages);
> - for (i = 0; i < ret; i++)
> + for (i = 0; i < ret; i++) {
> + VM_BUG_ON(mapping->order != compound_order(pages[i]));
> page_cache_get(pages[i]);
> + }
> read_unlock_irq(&mapping->tree_lock);
> return ret;
> }
> @@ -745,6 +754,7 @@ unsigned find_get_pages_contig(struct ad
> if (pages[i]->mapping == NULL || pages[i]->index != index)
> break;
>
> + VM_BUG_ON(mapping->order != compound_order(pages[i]));
> page_cache_get(pages[i]);
> index++;
> }
> @@ -772,8 +782,10 @@ unsigned find_get_pages_tag(struct addre
> read_lock_irq(&mapping->tree_lock);
> ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
> (void **)pages, *index, nr_pages, tag);
> - for (i = 0; i < ret; i++)
> + for (i = 0; i < ret; i++) {
> + VM_BUG_ON(mapping->order != compound_order(pages[i]));
> page_cache_get(pages[i]);
> + }
> if (ret)
> *index = pages[ret - 1]->index + 1;
> read_unlock_irq(&mapping->tree_lock);
> @@ -2454,6 +2466,7 @@ int try_to_release_page(struct page *pag
> struct address_space * const mapping = page->mapping;
>
> BUG_ON(!PageLocked(page));
> + VM_BUG_ON(mapping->order != compound_order(page));
> if (PageWriteback(page))
> return 0;
>
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
* Re: [RFC 10/16] Variable Order Page Cache: Readahead fixups
2007-04-23 6:49 ` [RFC 10/16] Variable Order Page Cache: Readahead fixups Christoph Lameter
@ 2007-04-25 11:36 ` Mel Gorman
2007-04-25 15:56 ` Christoph Lameter
0 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2007-04-25 11:36 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Dave Hansen, Avi Kivity
On (22/04/07 23:49), Christoph Lameter didst pronounce:
> Variable Order Page Cache: Readahead fixups
>
> Readahead is now dependent on the page size. For larger page sizes
> we want less readahead.
>
> Add a parameter to max_sane_readahead specifying the page order
> and update the code in mm/readahead.c to be aware of variant
> page sizes.
>
> Mark the 2M readahead constant as a potential future problem.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
> include/linux/mm.h | 2 +-
> mm/fadvise.c | 5 +++--
> mm/filemap.c | 5 +++--
> mm/madvise.c | 4 +++-
> mm/readahead.c | 20 +++++++++++++-------
> 5 files changed, 23 insertions(+), 13 deletions(-)
>
> Index: linux-2.6.21-rc7/include/linux/mm.h
> ===================================================================
> --- linux-2.6.21-rc7.orig/include/linux/mm.h 2007-04-22 21:48:22.000000000 -0700
> +++ linux-2.6.21-rc7/include/linux/mm.h 2007-04-22 22:04:44.000000000 -0700
> @@ -1104,7 +1104,7 @@ unsigned long page_cache_readahead(struc
> unsigned long size);
> void handle_ra_miss(struct address_space *mapping,
> struct file_ra_state *ra, pgoff_t offset);
> -unsigned long max_sane_readahead(unsigned long nr);
> +unsigned long max_sane_readahead(unsigned long nr, int order);
>
> /* Do stack extension */
> extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
> Index: linux-2.6.21-rc7/mm/fadvise.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/fadvise.c 2007-04-22 21:47:41.000000000 -0700
> +++ linux-2.6.21-rc7/mm/fadvise.c 2007-04-22 22:04:44.000000000 -0700
> @@ -86,10 +86,11 @@ asmlinkage long sys_fadvise64_64(int fd,
> nrpages = end_index - start_index + 1;
> if (!nrpages)
> nrpages = ~0UL;
> -
> +
Whitespace mangling. Your update is right, but maybe not the patch for
it.
> ret = force_page_cache_readahead(mapping, file,
> start_index,
> - max_sane_readahead(nrpages));
> + max_sane_readahead(nrpages,
> + mapping->order));
> if (ret > 0)
> ret = 0;
> break;
> Index: linux-2.6.21-rc7/mm/filemap.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-22 22:03:09.000000000 -0700
> +++ linux-2.6.21-rc7/mm/filemap.c 2007-04-22 22:04:44.000000000 -0700
> @@ -1256,7 +1256,7 @@ do_readahead(struct address_space *mappi
> return -EINVAL;
>
> force_page_cache_readahead(mapping, filp, index,
> - max_sane_readahead(nr));
> + max_sane_readahead(nr, mapping->order));
> return 0;
> }
>
> @@ -1391,7 +1391,8 @@ retry_find:
> count_vm_event(PGMAJFAULT);
> }
> did_readaround = 1;
> - ra_pages = max_sane_readahead(file->f_ra.ra_pages);
> + ra_pages = max_sane_readahead(file->f_ra.ra_pages,
> + mapping->order);
> if (ra_pages) {
> pgoff_t start = 0;
>
> Index: linux-2.6.21-rc7/mm/madvise.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/madvise.c 2007-04-22 21:47:41.000000000 -0700
> +++ linux-2.6.21-rc7/mm/madvise.c 2007-04-22 22:04:44.000000000 -0700
> @@ -105,7 +105,9 @@ static long madvise_willneed(struct vm_a
> end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
>
> force_page_cache_readahead(file->f_mapping,
> - file, start, max_sane_readahead(end - start));
> + file, start,
> + max_sane_readahead(end - start,
> + file->f_mapping->order));
> return 0;
> }
>
> Index: linux-2.6.21-rc7/mm/readahead.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/readahead.c 2007-04-22 21:47:41.000000000 -0700
> +++ linux-2.6.21-rc7/mm/readahead.c 2007-04-22 22:06:47.000000000 -0700
> @@ -152,7 +152,7 @@ int read_cache_pages(struct address_spac
> put_pages_list(pages);
> break;
> }
> - task_io_account_read(PAGE_CACHE_SIZE);
> + task_io_account_read(page_cache_size(mapping));
> }
> pagevec_lru_add(&lru_pvec);
> return ret;
> @@ -276,7 +276,7 @@ __do_page_cache_readahead(struct address
> if (isize == 0)
> goto out;
>
> - end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
> + end_index = page_cache_index(mapping, isize - 1);
>
> /*
> * Preallocate as many pages as we will need.
> @@ -330,7 +330,11 @@ int force_page_cache_readahead(struct ad
> while (nr_to_read) {
> int err;
>
> - unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE;
> + /*
> + * FIXME: Note the 2M constant here that may prove to
> + * be a problem if page sizes become bigger than one megabyte.
> + */
> + unsigned long this_chunk = page_cache_index(mapping, 2 * 1024 * 1024);
>
Should readahead just be disabled when the compound page size is as
large or larger than what readahead normally reads?
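For reference, the arithmetic behind that concern (assuming a 4k base page):

        /* this_chunk = page_cache_index(mapping, 2 * 1024 * 1024)
         *            = (2 << 20) >> (PAGE_SHIFT + mapping->order)
         *
         * order 0  ( 4k units): 512 units per chunk (the old behaviour)
         * order 9  ( 2M units):   1 unit per chunk
         * order 10 ( 4M units):   0 units per chunk -- at which point the
         *                         surrounding loop makes no forward progress
         */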
> if (this_chunk > nr_to_read)
> this_chunk = nr_to_read;
> @@ -570,11 +574,13 @@ void handle_ra_miss(struct address_space
> }
>
> /*
> - * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> + * Given a desired number of page order readahead pages, return a
> * sensible upper limit.
> */
> -unsigned long max_sane_readahead(unsigned long nr)
> +unsigned long max_sane_readahead(unsigned long nr, int order)
> {
> - return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE)
> - + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> + unsigned long base_pages = node_page_state(numa_node_id(), NR_INACTIVE)
> + + node_page_state(numa_node_id(), NR_FREE_PAGES);
> +
> + return min(nr, (base_pages / 2) >> order);
> }
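A quick worked example of the new bound (node totals made up): with 2^20
base pages inactive plus free and mapping->order == 2,

        max_sane_readahead(~0UL, 2) = ((1 << 20) / 2) >> 2 = 131072

16k compound pages, i.e. the same 2GB worth of memory the order-0 code
allowed, just counted in larger units.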
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
* Re: [RFC 11/16] Variable Page Cache Size: Fix up reclaim counters
2007-04-23 6:49 ` [RFC 11/16] Variable Page Cache Size: Fix up reclaim counters Christoph Lameter
@ 2007-04-25 13:08 ` Mel Gorman
0 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2007-04-25 13:08 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, William Lee Irwin III, Jens Axboe, David Chinner,
Badari Pulavarty, Adam Litke, Avi Kivity, Dave Hansen
On (22/04/07 23:49), Christoph Lameter didst pronounce:
> Variable Page Cache Size: Fix up reclaim counters
>
> We can now reclaim larger pages. Adjust the VM counters
> to deal with it.
>
> Note that this currently does not make things work.
> For some reason we keep losing pages off the active lists,
> and reclaim stalls at some point attempting to remove
> active pages from an empty active list.
> It seems that the removal from the active lists happens
> outside of reclaim ?!?
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
> mm/vmscan.c | 15 ++++++++-------
> 1 file changed, 8 insertions(+), 7 deletions(-)
>
> Index: linux-2.6.21-rc7/mm/vmscan.c
> ===================================================================
> --- linux-2.6.21-rc7.orig/mm/vmscan.c 2007-04-22 06:50:03.000000000 -0700
> +++ linux-2.6.21-rc7/mm/vmscan.c 2007-04-22 17:19:35.000000000 -0700
> @@ -471,14 +471,14 @@ static unsigned long shrink_page_list(st
>
> VM_BUG_ON(PageActive(page));
>
> - sc->nr_scanned++;
> + sc->nr_scanned += base_pages(page);
>
> if (!sc->may_swap && page_mapped(page))
> goto keep_locked;
>
> /* Double the slab pressure for mapped and swapcache pages */
> if (page_mapped(page) || PageSwapCache(page))
> - sc->nr_scanned++;
> + sc->nr_scanned += base_pages(page);
>
> if (PageWriteback(page))
> goto keep_locked;
> @@ -581,7 +581,7 @@ static unsigned long shrink_page_list(st
>
> free_it:
> unlock_page(page);
> - nr_reclaimed++;
> + nr_reclaimed += base_pages(page);
> if (!pagevec_add(&freed_pvec, page))
> __pagevec_release_nonlru(&freed_pvec);
> continue;
> @@ -627,7 +627,7 @@ static unsigned long isolate_lru_pages(u
> struct page *page;
> unsigned long scan;
>
> - for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> + for (scan = 0; scan < nr_to_scan && !list_empty(src); ) {
> struct list_head *target;
> page = lru_to_page(src);
> prefetchw_prev_lru_page(page, src, flags);
> @@ -644,10 +644,11 @@ static unsigned long isolate_lru_pages(u
> */
> ClearPageLRU(page);
> target = dst;
> - nr_taken++;
> + nr_taken += base_pages(page);
> } /* else it is being freed elsewhere */
>
> list_add(&page->lru, target);
> + scan += base_pages(page);
> }
Be careful here when lumpy reclaim is also involved. By moving the scan
increment out of the for() statement and down to the bottom of the loop
body, the increment can be skipped whenever a continue path is taken (a
short sketch of the hazard follows). Just watch out for it.
I am of two minds on whether we should be counting base pages or not, but
it's probably best from an I/O perspective.
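A minimal sketch, assuming a hypothetical continue path of the kind lumpy
reclaim would add (base_pages() is the helper from this patchset):

	for (scan = 0; scan < nr_to_scan && !list_empty(src); ) {
		struct page *page = lru_to_page(src);

		/* account for the page up front so no continue can skip it */
		scan += base_pages(page);

		if (unlikely(!PageLRU(page)))
			continue;	/* would never advance scan if the
					   increment sat at the loop bottom */

		/* ... isolate the page as in the patch ... */
	}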
>
> *scanned = scan;
> @@ -856,7 +857,7 @@ force_reclaim_mapped:
> ClearPageActive(page);
>
> list_move(&page->lru, &zone->inactive_list);
> - pgmoved++;
> + pgmoved += base_pages(page);
> if (!pagevec_add(&pvec, page)) {
> __mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
> spin_unlock_irq(&zone->lru_lock);
> @@ -884,7 +885,7 @@ force_reclaim_mapped:
> SetPageLRU(page);
> VM_BUG_ON(!PageActive(page));
> list_move(&page->lru, &zone->active_list);
> - pgmoved++;
> + pgmoved += base_pages(page);
> if (!pagevec_add(&pvec, page)) {
> __mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
> pgmoved = 0;
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [RFC 16/16] Variable Order Page Cache: Alternate implementation of page cache macros
2007-04-23 6:50 ` [RFC 16/16] Variable Order Page Cache: Alternate implementation of page cache macros Christoph Lameter
@ 2007-04-25 13:16 ` Mel Gorman
0 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2007-04-25 13:16 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Dave Hansen, Avi Kivity
On (22/04/07 23:50), Christoph Lameter didst pronounce:
> Variable Order Page Cache: Alternate implementation of page cache macros
>
> Implement the page cache macros in a more efficient way by storing key
> values in the mapping. This reduces code size but increases inode size.
>
Considering the hilarity with large inode-related caches and updatedb, it
may be best to keep the inode size down for the moment and do a performance
comparison later to see if anything is gained by the reduced code size.
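For scale: the quoted hunk below adds an unsigned int and a loff_t to
struct address_space, i.e. roughly 12 bytes of new fields per cached inode
before padding, which is the per-inode growth being weighed here.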
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
> include/linux/fs.h | 4 +++-
> include/linux/pagemap.h | 13 +++++++------
> 2 files changed, 10 insertions(+), 7 deletions(-)
>
> Index: linux-2.6.21-rc7/include/linux/fs.h
> ===================================================================
> --- linux-2.6.21-rc7.orig/include/linux/fs.h 2007-04-22 19:43:01.000000000 -0700
> +++ linux-2.6.21-rc7/include/linux/fs.h 2007-04-22 19:44:29.000000000 -0700
> @@ -435,7 +435,9 @@ struct address_space {
> struct inode *host; /* owner: inode, block_device */
> struct radix_tree_root page_tree; /* radix tree of all pages */
> rwlock_t tree_lock; /* and rwlock protecting it */
> - unsigned int order; /* Page order in this space */
> + unsigned int shift; /* Shift to get to the page number */
> + unsigned int order; /* Page order for allocations */
> + loff_t offset_mask; /* To mask out offset in page */
> unsigned int i_mmap_writable;/* count VM_SHARED mappings */
> struct prio_tree_root i_mmap; /* tree of private and shared mappings */
> struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
> Index: linux-2.6.21-rc7/include/linux/pagemap.h
> ===================================================================
> --- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-22 19:44:16.000000000 -0700
> +++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-22 19:46:23.000000000 -0700
> @@ -42,7 +42,8 @@ static inline void mapping_set_gfp_mask(
> static inline void set_mapping_order(struct address_space *m, int order)
> {
> m->order = order;
> -
> + m->shift = order + PAGE_SHIFT;
> + m->offset_mask = (1UL << m->shift) -1;
> if (order)
> m->flags |= __GFP_COMP;
> else
> @@ -64,23 +65,23 @@ static inline void set_mapping_order(str
>
> static inline int page_cache_shift(struct address_space *a)
> {
> - return a->order + PAGE_SHIFT;
> + return a->shift;
> }
>
> static inline unsigned int page_cache_size(struct address_space *a)
> {
> - return PAGE_SIZE << a->order;
> + return a->offset_mask + 1;
> }
>
> static inline loff_t page_cache_mask(struct address_space *a)
> {
> - return (loff_t)PAGE_MASK << a->order;
> + return ~(loff_t)a->offset_mask;
> }
>
> static inline unsigned int page_cache_offset(struct address_space *a,
> loff_t pos)
> {
> - return pos & ~(PAGE_MASK << a->order);
> + return pos & a->offset_mask;
> }
>
> static inline pgoff_t page_cache_index(struct address_space *a,
> @@ -95,7 +96,7 @@ static inline pgoff_t page_cache_index(s
> static inline pgoff_t page_cache_next(struct address_space *a,
> loff_t pos)
> {
> - return page_cache_index(a, pos + page_cache_size(a) - 1);
> + return page_cache_index(a, pos + a->offset_mask);
> }
>
> static inline loff_t page_cache_pos(struct address_space *a,
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [RFC 02/16] vmstat.c: Support accounting for compound pages
2007-04-25 10:59 ` Mel Gorman
@ 2007-04-25 15:43 ` Christoph Lameter
0 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2007-04-25 15:43 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Dave Hansen, Avi Kivity
On Wed, 25 Apr 2007, Mel Gorman wrote:
> > This will avoid numerous changes in the VM to fix up page accounting
> > as we add more support for compound pages.
> >
> > Also fix up the accounting for active / inactive pages.
> Should this patch be split in two then? The active/inactive looks like
> it's worth doing anyway
We could split it but both pieces are only necessary for higher order
compound pages on the LRU.
> > EXPORT_SYMBOL(inc_zone_page_state);
>
> Everything after here looks like a standalone cleanup.
It's not, sorry. __inc_zone_page_state() has a bit more overhead than
__inc_zone_state(): it needs to determine the zone again. Maybe we need to
create a __inc_zone_compound_state() or so that does not repeat the zone
determination.
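For illustration, a minimal sketch of such a helper (the name is
hypothetical, and base_pages() is the helper introduced earlier in this
patchset; the zone is looked up once and the whole compound count added):

	static inline void __inc_zone_compound_state(struct page *page,
						enum zone_stat_item item)
	{
		__mod_zone_page_state(page_zone(page), item, base_pages(page));
	}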
* Re: [RFC 05/16] Variable Order Page Cache: Add functions to establish sizes
2007-04-25 11:20 ` Mel Gorman
@ 2007-04-25 15:54 ` Christoph Lameter
0 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2007-04-25 15:54 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, William Lee Irwin III, Jens Axboe, David Chinner,
Badari Pulavarty, Adam Litke, Avi Kivity, Dave Hansen
On Wed, 25 Apr 2007, Mel Gorman wrote:
> These all need comments in the source, particularly page_cache_index() so
> that it is clear that the index is "number of compound pages", not number
> of base pages. With the name as-is, it could be either. page_cache_offset()
> requires similar mental gymnastics to understand without some sort of comment.
I added some comments explaining it for V4.
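Something along these lines would do it (a sketch of the intended comment,
assuming the mapping-based macros from patch 05/16):

	/*
	 * page_cache_index() converts a byte position into an index in units
	 * of the mapping's compound pages, not base pages, and
	 * page_cache_offset() returns the byte offset inside that compound
	 * page.  E.g. with 4k base pages and order 2 (16k page cache pages),
	 * pos 40000 yields index 2 and offset 7232.
	 */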
* Re: [RFC 10/16] Variable Order Page Cache: Readahead fixups
2007-04-25 11:36 ` Mel Gorman
@ 2007-04-25 15:56 ` Christoph Lameter
[not found] ` <20070521104204.GA8795@mail.ustc.edu.cn>
0 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2007-04-25 15:56 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, William Lee Irwin III, Badari Pulavarty, David Chinner,
Jens Axboe, Adam Litke, Dave Hansen, Avi Kivity
On Wed, 25 Apr 2007, Mel Gorman wrote:
> > + /*
> > + * FIXME: Note the 2M constant here that may prove to
> > + * be a problem if page sizes become bigger than one megabyte.
> > + */
> > + unsigned long this_chunk = page_cache_index(mapping, 2 * 1024 * 1024);
> >
>
> Should readahead just be disabled when the compound page size is as
> large or larger than what readahead normally reads?
I am not sure how to solve that one yet. With the above fix we stay at the
2M-sized readahead; as the compound order increases, the number of pages
per chunk is reduced accordingly. We could keep the number of pages
constant, but then very high orders may cause an excessive use of memory
for readahead.
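For illustration (assuming 4k base pages), the chunking then works out as
follows, so each readahead chunk stays at 2MB while the number of pages
per chunk shrinks:

	order 0 (4k pages)  : this_chunk = 2MB >> 12 = 512 pages
	order 2 (16k pages) : this_chunk = 2MB >> 14 = 128 pages
	order 4 (64k pages) : this_chunk = 2MB >> 16 =  32 pages
	order 9 (2MB pages) : this_chunk = 2MB >> 21 =   1 page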
* Re: [RFC 10/16] Variable Order Page Cache: Readahead fixups
[not found] ` <20070521104204.GA8795@mail.ustc.edu.cn>
@ 2007-05-21 10:42 ` Fengguang Wu
2007-05-21 16:53 ` Christoph Lameter
0 siblings, 1 reply; 40+ messages in thread
From: Fengguang Wu @ 2007-05-21 10:42 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mel Gorman, linux-mm, William Lee Irwin III, Badari Pulavarty,
David Chinner, Jens Axboe, Adam Litke, Dave Hansen, Avi Kivity
On Wed, Apr 25, 2007 at 08:56:12AM -0700, Christoph Lameter wrote:
> On Wed, 25 Apr 2007, Mel Gorman wrote:
>
> > > + /*
> > > + * FIXME: Note the 2M constant here that may prove to
> > > + * be a problem if page sizes become bigger than one megabyte.
> > > + */
> > > + unsigned long this_chunk = page_cache_index(mapping, 2 * 1024 * 1024);
> > >
> >
> > Should readahead just be disabled when the compound page size is as
> > large or larger than what readahead normally reads?
>
> I am not sure how to solve that one yet. With the above fix we stay at the
> 2M sized readahead. As the compound order increases so the number of pages
> is reduced. We could keep the number of pages constant but then very high
> orders may cause a excessive use of memory for readahead.
Do we need to support very high orders (i.e. > 2MB)?
If not, we can define MAX_PAGE_CACHE_SIZE = 2MB and limit page orders to
stay under that threshold. Large readahead can then be done in
MAX_PAGE_CACHE_SIZE chunks.
The attached patch is derived from yours, hope you like it :)
(not tested/compiled yet)
Changes include:
- Introduce MAX_PAGE_CACHE_SIZE, the upper limit of compound page size.
- Scale readahead size with the page cache size in file_ra_state_init().
- Simplify max_sane_readahead() calls by moving them into readahead routines.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
include/linux/mm.h | 2 +-
include/linux/pagemap.h | 1 +
mm/fadvise.c | 4 ++--
mm/filemap.c | 5 ++---
mm/madvise.c | 2 +-
mm/readahead.c | 25 +++++++++++++++----------
6 files changed, 22 insertions(+), 17 deletions(-)
--- linux-2.6.22-rc1-mm1.orig/mm/fadvise.c
+++ linux-2.6.22-rc1-mm1/mm/fadvise.c
@@ -86,10 +86,10 @@ asmlinkage long sys_fadvise64_64(int fd,
nrpages = end_index - start_index + 1;
if (!nrpages)
nrpages = ~0UL;
-
+
ret = force_page_cache_readahead(mapping, file,
start_index,
- max_sane_readahead(nrpages));
+ nrpages);
if (ret > 0)
ret = 0;
break;
--- linux-2.6.22-rc1-mm1.orig/mm/filemap.c
+++ linux-2.6.22-rc1-mm1/mm/filemap.c
@@ -1287,8 +1287,7 @@ do_readahead(struct address_space *mappi
if (!mapping || !mapping->a_ops || !mapping->a_ops->readpage)
return -EINVAL;
- force_page_cache_readahead(mapping, filp, index,
- max_sane_readahead(nr));
+ force_page_cache_readahead(mapping, filp, index, nr);
return 0;
}
@@ -1426,7 +1425,7 @@ retry_find:
count_vm_event(PGMAJFAULT);
}
did_readaround = 1;
- ra_pages = max_sane_readahead(file->f_ra.ra_pages);
+ ra_pages = file->f_ra.ra_pages;
if (ra_pages) {
pgoff_t start = 0;
--- linux-2.6.22-rc1-mm1.orig/mm/madvise.c
+++ linux-2.6.22-rc1-mm1/mm/madvise.c
@@ -124,7 +124,7 @@ static long madvise_willneed(struct vm_a
end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
force_page_cache_readahead(file->f_mapping,
- file, start, max_sane_readahead(end - start));
+ file, start, end - start);
return 0;
}
--- linux-2.6.22-rc1-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc1-mm1/mm/readahead.c
@@ -44,7 +44,8 @@ EXPORT_SYMBOL_GPL(default_backing_dev_in
void
file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
{
- ra->ra_pages = mapping->backing_dev_info->ra_pages;
+ ra->ra_pages = DIV_ROUND_UP(mapping->backing_dev_info->ra_pages,
+ page_cache_size(mapping));
ra->prev_index = -1;
}
EXPORT_SYMBOL_GPL(file_ra_state_init);
@@ -84,7 +85,7 @@ int read_cache_pages(struct address_spac
put_pages_list(pages);
break;
}
- task_io_account_read(PAGE_CACHE_SIZE);
+ task_io_account_read(page_cache_size(mapping));
}
pagevec_lru_add(&lru_pvec);
return ret;
@@ -151,7 +152,7 @@ __do_page_cache_readahead(struct address
if (isize == 0)
goto out;
- end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+ end_index = page_cache_index(mapping, isize - 1);
/*
* Preallocate as many pages as we will need.
@@ -193,8 +194,8 @@ out:
}
/*
- * Chunk the readahead into 2 megabyte units, so that we don't pin too much
- * memory at once.
+ * Chunk the readahead into MAX_PAGE_CACHE_SIZE(2M) units, so that we don't pin
+ * too much memory at once.
*/
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read)
@@ -204,10 +205,11 @@ int force_page_cache_readahead(struct ad
if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
return -EINVAL;
+ nr_to_read = max_sane_readahead(nr_to_read, mapping_order(mapping));
while (nr_to_read) {
int err;
- unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE;
+ unsigned long this_chunk = page_cache_index(mapping, MAX_PAGE_CACHE_SIZE);
if (this_chunk > nr_to_read)
this_chunk = nr_to_read;
@@ -237,17 +239,20 @@ int do_page_cache_readahead(struct addre
if (bdi_read_congested(mapping->backing_dev_info))
return -1;
+ nr_to_read = max_sane_readahead(nr_to_read, mapping_order(mapping));
return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
}
/*
- * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
+ * Given a desired number of page order readahead pages, return a
* sensible upper limit.
*/
-unsigned long max_sane_readahead(unsigned long nr)
+unsigned long max_sane_readahead(unsigned long nr, int order)
{
- return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE)
- + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+ unsigned long base_pages = node_page_state(numa_node_id(), NR_INACTIVE)
+ + node_page_state(numa_node_id(), NR_FREE_PAGES);
+
+ return min(nr, (base_pages / 2) >> order);
}
/*
--- linux-2.6.22-rc1-mm1.orig/include/linux/mm.h
+++ linux-2.6.22-rc1-mm1/include/linux/mm.h
@@ -1163,7 +1163,7 @@ unsigned long page_cache_readahead_ondem
struct page *page,
pgoff_t offset,
unsigned long size);
-unsigned long max_sane_readahead(unsigned long nr);
+unsigned long max_sane_readahead(unsigned long nr, int order);
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
--- linux-2.6.22-rc1-mm1.orig/include/linux/pagemap.h
+++ linux-2.6.22-rc1-mm1/include/linux/pagemap.h
@@ -57,6 +57,7 @@ static inline void mapping_set_gfp_mask(
#define PAGE_CACHE_SIZE PAGE_SIZE
#define PAGE_CACHE_MASK PAGE_MASK
#define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
+#define MAX_PAGE_CACHE_SIZE (2 * 1024 * 1024)
#define page_cache_get(page) get_page(page)
#define page_cache_release(page) put_page(page)
* Re: [RFC 10/16] Variable Order Page Cache: Readahead fixups
2007-05-21 10:42 ` Fengguang Wu
@ 2007-05-21 16:53 ` Christoph Lameter
[not found] ` <20070522005903.GA6184@mail.ustc.edu.cn>
[not found] ` <20070524040453.GA10662@mail.ustc.edu.cn>
0 siblings, 2 replies; 40+ messages in thread
From: Christoph Lameter @ 2007-05-21 16:53 UTC (permalink / raw)
To: Fengguang Wu
Cc: Mel Gorman, linux-mm, William Lee Irwin III, Badari Pulavarty,
David Chinner, Jens Axboe, Adam Litke, Dave Hansen, Avi Kivity
On Mon, 21 May 2007, Fengguang Wu wrote:
> > I am not sure how to solve that one yet. With the above fix we stay at the
> > 2M sized readahead. As the compound order increases so the number of pages
> > is reduced. We could keep the number of pages constant but then very high
> > orders may cause a excessive use of memory for readahead.
>
> Do we need to support very high orders(i.e. >2MB)?
Yes, actually we could potentially be using page sizes of up to 1TB on our
new machines that can support several petabytes of RAM, but readahead is
likely irrelevant in that case. It is an extreme case that will rarely be
used, but a customer has required that we be able to handle such a
situation. I think 2-4 megabytes may be more typical.
> If not, we can define a MAX_PAGE_CACHE_SIZE=2MB, and limit page orders
> under that threshold. Now large readahead can be done in
> MAX_PAGE_CACHE_SIZE chunks.
Maybe we can just logarithmically decrease the number of pages used for
readahead? Readahead should possibly depend on the overall memory of the
machine: if the machine has several terabytes of main memory then a couple
of megabytes of readahead may well be necessary.
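One possible shape of that idea, as a rough sketch (the helper and its
name are hypothetical, not part of any patch in this thread):

	static unsigned long ra_pages_for_order(unsigned long ra_bytes, int order)
	{
		/* fewer pages as the order grows, but never less than one */
		unsigned long pages = ra_bytes >> (PAGE_SHIFT + order);

		return pages ? pages : 1;
	}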
* Re: [RFC 10/16] Variable Order Page Cache: Readahead fixups
[not found] ` <20070522005903.GA6184@mail.ustc.edu.cn>
@ 2007-05-22 0:59 ` Fengguang Wu
0 siblings, 0 replies; 40+ messages in thread
From: Fengguang Wu @ 2007-05-22 0:59 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mel Gorman, linux-mm, William Lee Irwin III, Badari Pulavarty,
David Chinner, Jens Axboe, Adam Litke, Dave Hansen, Avi Kivity
On Mon, May 21, 2007 at 09:53:18AM -0700, Christoph Lameter wrote:
> On Mon, 21 May 2007, Fengguang Wu wrote:
>
> > > I am not sure how to solve that one yet. With the above fix we stay at the
> > > 2M sized readahead. As the compound order increases so the number of pages
> > > is reduced. We could keep the number of pages constant but then very high
> > > orders may cause a excessive use of memory for readahead.
> >
> > Do we need to support very high orders(i.e. >2MB)?
>
> Yes actually we could potentially be using up to 1 TB page size on our
> new machines that can support several petabytes of RAM. But the read
> ahead is likely irrelevant in that case. And this is an extreme case that
> will be rarely used but a customer has required that we will be able to
> handle such a situation. I think 2-4 megabytes may be more typical.
hehe, 1TB page size is amazing.
> > If not, we can define a MAX_PAGE_CACHE_SIZE=2MB, and limit page orders
> > under that threshold. Now large readahead can be done in
> > MAX_PAGE_CACHE_SIZE chunks.
>
> Maybe we can just logarithmically decrease the pages for readahead?
> Readahead should possibly depend on the overall memory of the machine. If
> the machine has several terabytes of main memory then a couple megs of
> readahead may be necessary.
Readahead size can easily be scaled down by:
void file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
{
- ra->ra_pages = mapping->backing_dev_info->ra_pages;
+ ra->ra_pages = DIV_ROUND_UP(mapping->backing_dev_info->ra_pages,
+ page_cache_size(mapping));
ra->prev_index = -1;
}
But it's not about simply decreasing/disabling readahead.
The problem is that we always bring in at least one page at a time.
That is not a problem for 2-4MB page sizes, but to support page sizes up
to 1TB this behavior must be changed.
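To put numbers on it (assuming the usual 128k per-device readahead
window): DIV_ROUND_UP(128k, page_cache_size(mapping)) gives 32 pages at
4k, 8 pages at 16k, and clamps to a single page from a 128k page size
upwards -- but that single page is already 1TB of readahead for a 1TB
compound page, which is why the one-page minimum itself has to change.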
* Re: [RFC 10/16] Variable Order Page Cache: Readahead fixups
[not found] ` <20070524040453.GA10662@mail.ustc.edu.cn>
@ 2007-05-24 4:04 ` Fengguang Wu
2007-05-24 4:06 ` Christoph Lameter
0 siblings, 1 reply; 40+ messages in thread
From: Fengguang Wu @ 2007-05-24 4:04 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mel Gorman, linux-mm, William Lee Irwin III, Badari Pulavarty,
David Chinner, Jens Axboe, Adam Litke, Dave Hansen, Avi Kivity
On Mon, May 21, 2007 at 09:53:18AM -0700, Christoph Lameter wrote:
> On Mon, 21 May 2007, Fengguang Wu wrote:
>
> > > I am not sure how to solve that one yet. With the above fix we stay at the
> > > 2M sized readahead. As the compound order increases so the number of pages
> > > is reduced. We could keep the number of pages constant but then very high
> > > orders may cause a excessive use of memory for readahead.
> >
> > Do we need to support very high orders(i.e. >2MB)?
>
> Yes actually we could potentially be using up to 1 TB page size on our
> new machines that can support several petabytes of RAM. But the read
> ahead is likely irrelevant in that case. And this is an extreme case that
> will be rarely used but a customer has required that we will be able to
> handle such a situation. I think 2-4 megabytes may be more typical.
So we do not want to enforce a maximum page size.
The patch is updated to only decrease the number of readahead pages as
the page size increases, until it falls to one page. If the page size
continues to increase beyond that, the I/O size will increase anyway.
===================================================================
---
include/linux/mm.h | 2 +-
mm/fadvise.c | 4 ++--
mm/filemap.c | 5 ++---
mm/madvise.c | 2 +-
mm/readahead.c | 22 ++++++++++++++--------
5 files changed, 20 insertions(+), 15 deletions(-)
--- linux-2.6.22-rc1-mm1.orig/mm/fadvise.c
+++ linux-2.6.22-rc1-mm1/mm/fadvise.c
@@ -86,10 +86,10 @@ asmlinkage long sys_fadvise64_64(int fd,
nrpages = end_index - start_index + 1;
if (!nrpages)
nrpages = ~0UL;
-
+
ret = force_page_cache_readahead(mapping, file,
start_index,
- max_sane_readahead(nrpages));
+ nrpages);
if (ret > 0)
ret = 0;
break;
--- linux-2.6.22-rc1-mm1.orig/mm/filemap.c
+++ linux-2.6.22-rc1-mm1/mm/filemap.c
@@ -1287,8 +1287,7 @@ do_readahead(struct address_space *mappi
if (!mapping || !mapping->a_ops || !mapping->a_ops->readpage)
return -EINVAL;
- force_page_cache_readahead(mapping, filp, index,
- max_sane_readahead(nr));
+ force_page_cache_readahead(mapping, filp, index, nr);
return 0;
}
@@ -1426,7 +1425,7 @@ retry_find:
count_vm_event(PGMAJFAULT);
}
did_readaround = 1;
- ra_pages = max_sane_readahead(file->f_ra.ra_pages);
+ ra_pages = file->f_ra.ra_pages;
if (ra_pages) {
pgoff_t start = 0;
--- linux-2.6.22-rc1-mm1.orig/mm/madvise.c
+++ linux-2.6.22-rc1-mm1/mm/madvise.c
@@ -124,7 +124,7 @@ static long madvise_willneed(struct vm_a
end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
force_page_cache_readahead(file->f_mapping,
- file, start, max_sane_readahead(end - start));
+ file, start, end - start);
return 0;
}
--- linux-2.6.22-rc1-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc1-mm1/mm/readahead.c
@@ -44,7 +44,8 @@ EXPORT_SYMBOL_GPL(default_backing_dev_in
void
file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
{
- ra->ra_pages = mapping->backing_dev_info->ra_pages;
+ ra->ra_pages = DIV_ROUND_UP(mapping->backing_dev_info->ra_pages,
+ page_cache_size(mapping));
ra->prev_index = -1;
}
EXPORT_SYMBOL_GPL(file_ra_state_init);
@@ -84,7 +85,7 @@ int read_cache_pages(struct address_spac
put_pages_list(pages);
break;
}
- task_io_account_read(PAGE_CACHE_SIZE);
+ task_io_account_read(page_cache_size(mapping));
}
pagevec_lru_add(&lru_pvec);
return ret;
@@ -151,7 +152,7 @@ __do_page_cache_readahead(struct address
if (isize == 0)
goto out;
- end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+ end_index = page_cache_index(mapping, isize - 1);
/*
* Preallocate as many pages as we will need.
@@ -204,10 +205,12 @@ int force_page_cache_readahead(struct ad
if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
return -EINVAL;
+ nr_to_read = max_sane_readahead(nr_to_read, mapping_order(mapping));
while (nr_to_read) {
int err;
- unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE;
+ unsigned long this_chunk = DIV_ROUND_UP(2 * 1024 * 1024,
+ page_cache_size(mapping));
if (this_chunk > nr_to_read)
this_chunk = nr_to_read;
@@ -237,17 +240,20 @@ int do_page_cache_readahead(struct addre
if (bdi_read_congested(mapping->backing_dev_info))
return -1;
+ nr_to_read = max_sane_readahead(nr_to_read, mapping_order(mapping));
return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
}
/*
- * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
+ * Given a desired number of page order readahead pages, return a
* sensible upper limit.
*/
-unsigned long max_sane_readahead(unsigned long nr)
+unsigned long max_sane_readahead(unsigned long nr, int order)
{
- return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE)
- + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+ unsigned long base_pages = node_page_state(numa_node_id(), NR_INACTIVE)
+ + node_page_state(numa_node_id(), NR_FREE_PAGES);
+
+ return min(nr, (base_pages / 2) >> order);
}
/*
--- linux-2.6.22-rc1-mm1.orig/include/linux/mm.h
+++ linux-2.6.22-rc1-mm1/include/linux/mm.h
@@ -1163,7 +1163,7 @@ unsigned long page_cache_readahead_ondem
struct page *page,
pgoff_t offset,
unsigned long size);
-unsigned long max_sane_readahead(unsigned long nr);
+unsigned long max_sane_readahead(unsigned long nr, int order);
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
* Re: [RFC 10/16] Variable Order Page Cache: Readahead fixups
2007-05-24 4:04 ` Fengguang Wu
@ 2007-05-24 4:06 ` Christoph Lameter
0 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2007-05-24 4:06 UTC (permalink / raw)
To: Fengguang Wu
Cc: Mel Gorman, linux-mm, William Lee Irwin III, Badari Pulavarty,
David Chinner, Jens Axboe, Adam Litke, Dave Hansen, Avi Kivity
On Thu, 24 May 2007, Fengguang Wu wrote:
> So we do not want to enforce a maximum page size.
> The patch is updated to only decrease the readahead pages on increased
> page size, until it falls to 1. If page size continues to increase,
> the I/O size will increase anyway.
Ahh Great! I will put that into the next rollup.
* [RFC 08/16] Variable Order Page Cache: Fixup fallback functions
2007-04-23 6:21 clameter
@ 2007-04-23 6:21 ` clameter
0 siblings, 0 replies; 40+ messages in thread
From: clameter @ 2007-04-23 6:21 UTC (permalink / raw)
To: linux-mm
Cc: Mel Gorman, William Lee Irwin III, Adam Litke, David Chinner,
Jens Axboe, Avi Kivity, Dave Hansen, Badari Pulavarty,
Maxim Levitsky
[-- Attachment #1: var_pc_libfs --]
[-- Type: text/plain, Size: 2082 bytes --]
Fix up the fallback functions in fs/libfs.c to be able to handle
higher order page cache pages.
FIXME: There is a use of kmap here that we leave unchanged
(none of my testing platforms use highmem). There needs to
be some way to clear higher order partial pages if a platform
supports HIGHMEM.
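A minimal sketch of what such a partial clear could look like (an
assumption only, not part of the patch: it walks the base pages of the
compound page so that each kmap_atomic() window covers a single page):

	static void zero_mapping_page_range(struct page *page,
					unsigned int start, unsigned int end)
	{
		while (start < end) {
			struct page *p = page + (start >> PAGE_SHIFT);
			unsigned int offset = start & (PAGE_SIZE - 1);
			unsigned int len = min_t(unsigned int, end - start,
							PAGE_SIZE - offset);
			void *kaddr = kmap_atomic(p, KM_USER0);

			memset(kaddr + offset, 0, len);
			kunmap_atomic(kaddr, KM_USER0);
			flush_dcache_page(p);
			start += len;
		}
	}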
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/libfs.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
Index: linux-2.6.21-rc7/fs/libfs.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/libfs.c 2007-04-22 17:28:04.000000000 -0700
+++ linux-2.6.21-rc7/fs/libfs.c 2007-04-22 17:38:58.000000000 -0700
@@ -320,8 +320,8 @@ int simple_rename(struct inode *old_dir,
int simple_readpage(struct file *file, struct page *page)
{
- clear_highpage(page);
- flush_dcache_page(page);
+ clear_mapping_page(page);
+ flush_mapping_page(page);
SetPageUptodate(page);
unlock_page(page);
return 0;
@@ -331,11 +331,15 @@ int simple_prepare_write(struct file *fi
unsigned from, unsigned to)
{
if (!PageUptodate(page)) {
- if (to - from != PAGE_CACHE_SIZE) {
+ if (to - from != page_cache_size(file->f_mapping)) {
+ /*
+ * Mapping to higher order pages need to be supported
+ * if higher order pages can be in highmem
+ */
void *kaddr = kmap_atomic(page, KM_USER0);
memset(kaddr, 0, from);
- memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
- flush_dcache_page(page);
+ memset(kaddr + to, 0, page_cache_size(file->f_mapping) - to);
+ flush_mapping_page(page);
kunmap_atomic(kaddr, KM_USER0);
}
}
@@ -345,8 +349,9 @@ int simple_prepare_write(struct file *fi
int simple_commit_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- struct inode *inode = page->mapping->host;
- loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
+ loff_t pos = page_cache_pos(mapping, page->index, to);
if (!PageUptodate(page))
SetPageUptodate(page);
--
Thread overview: 40+ messages
2007-04-23 6:48 [RFC 00/16] Variable Order Page Cache Patchset V2 Christoph Lameter
2007-04-23 6:48 ` [RFC 01/16] Free up page->private for compound pages Christoph Lameter
2007-04-24 2:12 ` Dave Hansen
2007-04-24 2:23 ` Christoph Lameter
2007-04-25 10:55 ` Mel Gorman
2007-04-23 6:48 ` [RFC 02/16] vmstat.c: Support accounting " Christoph Lameter
2007-04-25 10:59 ` Mel Gorman
2007-04-25 15:43 ` Christoph Lameter
2007-04-23 6:49 ` [RFC 03/16] Variable Order Page Cache: Add order field in mapping Christoph Lameter
2007-04-25 11:05 ` Mel Gorman
2007-04-23 6:49 ` [RFC 04/16] Variable Order Page Cache: Add basic allocation functions Christoph Lameter
2007-04-23 6:49 ` [RFC 05/16] Variable Order Page Cache: Add functions to establish sizes Christoph Lameter
2007-04-25 11:20 ` Mel Gorman
2007-04-25 15:54 ` Christoph Lameter
2007-04-23 6:49 ` [RFC 06/16] Variable Page Cache: Add VM_BUG_ONs to check for correct page order Christoph Lameter
2007-04-25 11:22 ` Mel Gorman
2007-04-23 6:49 ` [RFC 07/16] Variable Order Page Cache: Add clearing and flushing function Christoph Lameter
2007-04-23 6:49 ` [RFC 08/16] Variable Order Page Cache: Fixup fallback functions Christoph Lameter
2007-04-23 6:49 ` [RFC 09/16] Variable Order Page Cache: Fix up mm/filemap.c Christoph Lameter
2007-04-23 6:49 ` [RFC 10/16] Variable Order Page Cache: Readahead fixups Christoph Lameter
2007-04-25 11:36 ` Mel Gorman
2007-04-25 15:56 ` Christoph Lameter
[not found] ` <20070521104204.GA8795@mail.ustc.edu.cn>
2007-05-21 10:42 ` Fengguang Wu
2007-05-21 16:53 ` Christoph Lameter
[not found] ` <20070522005903.GA6184@mail.ustc.edu.cn>
2007-05-22 0:59 ` Fengguang Wu
[not found] ` <20070524040453.GA10662@mail.ustc.edu.cn>
2007-05-24 4:04 ` Fengguang Wu
2007-05-24 4:06 ` Christoph Lameter
2007-04-23 6:49 ` [RFC 11/16] Variable Page Cache Size: Fix up reclaim counters Christoph Lameter
2007-04-25 13:08 ` Mel Gorman
2007-04-23 6:49 ` [RFC 12/16] Variable Order Page Cache: Fix up the writeback logic Christoph Lameter
2007-04-23 6:49 ` [RFC 13/16] Variable Order Page Cache: Fixed to block layer Christoph Lameter
2007-04-23 6:49 ` [RFC 14/16] Variable Order Page Cache: Add support to ramfs Christoph Lameter
2007-04-23 6:50 ` [RFC 15/16] ext2: Add variable page size support Christoph Lameter
2007-04-23 16:30 ` Badari Pulavarty
2007-04-24 1:11 ` Christoph Lameter
2007-04-23 6:50 ` [RFC 16/16] Variable Order Page Cache: Alternate implementation of page cache macros Christoph Lameter
2007-04-25 13:16 ` Mel Gorman
2007-04-23 9:23 ` [RFC 00/16] Variable Order Page Cache Patchset V2 David Chinner
2007-04-23 9:31 ` David Chinner
-- strict thread matches above, loose matches on Subject: below --
2007-04-23 6:21 clameter
2007-04-23 6:21 ` [RFC 08/16] Variable Order Page Cache: Fixup fallback functions clameter