[RFC][patch 0/2] mm: remove PageReserved

* [RFC][patch 0/2] mm: remove PageReserved
@ 2005-08-07  3:28 Nick Piggin
  2005-08-07  3:29 ` [patch 1/2] mm: remap ZERO_PAGE mappings Nick Piggin
  2005-08-08 21:09 ` [RFC][patch 0/2] mm: " Daniel Phillips
  0 siblings, 2 replies; 91+ messages in thread
From: Nick Piggin @ 2005-08-07  3:28 UTC (permalink / raw)
  To: linux-kernel, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

Hi,

I'll be looking to send these off to Andrew after 2.6.14 opens,
with the aim of having them merged by 2.6.15 hopefully.

It doesn't look like they'll be able to easily free up a page
flag for 2 reasons. First, PageReserved will probably be kept
around for at least one release. Second, swsusp and some arch
code (ioremap) want to know about struct pages that don't point
to valid RAM - currently they use PageReserved, but we'll probably
just introduce a PageValidRAM or something when PageReserved goes.

I believe this makes memory management cleaner and easier to
understand. My other reason behind this is that the lockless
pagecache patches need it for sane page refcounting.

If anyone has an issue with the patches or my merge plan, let's
get some discussion going.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [patch 1/2] mm: remap ZERO_PAGE mappings
  2005-08-07  3:28 [RFC][patch 0/2] mm: remove PageReserved Nick Piggin
@ 2005-08-07  3:29 ` Nick Piggin
  2005-08-07  3:30   ` [patch 2/2] mm: core remove PageReserved Nick Piggin
  2005-08-08 21:09 ` [RFC][patch 0/2] mm: " Daniel Phillips
  1 sibling, 1 reply; 91+ messages in thread
From: Nick Piggin @ 2005-08-07  3:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linux Memory Management, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 132 bytes --]

1/2

I think this is already in -mm (and can probably go
into 2.6.14). Included here for completeness.

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: mm-remap-ZERO_PAGE-mappings.patch --]
[-- Type: text/plain, Size: 1067 bytes --]

Remap ZERO_PAGE ptes when remapping memory. This is currently just an
optimisation for MIPS, which is the only architecture with multiple
zero pages - it now retains the mapping it needs for good cache performance,
and as well do_wp_page is now able to always correctly detect and
optimise zero page COW faults.

This change is required in order to be able to detect whether a pte
points to a ZERO_PAGE using only its (pte, vaddr) pair.
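
To illustrate why the remap matters, here is a small userspace model
(not kernel code). It assumes a MIPS-like setup where the zero page is
picked by the colour of the virtual address. NR_ZERO_PAGES,
ZERO_PFN_BASE and all function names are hypothetical and exist only
for this sketch.

```c
#include <assert.h>

#define PAGE_SHIFT	12
#define NR_ZERO_PAGES	8UL	/* hypothetical: number of zero pages, power of two */
#define ZERO_PFN_BASE	100UL	/* hypothetical: pfn of the first zero page */

/* Model of ZERO_PAGE(vaddr): the zero page is chosen by address colour. */
static unsigned long zero_pfn(unsigned long vaddr)
{
	return ZERO_PFN_BASE + ((vaddr >> PAGE_SHIFT) & (NR_ZERO_PAGES - 1));
}

/* A pte maps a zero page iff its pfn is the zero page for its own vaddr. */
static int maps_zero_page(unsigned long pfn, unsigned long vaddr)
{
	return pfn == zero_pfn(vaddr);
}

/* Model of the move_one_page() fixup: re-pick the zero page for new_addr. */
static unsigned long move_pfn(unsigned long pfn, unsigned long old_addr,
			      unsigned long new_addr)
{
	if (maps_zero_page(pfn, old_addr))
		return zero_pfn(new_addr);
	return pfn;
}

static void zero_page_example(void)
{
	unsigned long old_addr = 0x1000, new_addr = 0x3000;
	unsigned long pfn = zero_pfn(old_addr);

	/* Without the fixup, the moved pte no longer looks like a zero page. */
	assert(!maps_zero_page(pfn, new_addr));
	/* With the fixup, the (pfn, vaddr) test works again. */
	assert(maps_zero_page(move_pfn(pfn, old_addr, new_addr), new_addr));
}
```

On an architecture with a single zero page the fixup is a no-op in this
model, which matches the "just an optimisation for MIPS" remark.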

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c
+++ linux-2.6/mm/mremap.c
@@ -141,6 +141,10 @@ move_one_page(struct vm_area_struct *vma
 			if (dst) {
 				pte_t pte;
 				pte = ptep_clear_flush(vma, old_addr, src);
+				/* ZERO_PAGE can be dependent on virtual addr */
+				if (pfn_valid(pte_pfn(pte)) &&
+					pte_page(pte) == ZERO_PAGE(old_addr))
+					pte = pte_wrprotect(mk_pte(ZERO_PAGE(new_addr), new_vma->vm_page_prot));
 				set_pte_at(mm, new_addr, dst, pte);
 			} else
 				error = -ENOMEM;


* [patch 2/2] mm: core remove PageReserved
  2005-08-07  3:29 ` [patch 1/2] mm: remap ZERO_PAGE mappings Nick Piggin
@ 2005-08-07  3:30   ` Nick Piggin
  0 siblings, 0 replies; 91+ messages in thread
From: Nick Piggin @ 2005-08-07  3:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linux Memory Management, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 33 bytes --]

2/2

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: mm-core-remove-PageReserved.patch --]
[-- Type: text/plain, Size: 26121 bytes --]

Remove PageReserved() calls from core code by tightening VM_RESERVED
handling in mm/ to cover PageReserved functionality.

PageReserved special casing is removed from get_page and put_page.

All setting and clearing of PageReserved is retained, and it is now
flagged in the page_alloc checks to help ensure we don't introduce
any refcount based freeing of Reserved pages.
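
A rough userspace model of the intended refcounting, as a sketch only.
struct page_model and the helper names are hypothetical. The starting
count of 1 for reserved pages mirrors the memmap_init_zone() change in
the patch. The "bad" flag stands in for the bad_page() report; what the
real allocator does after reporting is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

struct page_model {
	int count;	/* reference count */
	bool reserved;	/* models PG_reserved */
	bool freed;	/* handed back to the allocator */
	bool bad;	/* models a bad_page() report */
};

static void get_page_model(struct page_model *page)
{
	page->count++;
}

/* No PageReserved() test here: the count alone decides. */
static void put_page_model(struct page_model *page)
{
	assert(page->count > 0);
	if (--page->count != 0)
		return;
	if (page->reserved)
		page->bad = true;	/* the page_alloc check would complain */
	else
		page->freed = true;
}

static void refcount_example(void)
{
	/* Reserved pages start with one reference. */
	struct page_model rsv = { .count = 1, .reserved = true };
	struct page_model normal = { .count = 1 };

	/* Balanced get/put never drops a reserved page to zero. */
	get_page_model(&rsv);
	put_page_model(&rsv);
	assert(rsv.count == 1 && !rsv.freed && !rsv.bad);

	/* An unbalanced put is a refcounting bug and gets flagged. */
	put_page_model(&rsv);
	assert(rsv.bad && !rsv.freed);

	/* Ordinary pages are freed when the last reference goes. */
	put_page_model(&normal);
	assert(normal.freed && !normal.bad);
}
```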

MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
deprecated. We never completely handled it correctly anyway, and it is
difficult to handle nicely. Difficult but not impossible: it could
be reintroduced in future if required (Hugh has a proof of concept).
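
The private, writable, reserved case is picked out with a single mask
test in the mmap.c hunk. A standalone sketch of that test follows. The
VM_RESERVED value is taken from the patch; the VM_WRITE and VM_SHARED
values are what I believe include/linux/mm.h uses, so treat them as
assumptions. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define VM_WRITE	0x00000002UL	/* assumed value */
#define VM_SHARED	0x00000008UL	/* assumed value */
#define VM_RESERVED	0x00080000UL	/* value from the patch */

/* True only for a mapping that is writable, reserved and not shared. */
static bool private_writable_reserved(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED))
			== (VM_WRITE | VM_RESERVED);
}

static void flags_example(void)
{
	/* MAP_PRIVATE + PROT_WRITE on a reserved region: rejected. */
	assert(private_writable_reserved(VM_WRITE | VM_RESERVED));
	/* MAP_SHARED is still fine. */
	assert(!private_writable_reserved(VM_SHARED | VM_WRITE | VM_RESERVED));
	/* Read-only private mappings are still fine. */
	assert(!private_writable_reserved(VM_RESERVED));
	/* Ordinary writable memory is unaffected. */
	assert(!private_writable_reserved(VM_WRITE));
}
```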

Once PageReserved() calls are removed from kernel/power/swsusp.c, and
all arch/ and driver code, the Set and Clear calls, and the PG_reserved
bit can be trivially removed.

Last real user of PageReserved is swsusp, which uses PageReserved to
determine whether a struct page points to valid memory or not. This
still needs to be addressed.

Many thanks to Hugh Dickins for input.

Signed-off-by: Nick Piggin <npiggin@suse.de>


Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -156,7 +156,8 @@ extern unsigned int kobjsize(const void 
 
 #define VM_DONTCOPY	0x00020000      /* Do not copy this vma on fork */
 #define VM_DONTEXPAND	0x00040000	/* Cannot expand with mremap() */
-#define VM_RESERVED	0x00080000	/* Don't unmap it from swap_out */
+#define VM_RESERVED	0x00080000	/* Pages and ptes in region aren't managed with regular pagecache or rmap routines */
+
 #define VM_ACCOUNT	0x00100000	/* Is a VM accounted object */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
@@ -337,7 +338,7 @@ static inline void get_page(struct page 
 
 static inline void put_page(struct page *page)
 {
-	if (!PageReserved(page) && put_page_testzero(page))
+	if (put_page_testzero(page))
 		__page_cache_release(page);
 }
 
@@ -723,6 +724,9 @@ void install_arg_page(struct vm_area_str
 
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
 		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
+#define invalid_pfn(pte, vm_flags, vaddr)		\
+		__invalid_pfn(__FUNCTION__, pte, vm_flags, vaddr)
+void __invalid_pfn(const char *, pte_t, unsigned long, unsigned long);
 
 int __set_page_dirty_buffers(struct page *page);
 int __set_page_dirty_nobuffers(struct page *page);
Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -123,7 +123,7 @@ static long madvise_dontneed(struct vm_a
 			     unsigned long start, unsigned long end)
 {
 	*prev = vma;
-	if ((vma->vm_flags & VM_LOCKED) || is_vm_hugetlb_page(vma))
+	if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) || is_vm_hugetlb_page(vma))
 		return -EINVAL;
 
 	if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -333,6 +333,21 @@ out:
 }
 
 /*
+ * This function is called to print an error when a pte in a
+ * !VM_RESERVED region is found pointing to an invalid pfn (which
+ * is an error).
+ *
+ * The calling function must still handle the error.
+ */
+void __invalid_pfn(const char *errfunc, pte_t pte,
+				unsigned long vm_flags, unsigned long vaddr)
+{
+	printk(KERN_ERR "%s: pte does not point to valid memory. "
+		"process = %s, pte = %08lx, vm_flags = %lx, vaddr = %lx\n",
+		errfunc, current->comm, (long)pte_val(pte), vm_flags, vaddr);
+}
+
+/*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
  * covered by this vma.
@@ -361,25 +376,29 @@ copy_one_pte(struct mm_struct *dst_mm, s
 				spin_unlock(&mmlist_lock);
 			}
 		}
-		set_pte_at(dst_mm, addr, dst_pte, pte);
-		return;
+		goto out_set_pte;
 	}
 
+	/* If the region is VM_RESERVED, the mapping is not
+	 * mapped via rmap - duplicate the pte as is.
+	 */
+	if (vm_flags & VM_RESERVED)
+		goto out_set_pte;
+
+	/* If the pte points outside of valid memory but
+	 * the region is not VM_RESERVED, we have a problem.
+	 */
 	pfn = pte_pfn(pte);
-	/* the pte points outside of valid memory, the
-	 * mapping is assumed to be good, meaningful
-	 * and not mapped via rmap - duplicate the
-	 * mapping as is.
-	 */
-	page = NULL;
-	if (pfn_valid(pfn))
-		page = pfn_to_page(pfn);
-
-	if (!page || PageReserved(page)) {
-		set_pte_at(dst_mm, addr, dst_pte, pte);
-		return;
+	if (unlikely(!pfn_valid(pfn))) {
+		invalid_pfn(pte, vm_flags, addr);
+		goto out_set_pte; /* try to do something sane */
 	}
 
+	page = pfn_to_page(pfn);
+	/* Mappings to zero pages aren't covered by rmap either. */
+	if (page == ZERO_PAGE(addr))
+		goto out_set_pte;
+
 	/*
 	 * If it's a COW mapping, write protect it both
 	 * in the parent and the child
@@ -400,8 +419,9 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	inc_mm_counter(dst_mm, rss);
 	if (PageAnon(page))
 		inc_mm_counter(dst_mm, anon_rss);
-	set_pte_at(dst_mm, addr, dst_pte, pte);
 	page_dup_rmap(page);
+out_set_pte:
+	set_pte_at(dst_mm, addr, dst_pte, pte);
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -514,7 +534,8 @@ int copy_page_range(struct mm_struct *ds
 	return 0;
 }
 
-static void zap_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+static void zap_pte_range(struct mmu_gather *tlb,
+				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
 				struct zap_details *details)
 {
@@ -528,10 +549,14 @@ static void zap_pte_range(struct mmu_gat
 		if (pte_present(ptent)) {
 			struct page *page = NULL;
 			unsigned long pfn = pte_pfn(ptent);
-			if (pfn_valid(pfn)) {
-				page = pfn_to_page(pfn);
-				if (PageReserved(page))
-					page = NULL;
+			if (!(vma->vm_flags & VM_RESERVED)) {
+				if (unlikely(!pfn_valid(pfn))) {
+					invalid_pfn(ptent, vma->vm_flags, addr);
+				} else {
+					page = pfn_to_page(pfn);
+					if (page == ZERO_PAGE(addr))
+						page = NULL;
+				}
 			}
 			if (unlikely(details) && page) {
 				/*
@@ -584,7 +609,8 @@ static void zap_pte_range(struct mmu_gat
 	pte_unmap(pte - 1);
 }
 
-static inline void zap_pmd_range(struct mmu_gather *tlb, pud_t *pud,
+static inline void zap_pmd_range(struct mmu_gather *tlb,
+				struct vm_area_struct *vma, pud_t *pud,
 				unsigned long addr, unsigned long end,
 				struct zap_details *details)
 {
@@ -596,11 +622,12 @@ static inline void zap_pmd_range(struct 
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		zap_pte_range(tlb, pmd, addr, next, details);
+		zap_pte_range(tlb, vma, pmd, addr, next, details);
 	} while (pmd++, addr = next, addr != end);
 }
 
-static inline void zap_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
+static inline void zap_pud_range(struct mmu_gather *tlb,
+				struct vm_area_struct *vma, pgd_t *pgd,
 				unsigned long addr, unsigned long end,
 				struct zap_details *details)
 {
@@ -612,7 +639,7 @@ static inline void zap_pud_range(struct 
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		zap_pmd_range(tlb, pud, addr, next, details);
+		zap_pmd_range(tlb, vma, pud, addr, next, details);
 	} while (pud++, addr = next, addr != end);
 }
 
@@ -633,7 +660,7 @@ static void unmap_page_range(struct mmu_
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		zap_pud_range(tlb, pgd, addr, next, details);
+		zap_pud_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
 }
@@ -933,7 +960,7 @@ int get_user_pages(struct task_struct *t
 			continue;
 		}
 
-		if (!vma || (vma->vm_flags & VM_IO)
+		if (!vma || (vma->vm_flags & (VM_IO | VM_RESERVED))
 				|| !(flags & vma->vm_flags))
 			return i ? : -EFAULT;
 
@@ -993,8 +1020,7 @@ int get_user_pages(struct task_struct *t
 			if (pages) {
 				pages[i] = page;
 				flush_dcache_page(page);
-				if (!PageReserved(page))
-					page_cache_get(page);
+				page_cache_get(page);
 			}
 			if (vmas)
 				vmas[i] = vma;
@@ -1098,8 +1124,7 @@ static int remap_pte_range(struct mm_str
 		return -ENOMEM;
 	do {
 		BUG_ON(!pte_none(*pte));
-		if (!pfn_valid(pfn) || PageReserved(pfn_to_page(pfn)))
-			set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
+		set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	pte_unmap(pte - 1);
@@ -1239,6 +1264,8 @@ static int do_wp_page(struct mm_struct *
 	pte_t entry;
 	int ret;
 
+	BUG_ON(vma->vm_flags & VM_RESERVED);
+
 	if (unlikely(!pfn_valid(pfn))) {
 		/*
 		 * This should really halt the system so it can be debugged or
@@ -1246,9 +1273,8 @@ static int do_wp_page(struct mm_struct *
 		 * data, but for the moment just pretend this is OOM.
 		 */
 		pte_unmap(page_table);
-		printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
-				address);
 		spin_unlock(&mm->page_table_lock);
+		invalid_pfn(pte, vma->vm_flags, address);
 		return VM_FAULT_OOM;
 	}
 	old_page = pfn_to_page(pfn);
@@ -1273,13 +1299,16 @@ static int do_wp_page(struct mm_struct *
 	/*
 	 * Ok, we need to copy. Oh, well..
 	 */
-	if (!PageReserved(old_page))
+	if (old_page == ZERO_PAGE(address))
+		old_page = NULL;
+	else
 		page_cache_get(old_page);
+
 	spin_unlock(&mm->page_table_lock);
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto no_new_page;
-	if (old_page == ZERO_PAGE(address)) {
+	if (old_page == NULL) {
 		new_page = alloc_zeroed_user_highpage(vma, address);
 		if (!new_page)
 			goto no_new_page;
@@ -1296,12 +1325,13 @@ static int do_wp_page(struct mm_struct *
 	spin_lock(&mm->page_table_lock);
 	page_table = pte_offset_map(pmd, address);
 	if (likely(pte_same(*page_table, pte))) {
-		if (PageAnon(old_page))
-			dec_mm_counter(mm, anon_rss);
-		if (PageReserved(old_page))
+		if (old_page == NULL)
 			inc_mm_counter(mm, rss);
-		else
+		else {
 			page_remove_rmap(old_page);
+			if (PageAnon(old_page))
+				dec_mm_counter(mm, anon_rss);
+		}
 		flush_cache_page(vma, address, pfn);
 		break_cow(vma, new_page, address, page_table);
 		lru_cache_add_active(new_page);
@@ -1312,13 +1342,16 @@ static int do_wp_page(struct mm_struct *
 		ret |= VM_FAULT_WRITE;
 	}
 	pte_unmap(page_table);
-	page_cache_release(new_page);
-	page_cache_release(old_page);
+	if (old_page) {
+		page_cache_release(new_page);
+		page_cache_release(old_page);
+	}
 	spin_unlock(&mm->page_table_lock);
 	return ret;
 
 no_new_page:
-	page_cache_release(old_page);
+	if (old_page)
+		page_cache_release(old_page);
 	return VM_FAULT_OOM;
 }
 
@@ -1755,7 +1788,7 @@ do_anonymous_page(struct mm_struct *mm, 
 	struct page * page = ZERO_PAGE(addr);
 
 	/* Read-only mapping of ZERO_PAGE. */
-	entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+	entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
 
 	/* ..except if it's a write access */
 	if (write_access) {
@@ -1894,9 +1927,6 @@ retry:
 	 */
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
-		if (!PageReserved(new_page))
-			inc_mm_counter(mm, rss);
-
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
@@ -1905,8 +1935,10 @@ retry:
 		if (anon) {
 			lru_cache_add_active(new_page);
 			page_add_anon_rmap(new_page, vma, address);
-		} else
+		} else if (!(vma->vm_flags & VM_RESERVED)) {
 			page_add_file_rmap(new_page);
+			inc_mm_counter(mm, rss);
+		}
 		pte_unmap(page_table);
 	} else {
 		/* One of our sibling threads was faster, back out. */
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -113,7 +113,8 @@ static void bad_page(const char *functio
 			1 << PG_reclaim |
 			1 << PG_slab    |
 			1 << PG_swapcache |
-			1 << PG_writeback);
+			1 << PG_writeback |
+			1 << PG_reserved );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
 	page->mapping = NULL;
@@ -243,7 +244,6 @@ static inline int page_is_buddy(struct p
 {
        if (PagePrivate(page)           &&
            (page_order(page) == order) &&
-           !PageReserved(page)         &&
             page_count(page) == 0)
                return 1;
        return 0;
@@ -326,7 +326,8 @@ static inline void free_pages_check(cons
 			1 << PG_reclaim	|
 			1 << PG_slab	|
 			1 << PG_swapcache |
-			1 << PG_writeback )))
+			1 << PG_writeback |
+			1 << PG_reserved )))
 		bad_page(function, page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
@@ -454,7 +455,8 @@ static void prep_new_page(struct page *p
 			1 << PG_reclaim	|
 			1 << PG_slab    |
 			1 << PG_swapcache |
-			1 << PG_writeback )))
+			1 << PG_writeback |
+			1 << PG_reserved )))
 		bad_page(__FUNCTION__, page);
 
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
@@ -1011,7 +1013,7 @@ void __pagevec_free(struct pagevec *pvec
 
 fastcall void __free_pages(struct page *page, unsigned int order)
 {
-	if (!PageReserved(page) && put_page_testzero(page)) {
+	if (put_page_testzero(page)) {
 		if (order == 0)
 			free_hot_page(page);
 		else
@@ -1653,7 +1655,7 @@ void __init memmap_init_zone(unsigned lo
 			continue;
 		page = pfn_to_page(pfn);
 		set_page_links(page, zone, nid, pfn);
-		set_page_count(page, 0);
+		set_page_count(page, 1);
 		reset_page_mapcount(page);
 		SetPageReserved(page);
 		INIT_LIST_HEAD(&page->lru);
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -48,7 +48,7 @@ void put_page(struct page *page)
 		}
 		return;
 	}
-	if (!PageReserved(page) && put_page_testzero(page))
+	if (put_page_testzero(page))
 		__page_cache_release(page);
 }
 EXPORT_SYMBOL(put_page);
@@ -215,7 +215,7 @@ void release_pages(struct page **pages, 
 		struct page *page = pages[i];
 		struct zone *pagezone;
 
-		if (PageReserved(page) || !put_page_testzero(page))
+		if (!put_page_testzero(page))
 			continue;
 
 		pagezone = page_zone(page);
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c
+++ linux-2.6/mm/fremap.c
@@ -29,18 +29,21 @@ static inline void zap_pte(struct mm_str
 		return;
 	if (pte_present(pte)) {
 		unsigned long pfn = pte_pfn(pte);
+		struct page *page;
 
 		flush_cache_page(vma, addr, pfn);
 		pte = ptep_clear_flush(vma, addr, ptep);
-		if (pfn_valid(pfn)) {
-			struct page *page = pfn_to_page(pfn);
-			if (!PageReserved(page)) {
-				if (pte_dirty(pte))
-					set_page_dirty(page);
-				page_remove_rmap(page);
-				page_cache_release(page);
-				dec_mm_counter(mm, rss);
-			}
+		if (unlikely(!pfn_valid(pfn))) {
+			invalid_pfn(pte, vma->vm_flags, addr);
+			return;
+		}
+		page = pfn_to_page(pfn);
+		if (page != ZERO_PAGE(addr)) {
+			if (pte_dirty(pte))
+				set_page_dirty(page);
+			page_remove_rmap(page);
+			dec_mm_counter(mm, rss);
+			page_cache_release(page);
 		}
 	} else {
 		if (!pte_file(pte))
@@ -65,6 +68,8 @@ int install_page(struct mm_struct *mm, s
 	pgd_t *pgd;
 	pte_t pte_val;
 
+	BUG_ON(vma->vm_flags & VM_RESERVED);
+
 	pgd = pgd_offset(mm, addr);
 	spin_lock(&mm->page_table_lock);
 	
@@ -122,6 +127,8 @@ int install_file_pte(struct mm_struct *m
 	pgd_t *pgd;
 	pte_t pte_val;
 
+	BUG_ON(vma->vm_flags & VM_RESERVED);
+
 	pgd = pgd_offset(mm, addr);
 	spin_lock(&mm->page_table_lock);
 	
Index: linux-2.6/mm/msync.c
===================================================================
--- linux-2.6.orig/mm/msync.c
+++ linux-2.6/mm/msync.c
@@ -37,11 +37,11 @@ static void sync_pte_range(struct vm_are
 		if (!pte_maybe_dirty(*pte))
 			continue;
 		pfn = pte_pfn(*pte);
-		if (!pfn_valid(pfn))
+		if (unlikely(!pfn_valid(pfn))) {
+			invalid_pfn(*pte, vma->vm_flags, addr);
 			continue;
+		}
 		page = pfn_to_page(pfn);
-		if (PageReserved(page))
-			continue;
 
 		if (ptep_clear_flush_dirty(vma, addr, pte) ||
 		    page_test_and_clear_dirty(page))
@@ -149,6 +149,9 @@ static int msync_interval(struct vm_area
 	if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
 		return -EBUSY;
 
+	if (vma->vm_flags & VM_RESERVED)
+		return -EINVAL;
+
 	if (file && (vma->vm_flags & VM_SHARED)) {
 		filemap_sync(vma, addr, end);
 
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -442,8 +442,6 @@ int page_referenced(struct page *page, i
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
 {
-	BUG_ON(PageReserved(page));
-
 	inc_mm_counter(vma->vm_mm, anon_rss);
 
 	if (atomic_inc_and_test(&page->_mapcount)) {
@@ -469,8 +467,7 @@ void page_add_anon_rmap(struct page *pag
 void page_add_file_rmap(struct page *page)
 {
 	BUG_ON(PageAnon(page));
-	if (!pfn_valid(page_to_pfn(page)) || PageReserved(page))
-		return;
+	BUG_ON(!pfn_valid(page_to_pfn(page)));
 
 	if (atomic_inc_and_test(&page->_mapcount))
 		inc_page_state(nr_mapped);
@@ -484,8 +481,6 @@ void page_add_file_rmap(struct page *pag
  */
 void page_remove_rmap(struct page *page)
 {
-	BUG_ON(PageReserved(page));
-
 	if (atomic_add_negative(-1, &page->_mapcount)) {
 		BUG_ON(page_mapcount(page) < 0);
 		/*
@@ -643,13 +638,13 @@ static void try_to_unmap_cluster(unsigne
 			continue;
 
 		pfn = pte_pfn(*pte);
-		if (!pfn_valid(pfn))
+		if (unlikely(!pfn_valid(pfn))) {
+			invalid_pfn(*pte, vma->vm_flags, address);
 			continue;
+		}
 
 		page = pfn_to_page(pfn);
 		BUG_ON(PageAnon(page));
-		if (PageReserved(page))
-			continue;
 
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;
@@ -812,7 +807,6 @@ int try_to_unmap(struct page *page)
 {
 	int ret;
 
-	BUG_ON(PageReserved(page));
 	BUG_ON(!PageLocked(page));
 
 	if (PageAnon(page))
Index: linux-2.6/drivers/scsi/sg.c
===================================================================
--- linux-2.6.orig/drivers/scsi/sg.c
+++ linux-2.6/drivers/scsi/sg.c
@@ -1887,13 +1887,17 @@ st_unmap_user_pages(struct scatterlist *
 	int i;
 
 	for (i=0; i < nr_pages; i++) {
-		if (dirtied && !PageReserved(sgl[i].page))
-			SetPageDirty(sgl[i].page);
-		/* unlock_page(sgl[i].page); */
+		struct page *page = sgl[i].page;
+
+		/* XXX: just for debug. Remove when PageReserved is removed */
+		BUG_ON(PageReserved(page));
+		if (dirtied)
+			SetPageDirty(page);
+		/* unlock_page(page); */
 		/* FIXME: cache flush missing for rw==READ
 		 * FIXME: call the correct reference counting function
 		 */
-		page_cache_release(sgl[i].page);
+		page_cache_release(page);
 	}
 
 	return 0;
Index: linux-2.6/drivers/scsi/st.c
===================================================================
--- linux-2.6.orig/drivers/scsi/st.c
+++ linux-2.6/drivers/scsi/st.c
@@ -4431,12 +4431,16 @@ static int sgl_unmap_user_pages(struct s
 	int i;
 
 	for (i=0; i < nr_pages; i++) {
-		if (dirtied && !PageReserved(sgl[i].page))
-			SetPageDirty(sgl[i].page);
+		struct page *page = sgl[i].page;
+
+		/* XXX: just for debug. Remove when PageReserved is removed */
+		BUG_ON(PageReserved(page));
+		if (dirtied)
+			SetPageDirty(page);
 		/* FIXME: cache flush missing for rw==READ
 		 * FIXME: call the correct reference counting function
 		 */
-		page_cache_release(sgl[i].page);
+		page_cache_release(page);
 	}
 
 	return 0;
Index: linux-2.6/sound/core/pcm_native.c
===================================================================
--- linux-2.6.orig/sound/core/pcm_native.c
+++ linux-2.6/sound/core/pcm_native.c
@@ -2944,8 +2944,7 @@ static struct page * snd_pcm_mmap_status
 		return NOPAGE_OOM;
 	runtime = substream->runtime;
 	page = virt_to_page(runtime->status);
-	if (!PageReserved(page))
-		get_page(page);
+	get_page(page);
 	if (type)
 		*type = VM_FAULT_MINOR;
 	return page;
@@ -2987,8 +2986,7 @@ static struct page * snd_pcm_mmap_contro
 		return NOPAGE_OOM;
 	runtime = substream->runtime;
 	page = virt_to_page(runtime->control);
-	if (!PageReserved(page))
-		get_page(page);
+	get_page(page);
 	if (type)
 		*type = VM_FAULT_MINOR;
 	return page;
@@ -3061,8 +3059,7 @@ static struct page *snd_pcm_mmap_data_no
 		vaddr = runtime->dma_area + offset;
 		page = virt_to_page(vaddr);
 	}
-	if (!PageReserved(page))
-		get_page(page);
+	get_page(page);
 	if (type)
 		*type = VM_FAULT_MINOR;
 	return page;
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1077,6 +1077,17 @@ munmap_back:
 		error = file->f_op->mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
+		if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED))
+						== (VM_WRITE | VM_RESERVED)) {
+			printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
+				"PROT_WRITE mmap of VM_RESERVED memory, which "
+				"is deprecated. Please report this to "
+				"linux-kernel@vger.kernel.org\n",current->comm);
+			if (vma->vm_ops && vma->vm_ops->close)
+				vma->vm_ops->close(vma);
+			error = -EACCES;
+			goto unmap_and_free_vma;
+		}
 	} else if (vm_flags & VM_SHARED) {
 		error = shmem_zero_setup(vma);
 		if (error)
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c
+++ linux-2.6/mm/mprotect.c
@@ -131,6 +131,14 @@ mprotect_fixup(struct vm_area_struct *vm
 				return -ENOMEM;
 			newflags |= VM_ACCOUNT;
 		}
+		if (oldflags & VM_RESERVED) {
+			BUG_ON(oldflags & VM_WRITE);
+			printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
+				"PROT_WRITE mprotect of VM_RESERVED memory, "
+				"which is deprecated. Please report this to "
+				"linux-kernel@vger.kernel.org\n",current->comm);
+			return -EACCES;
+		}
 	}
 
 	newprot = protection_map[newflags & 0xf];
Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -297,6 +297,7 @@ static unsigned long __init free_all_boo
 				if (j + 16 < BITS_PER_LONG)
 					prefetchw(page + j + 16);
 				__ClearPageReserved(page + j);
+				set_page_count(page + j, 0);
 			}
 			__free_pages(page, order);
 			i += BITS_PER_LONG;
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c
+++ linux-2.6/mm/mempolicy.c
@@ -253,8 +253,10 @@ static int check_pte_range(struct mm_str
 		if (!pte_present(*pte))
 			continue;
 		pfn = pte_pfn(*pte);
-		if (!pfn_valid(pfn))
+		if (unlikely(!pfn_valid(pfn))) {
+			invalid_pfn(*pte, -1UL, addr);
 			continue;
+		}
 		nid = pfn_to_nid(pfn);
 		if (!test_bit(nid, nodes))
 			break;
@@ -326,6 +328,8 @@ check_range(struct mm_struct *mm, unsign
 	first = find_vma(mm, start);
 	if (!first)
 		return ERR_PTR(-EFAULT);
+	if (first->vm_flags & VM_RESERVED)
+		return ERR_PTR(-EACCES);
 	prev = NULL;
 	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
 		if (!vma->vm_next && vma->vm_end < end)
Index: linux-2.6/arch/ppc64/kernel/vdso.c
===================================================================
--- linux-2.6.orig/arch/ppc64/kernel/vdso.c
+++ linux-2.6/arch/ppc64/kernel/vdso.c
@@ -176,13 +176,13 @@ static struct page * vdso_vma_nopage(str
 		return NOPAGE_SIGBUS;
 
 	/*
-	 * Last page is systemcfg, special handling here, no get_page() a
-	 * this is a reserved page
+	 * Last page is systemcfg.
 	 */
 	if ((vma->vm_end - address) <= PAGE_SIZE)
-		return virt_to_page(systemcfg);
+		pg = virt_to_page(systemcfg);
+	else
+		pg = virt_to_page(vbase + offset);
 
-	pg = virt_to_page(vbase + offset);
 	get_page(pg);
 	DBG(" ->page count: %d\n", page_count(pg));
 
@@ -600,6 +600,8 @@ void __init vdso_init(void)
 		ClearPageReserved(pg);
 		get_page(pg);
 	}
+
+	get_page(virt_to_page(systemcfg));
 }
 
 int in_gate_area_no_task(unsigned long addr)
Index: linux-2.6/kernel/power/swsusp.c
===================================================================
--- linux-2.6.orig/kernel/power/swsusp.c
+++ linux-2.6/kernel/power/swsusp.c
@@ -434,15 +434,23 @@ static int save_highmem_zone(struct zone
 			continue;
 		page = pfn_to_page(pfn);
 		/*
-		 * This condition results from rvmalloc() sans vmalloc_32()
-		 * and architectural memory reservations. This should be
-		 * corrected eventually when the cases giving rise to this
-		 * are better understood.
+		 * PageReserved results from rvmalloc() sans vmalloc_32()
+		 * and architectural memory reservations.
+		 *
+		 * rvmalloc should not cause this, because all implementations
+		 * appear to always be using vmalloc_32 on architectures with
+		 * highmem. This is a good thing, because we would like to save
+		 * rvmalloc pages.
+		 *
+		 * It appears to be triggered by pages which do not point to
+		 * valid memory (see arch/i386/mm/init.c:one_highpage_init(),
+		 * which sets PageReserved if the page does not point to valid
+		 * RAM).
+		 *
+		 * XXX: must remove usage of PageReserved!
 		 */
-		if (PageReserved(page)) {
-			printk("highmem reserved page?!\n");
+		if (PageReserved(page))
 			continue;
-		}
 		BUG_ON(PageNosave(page));
 		if (PageNosaveFree(page))
 			continue;
@@ -528,10 +536,9 @@ static int saveable(struct zone * zone, 
 		return 0;
 
 	page = pfn_to_page(pfn);
-	BUG_ON(PageReserved(page) && PageNosave(page));
 	if (PageNosave(page))
 		return 0;
-	if (PageReserved(page) && pfn_is_nosave(pfn)) {
+	if (pfn_is_nosave(pfn)) {
 		pr_debug("[nosave pfn 0x%lx]", pfn);
 		return 0;
 	}
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1523,7 +1523,8 @@ static void do_shmem_file_read(struct fi
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
 
-		page_cache_release(page);
+		if (page != ZERO_PAGE(0))
+			page_cache_release(page);
 		if (ret != nr || !desc->count)
 			break;
 


* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-07  3:28 [RFC][patch 0/2] mm: remove PageReserved Nick Piggin
  2005-08-07  3:29 ` [patch 1/2] mm: remap ZERO_PAGE mappings Nick Piggin
@ 2005-08-08 21:09 ` Daniel Phillips
  2005-08-08 21:24   ` Daniel Phillips
                     ` (2 more replies)
  1 sibling, 3 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-08 21:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

On Sunday 07 August 2005 13:28, Nick Piggin wrote:
> Hi,
>
> I'll be looking to send these off to Andrew after 2.6.14 opens,
> with the aim of having them merged by 2.6.15 hopefully.
>
> It doesn't look like they'll be able to easily free up a page
> flag for 2 reasons. First, PageReserved will probably be kept
> around for at least one release. Second, swsusp and some arch
> code (ioremap) want to know about struct pages that don't point
> to valid RAM - currently they use PageReserved, but we'll probably
> just introduce a PageValidRAM or something when PageReserved goes.
>
> I believe this makes memory management cleaner and easier to
> understand.

Agreed, I've always looked askance at that particular page flag.  (Suggestion 
for your next act: 

> My other reason behind this is that the lockless
> pagecache patches need it for sane page refcounting.
>
> If anyone has an issue with the patches or my merge plan, let's
> get some discussion going.

You forgot to mention what replaces PageReserved: the VM_RESERVED vma flag, 
which is now added to the whole zap_pte call chain.  A slight efficiency win?  
Anyway, it looks like forward progress because some inner loops are a little 
straighter.  I've always wondered what PG_reserved was actually doing, and 
now I know: compensating for the missing vma parameter in the zap call 
chains.

Why don't you pass the vma in zap_details?  For that matter, why are addr and 
end still passed down the zap chain when zap_details appears to duplicate 
that information?  OK, it is because zap_details is NULL in about twice as 
many places as it carries data.  But since the details parameter is already 
there, would it not make sense to press it into service to slim down those 
parameter lists a little?

What stops swsusp from also using the vma flag?  Why does swsusp need both 
PG_reserved and PG_nosave?

Is there automated testing planned for this one?  It looks right as closely as 
I've read, but it tickles an awful lot of code.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-08 21:09 ` [RFC][patch 0/2] mm: " Daniel Phillips
@ 2005-08-08 21:24   ` Daniel Phillips
  2005-08-08 21:54     ` Andrew Morton
  2005-08-10 13:13     ` [RFC][patch 0/2] mm: remove PageReserved David Howells
  2005-08-09  0:15   ` Nick Piggin
  2005-08-09  4:39   ` Nigel Cunningham
  2 siblings, 2 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-08 21:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

'Scuse me:

On Tuesday 09 August 2005 07:09, Daniel Phillips wrote:
> Suggestion for your next act:

...kill PG_checked please :)  Or at least keep it from spreading.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-08 21:24   ` Daniel Phillips
@ 2005-08-08 21:54     ` Andrew Morton
  2005-08-09 23:23       ` [RFC][PATCH] Rename PageChecked as PageMiscFS Daniel Phillips
                         ` (2 more replies)
  2005-08-10 13:13     ` [RFC][patch 0/2] mm: remove PageReserved David Howells
  1 sibling, 3 replies; 91+ messages in thread
From: Andrew Morton @ 2005-08-08 21:54 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: nickpiggin, linux-kernel, linux-mm, hugh, torvalds, andrea, benh

Daniel Phillips <phillips@arcor.de> wrote:
>
> 'Scuse me:
> 
> On Tuesday 09 August 2005 07:09, Daniel Phillips wrote:
> > Suggestion for your next act:
> 
> ...kill PG_checked please :)  Or at least keep it from spreading.
> 

It already spread - ext3 is using it and I think reiser4.  I thought I had
a patch to rename it to PG_misc1 or somesuch, but no.  Its mandate becomes
"filesystem-specific page flag".


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-08 21:09 ` [RFC][patch 0/2] mm: " Daniel Phillips
  2005-08-08 21:24   ` Daniel Phillips
@ 2005-08-09  0:15   ` Nick Piggin
  2005-08-09  8:51     ` Benjamin Herrenschmidt
  2005-08-09 19:14     ` Daniel Phillips
  2005-08-09  4:39   ` Nigel Cunningham
  2 siblings, 2 replies; 91+ messages in thread
From: Nick Piggin @ 2005-08-09  0:15 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: linux-kernel, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

Daniel Phillips wrote:
> On Sunday 07 August 2005 13:28, Nick Piggin wrote:

>>If anyone has an issue with the patches or my merge plan, let's
>>get some discussion going.
> 
> 
> You forgot to mention what replaces PageReserved: the VM_RESERVED vma flag, 
> which is now added to the whole zap_pte call chain.  A slight efficiency win?  
> Anyway, it looks like forward progress because some inner loops are a little 
> straighter.  I've always wondered what PG_reserved was actually doing, and 
> now I know: compensating for the missing vma parameter in the zap call 
> chains.
> 

Basically, it was doing a whole lot of vaguely related things. It
was set for ZERO_PAGE pages. It was (and still is) set for struct
pages that don't point to valid ram. Drivers set it, hoping it will
do something magical for them.

And yes, the VM_RESERVED flag is able to replace most usages.
Checking (pte_page(pte) == ZERO_PAGE(addr)) picks up others.

What we don't have is something to indicate the page does not point
to valid ram.
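
A minimal userspace sketch of the two checks described above, assuming hypothetical model_* names (none of them are kernel types or kernel API). It only models the decision of whether unmapping a page should drop a reference once the per-page reserved flag is gone:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not the kernel's definitions. */
#define MODEL_VM_RESERVED 0x1UL

struct model_page { int refcount; };
struct model_vma  { unsigned long flags; };

/* One shared page standing in for ZERO_PAGE. */
static struct model_page model_zero_page = { .refcount = 1 };

/* Returns true if unmapping this page should drop a reference. */
static bool model_zap_should_release(const struct model_vma *vma,
                                     const struct model_page *page)
{
    assert(vma != NULL && page != NULL);
    if (vma->flags & MODEL_VM_RESERVED)
        return false;   /* the mapping-level flag replaces the per-page flag */
    if (page == &model_zero_page)
        return false;   /* the zero page is recognised by identity */
    return true;
}

/* Unmap one page from a mapping, dropping a reference only when appropriate. */
static void model_zap_one(const struct model_vma *vma, struct model_page *page)
{
    if (model_zap_should_release(vma, page))
        page->refcount--;
}
```

In this model the refcount of a page mapped through a reserved vma, or of the zero page, is left alone on unmap, which is the same shape as the `page != ZERO_PAGE(0)` test in the shmem hunk of patch 1/2.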

> Why don't you pass the vma in zap_details?  For that matter, why are addr and 
> end still passed down the zap chain when zap_details appears to duplicate 
> that information?  OK, it is because zap_details is NULL in about twice as 
> many places as it carries data.  But since the details parameter is already 
> there, would it not make sense to press it into service to slim down those 
> parameter lists a little?
> 

Possibly. I initially did it that way, but it ended up fattening
paths that don't use details. And this way is less intrusive.

> What stops swsusp from also using the vma flag?  Why does swsusp need both 
> PG_reserved and PG_nosave?
> 

Because swsusp isn't looking through a process mapping.

swsusp also uses PG_nosave_free, believe it or not. Though I think
PG_nosave and PG_nosave_free can be consolidated quite easily.

> Is there automated testing planned for this one?  It looks right as closely as 
> I've read, but it tickles an awful lot of code.
> 

I haven't planned anything. I've tested it on machines here,
but I should probably do so a bit more heavily (ie. thrashing
swap, reclaiming pagecache, etc for hours).

Thanks for having a look!

Nick

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-08 21:09 ` [RFC][patch 0/2] mm: " Daniel Phillips
  2005-08-08 21:24   ` Daniel Phillips
  2005-08-09  0:15   ` Nick Piggin
@ 2005-08-09  4:39   ` Nigel Cunningham
  2005-08-09  4:59     ` Nick Piggin
  2 siblings, 1 reply; 91+ messages in thread
From: Nigel Cunningham @ 2005-08-09  4:39 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Nick Piggin, Linux Kernel Mailing List, Linux Memory Management,
	Hugh Dickins, Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

Hi.

On Tue, 2005-08-09 at 07:09, Daniel Phillips wrote:
> > It doesn't look like they'll be able to easily free up a page
> > flag for 2 reasons. First, PageReserved will probably be kept
> > around for at least one release. Second, swsusp and some arch
> > code (ioremap) wants to know about struct pages that don't point
> > to valid RAM - currently they use PageReserved, but we'll probably
> > just introduce a PageValidRAM or something when PageReserved goes.

Changing the e820 code so it sets PageNosave instead of PageReserved,
along with a couple of modifications in swsusp itself should get rid of
the swsusp dependency.

Regards,

Nigel
-- 
Evolution.
Enumerate the requirements.
Consider the interdependencies.
Calculate the probabilities.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  4:39   ` Nigel Cunningham
@ 2005-08-09  4:59     ` Nick Piggin
  2005-08-09  5:11       ` Nigel Cunningham
  2005-08-09  7:08       ` Russell King
  0 siblings, 2 replies; 91+ messages in thread
From: Nick Piggin @ 2005-08-09  4:59 UTC (permalink / raw)
  To: ncunningham
  Cc: Daniel Phillips, Linux Kernel Mailing List,
	Linux Memory Management, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli, Benjamin Herrenschmidt

Nigel Cunningham wrote:
> Hi.
> 
> On Tue, 2005-08-09 at 07:09, Daniel Phillips wrote:
> 
>>>It doesn't look like they'll be able to easily free up a page
>>>flag for 2 reasons. First, PageReserved will probably be kept
>>>around for at least one release. Second, swsusp and some arch
>>>code (ioremap) wants to know about struct pages that don't point
>>>to valid RAM - currently they use PageReserved, but we'll probably
>>>just introduce a PageValidRAM or something when PageReserved goes.
> 
> 
> Changing the e820 code so it sets PageNosave instead of PageReserved,
> along with a couple of modifications in swsusp itself should get rid of
> the swsusp dependency.
> 

That would work for swsusp, but there are other users that want to
know if a struct page is valid ram (eg. ioremap), so in that case
swsusp would not be able to mess with the flag.

I do think swsusp should (and can, quite easily though I may have
missed something) consolidate PG_nosave and PG_nosave_free, however
that's out of the scope of this patch.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  4:59     ` Nick Piggin
@ 2005-08-09  5:11       ` Nigel Cunningham
  2005-08-09  5:20         ` Nick Piggin
  2005-08-09  7:08       ` Russell King
  1 sibling, 1 reply; 91+ messages in thread
From: Nigel Cunningham @ 2005-08-09  5:11 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Daniel Phillips, Linux Kernel Mailing List,
	Linux Memory Management, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli, Benjamin Herrenschmidt

Hi Nick et al.

On Tue, 2005-08-09 at 14:59, Nick Piggin wrote:
> Nigel Cunningham wrote:
> > Hi.
> > 
> > On Tue, 2005-08-09 at 07:09, Daniel Phillips wrote:
> > 
> >>>It doesn't look like they'll be able to easily free up a page
> >>>flag for 2 reasons. First, PageReserved will probably be kept
> >>>around for at least one release. Second, swsusp and some arch
> >>>code (ioremap) wants to know about struct pages that don't point
> >>>to valid RAM - currently they use PageReserved, but we'll probably
> >>>just introduce a PageValidRAM or something when PageReserved goes.
> > 
> > 
> > Changing the e820 code so it sets PageNosave instead of PageReserved,
> > along with a couple of modifications in swsusp itself should get rid of
> > the swsusp dependency.
> > 
> 
> That would work for swsusp, but there are other users that want to
> know if a struct page is valid ram (eg. ioremap), so in that case
> swsusp would not be able to mess with the flag.

Um. Mess with which flag? I guess you mean Reserved. I was
imagining Reserved going away, so for the short term I'd mean making
the e820 code set both Nosave and Reserved for those pages (which is what
the Suspend2 patches do so as to play nicely with swsusp - I don't use
Reserved at all).

> I do think swsusp should (and can, quite easily though I may have
> missed something) consolidate PG_nosave and PG_nosave_free, however
> that's out of the scope of this patch.

I won't comment here. I don't stare at swsusp code that much :>

Nigel
-- 
Evolution.
Enumerate the requirements.
Consider the interdependencies.
Calculate the probabilities.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  5:11       ` Nigel Cunningham
@ 2005-08-09  5:20         ` Nick Piggin
  2005-08-09  5:30           ` Nigel Cunningham
  0 siblings, 1 reply; 91+ messages in thread
From: Nick Piggin @ 2005-08-09  5:20 UTC (permalink / raw)
  To: ncunningham
  Cc: Daniel Phillips, Linux Kernel Mailing List,
	Linux Memory Management, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli, Benjamin Herrenschmidt

Nigel Cunningham wrote:
> Hi Nick et al.
> 
> On Tue, 2005-08-09 at 14:59, Nick Piggin wrote:

>>>Changing the e820 code so it sets PageNosave instead of PageReserved,
>>>along with a couple of modifications in swsusp itself should get rid of
>>>the swsusp dependency.
>>>
>>
>>That would work for swsusp, but there are other users that want to
>>know if a struct page is valid ram (eg. ioremap), so in that case
>>swsusp would not be able to mess with the flag.
> 
> 
> Um. Mess with which flag? I guess you mean Reserved. I was saying that

Mess with PageNosave (if that is what we used to denote a struct page
not pointing to valid RAM).

Ie. when swsusp allocates its save map (or whatever it calls it), setting
PageNosave would make other parts of the kernel think the area is not
valid ram.

In other words - we can't combine swsusp's PageNosave with our mythical
PageValidRAM.

> imagining Reserved going away, so for the short term I'd be meaning making
> the e820 set both Nosave and Reserved for those pages (which is what the
> Suspend2 patches do so as to play nicely with swsusp - I don't use
> Reserved at all).
> 

In the short term, PageReserved can stay around - swsusp does still
work, in case I hadn't made that clear.

By the way - how does swsusp2 handle this problem if not using
PageReserved?


-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  5:20         ` Nick Piggin
@ 2005-08-09  5:30           ` Nigel Cunningham
  0 siblings, 0 replies; 91+ messages in thread
From: Nigel Cunningham @ 2005-08-09  5:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Daniel Phillips, Linux Kernel Mailing List,
	Linux Memory Management, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli, Benjamin Herrenschmidt

Hi.

On Tue, 2005-08-09 at 15:20, Nick Piggin wrote:
> Nigel Cunningham wrote:
> > Hi Nick et al.
> > 
> > On Tue, 2005-08-09 at 14:59, Nick Piggin wrote:
> 
> >>>Changing the e820 code so it sets PageNosave instead of PageReserved,
> >>>along with a couple of modifications in swsusp itself should get rid of
> >>>the swsusp dependency.
> >>>
> >>
> >>That would work for swsusp, but there are other users that want to
> >>know if a struct page is valid ram (eg. ioremap), so in that case
> >>swsusp would not be able to mess with the flag.
> > 
> > 
> > Um. Mess with which flag? I guess you mean Reserved. I was saying that
> 
> Mess with PageNosave (if that is what we used to denote a struct page
> not pointing to valid RAM).
> 
> Ie. when swsusp allocates its save map (or whatever it calls it), setting
> PageNosave would make other parts of the kernel think the area is not
> valid ram.

Ok. I guess I should have looked, but I thought the suspend
implementations were the only users of Nosave.

> In other words - we can't combine swsusp's PageNosave with our mythical
> PageValidRAM.
> 
> > imagining Reserved going away, so for the short term I'd be meaning making
> > the e820 set both Nosave and Reserved for those pages (which is what the
> > Suspend2 patches do so as to play nicely with swsusp - I don't use
> > Reserved at all).
> > 
> 
> In the short term, PageReserved can stay around - swsusp does still
> work if I hadn't made that clear.

Yeah.

> By the way - how does swsusp2 handle this problem if not using
> PageReserved?

At the places where Reserved is currently set and cleared, I set and
clear Nosave as well. Then I only use Nosave. Swsusp still works fine
(although I have to comment out a bogus bug_on()) and suspend2 can clear
and set additional pages' Nosave flags while it runs (equivalents to
Pavel's nosave_free pages and also pages allocated for checksumming when
I get paranoid).
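
A rough sketch of that mirroring scheme, assuming hypothetical model_* names and flag values (this is not the Suspend2 code): every place that sets or clears Reserved also sets or clears Nosave, and the suspend side only ever tests Nosave.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for illustration; not the kernel's PG_* values. */
#define MODEL_PG_RESERVED 0x1u
#define MODEL_PG_NOSAVE   0x2u

struct model_page { unsigned int flags; };

/* Wherever Reserved is set or cleared, Nosave is mirrored alongside it. */
static void model_set_reserved(struct model_page *p)
{
    assert(p != NULL);
    p->flags |= MODEL_PG_RESERVED | MODEL_PG_NOSAVE;
}

static void model_clear_reserved(struct model_page *p)
{
    assert(p != NULL);
    p->flags &= ~(MODEL_PG_RESERVED | MODEL_PG_NOSAVE);
}

/* The suspend code may additionally mark pages of its own as Nosave. */
static void model_set_nosave(struct model_page *p)
{
    assert(p != NULL);
    p->flags |= MODEL_PG_NOSAVE;
}

/* The suspend side consults only Nosave, never Reserved. */
static bool model_should_save(const struct model_page *p)
{
    assert(p != NULL);
    return !(p->flags & MODEL_PG_NOSAVE);
}
```

The point of the design is that the image writer depends on a single flag, so Reserved could later disappear without touching the suspend logic.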

Regards,

Nigel
-- 
Evolution.
Enumerate the requirements.
Consider the interdependencies.
Calculate the probabilities.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  4:59     ` Nick Piggin
  2005-08-09  5:11       ` Nigel Cunningham
@ 2005-08-09  7:08       ` Russell King
  2005-08-09  8:38         ` Arjan van de Ven
                           ` (4 more replies)
  1 sibling, 5 replies; 91+ messages in thread
From: Russell King @ 2005-08-09  7:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: ncunningham, Daniel Phillips, Linux Kernel Mailing List,
	Linux Memory Management, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli, Benjamin Herrenschmidt

On Tue, Aug 09, 2005 at 02:59:53PM +1000, Nick Piggin wrote:
> That would work for swsusp, but there are other users that want to
> know if a struct page is valid ram (eg. ioremap), so in that case
> swsusp would not be able to mess with the flag.

The usage of "valid ram" here is confusing - that's not what PageReserved
is all about.  It's about valid RAM which is managed by method other
than the usual page counting.  Non-reserved RAM is also valid RAM, but
is managed by the kernel in the usual way.

The former is available for remap_pfn_range and ioremap, the latter is
not.

On the other hand, the validity of an apparent RAM address can only be
tested using its pfn with pfn_valid().

Can we straighten out the terminology so it's less confusing please?

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  7:08       ` Russell King
@ 2005-08-09  8:38         ` Arjan van de Ven
  2005-08-09  9:31           ` Nick Piggin
  2005-08-09  8:53         ` Benjamin Herrenschmidt
                           ` (3 subsequent siblings)
  4 siblings, 1 reply; 91+ messages in thread
From: Arjan van de Ven @ 2005-08-09  8:38 UTC (permalink / raw)
  To: Russell King
  Cc: Nick Piggin, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

On Tue, 2005-08-09 at 08:08 +0100, Russell King wrote:
> On Tue, Aug 09, 2005 at 02:59:53PM +1000, Nick Piggin wrote:
> > That would work for swsusp, but there are other users that want to
> > know if a struct page is valid ram (eg. ioremap), so in that case
> > swsusp would not be able to mess with the flag.
> 
> The usage of "valid ram" here is confusing - that's not what PageReserved
> is all about.  It's about valid RAM which is managed by method other
> than the usual page counting.  Non-reserved RAM is also valid RAM, but
> is managed by the kernel in the usual way.
> 
> The former is available for remap_pfn_range and ioremap, the latter is
> not.
> 
> On the other hand, the validity of an apparent RAM address can only be
> tested using its pfn with pfn_valid().
> 
> Can we straighten out the terminology so it's less confusing please?
> 

and..... can we make a general page_is_ram() function that does what it
says? on x86 it can go via the e820 table, other architectures can do
whatever they need....
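
A minimal userspace sketch of what such a lookup could look like, assuming a simplified firmware-style region table. The model_* names, the region layout and the 4 KiB page size are assumptions for illustration, not the kernel's e820 structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12   /* assumed 4 KiB pages */

enum model_region_type { MODEL_REGION_RAM = 1, MODEL_REGION_RESERVED = 2 };

/* One entry of a hypothetical e820-like memory map. */
struct model_region {
    uint64_t start;               /* first byte of the region */
    uint64_t size;                /* length in bytes */
    enum model_region_type type;
};

/* Returns true only if the whole page frame lies inside a RAM region. */
static bool model_page_is_ram(const struct model_region *map, size_t nr,
                              uint64_t pfn)
{
    uint64_t page_start = pfn << MODEL_PAGE_SHIFT;
    uint64_t page_end = page_start + (UINT64_C(1) << MODEL_PAGE_SHIFT);

    assert(map != NULL || nr == 0);
    for (size_t i = 0; i < nr; i++) {
        uint64_t region_end = map[i].start + map[i].size;

        if (map[i].type != MODEL_REGION_RAM)
            continue;
        if (page_start >= map[i].start && page_end <= region_end)
            return true;
    }
    return false;
}
```

A linear walk like this is slower than testing a page flag, but it needs no bit in struct page, which is the trade-off raised in the replies.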


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  0:15   ` Nick Piggin
@ 2005-08-09  8:51     ` Benjamin Herrenschmidt
  2005-08-09  9:49       ` Nick Piggin
  2005-08-09 11:25       ` Hugh Dickins
  2005-08-09 19:14     ` Daniel Phillips
  1 sibling, 2 replies; 91+ messages in thread
From: Benjamin Herrenschmidt @ 2005-08-09  8:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Daniel Phillips, linux-kernel, Linux Memory Management,
	Hugh Dickins, Linus Torvalds, Andrew Morton, Andrea Arcangeli

> Basically, it was doing a whole lot of vaguely related things. It
> was set for ZERO_PAGE pages. It was (and still is) set for struct
> pages that don't point to valid ram. Drivers set it, hoping it will
> do something magical for them.
> 
> And yes, the VM_RESERVED flag is able to replace most usages.
> Checking (pte_page(pte) == ZERO_PAGE(addr)) picks up others.
> 
> What we don't have is something to indicate the page does not point
> to valid ram.

I have no problem keeping PG_reserved for that, and _ONLY_ for that.
(though I'd rather see it renamed then). I'm just afraid by doing so,
some drivers will jump in the gap and abuse it again... Also, we should
make sure we kill the "trick" of refcounting only in one direction.
Either we refcount both (but do nothing, or maybe just BUG_ON if the
page is "reserved" -> not valid RAM), or we don't refcount at all.

For things like Cell, we'll really end up needing struct page covering
the SPUs for example. That is not valid RAM, shouldn't be refcounted,
but we need to be able to have nopage() returning these etc...


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  7:08       ` Russell King
  2005-08-09  8:38         ` Arjan van de Ven
@ 2005-08-09  8:53         ` Benjamin Herrenschmidt
  2005-08-09  9:15         ` Hugh Dickins
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 91+ messages in thread
From: Benjamin Herrenschmidt @ 2005-08-09  8:53 UTC (permalink / raw)
  To: Russell King
  Cc: Nick Piggin, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli

> Can we straighten out the terminology so it's less confusing please?

Well, RAM that isn't managed by standard page counting could be
considered some sort of weird MMIO :) 

Ben.



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  7:08       ` Russell King
  2005-08-09  8:38         ` Arjan van de Ven
  2005-08-09  8:53         ` Benjamin Herrenschmidt
@ 2005-08-09  9:15         ` Hugh Dickins
  2005-08-09 10:27           ` Nick Piggin
  2005-08-09 19:49           ` Roman Zippel
  2005-08-09  9:29         ` Nick Piggin
  2005-08-09 14:38         ` Martin J. Bligh
  4 siblings, 2 replies; 91+ messages in thread
From: Hugh Dickins @ 2005-08-09  9:15 UTC (permalink / raw)
  To: Russell King
  Cc: Nick Piggin, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

On Tue, 9 Aug 2005, Russell King wrote:
> On Tue, Aug 09, 2005 at 02:59:53PM +1000, Nick Piggin wrote:
> > That would work for swsusp, but there are other users that want to
> > know if a struct page is valid ram (eg. ioremap), so in that case
> > swsusp would not be able to mess with the flag.
> 
> The usage of "valid ram" here is confusing - that's not what PageReserved
> is all about.  It's about valid RAM which is managed by method other
> than the usual page counting.

You're right (though I imagine there might sometimes be holes rather than RAM).

PageReserved is about those pages which are managed by PageReserved.
But quite what it means is unclear, one of the reasons to eliminate it.
(Why is kernel text PageReserved?)

> Non-reserved RAM is also valid RAM, but
> is managed by the kernel in the usual way.
> 
> The former is available for remap_pfn_range and ioremap, the latter is
> not.

And the caller of remap_pfn_range (and occasionally ioremap?) uses
SetPageReserved to move pages from the latter to the former category,
so that they will work successfully on it.

Seems very silly to me.  A little key we give the caller,
so the caller can reassure us "I know what I'm doing".

I think Nick is treating the "use" of PageReserved in ioremap much too
reverentially.  Fine to leave its removal from there to a later stage,
but why shouldn't that also be removed?

With or without PageReserved, driver writers should be careful to apply
ioremap to the areas they intend.  And when they do get it wrong (setting
a window on the wrong range of RAM), the new VM_RESERVED handling makes
sure that at least those wrong pages won't be freed when unmapped.

Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  7:08       ` Russell King
                           ` (2 preceding siblings ...)
  2005-08-09  9:15         ` Hugh Dickins
@ 2005-08-09  9:29         ` Nick Piggin
  2005-08-09 19:40           ` Russell King
  2005-08-09 14:38         ` Martin J. Bligh
  4 siblings, 1 reply; 91+ messages in thread
From: Nick Piggin @ 2005-08-09  9:29 UTC (permalink / raw)
  To: Russell King
  Cc: ncunningham, Daniel Phillips, Linux Kernel Mailing List,
	Linux Memory Management, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli, Benjamin Herrenschmidt

Russell King wrote:
> On Tue, Aug 09, 2005 at 02:59:53PM +1000, Nick Piggin wrote:
> 
>>That would work for swsusp, but there are other users that want to
>>know if a struct page is valid ram (eg. ioremap), so in that case
>>swsusp would not be able to mess with the flag.
> 
> 
> The usage of "valid ram" here is confusing - that's not what PageReserved
> is all about.  It's about valid RAM which is managed by method other
> than the usual page counting.  Non-reserved RAM is also valid RAM, but
> is managed by the kernel in the usual way.
> 

Well that is one usage of the PageReserved flag. That one tends
to be easily covered by VM_RESERVED (ie. it is no longer used that
way after the patches).

The remaining problem is, in fact, these "other" uses of PageReserved.
One usage definitely appears to be "is this page valid RAM?".

> The former is available for remap_pfn_range and ioremap, the latter is
> not.
> 

I thought ioremap was attempting to avoid remapping physical
RAM with that check. All the drivers I have looked at that allocate
physical memory and then SetPageReserved the pages use remap_pfn_range,
but I admit I have not looked at a huge number of them.

> On the other hand, the validity of an apparent RAM address can only be
> tested using its pfn with pfn_valid().
> 

I'm fairly sure that's not the case on i386 at least. I think
pfn_valid will be true if the pfn points to a struct page.
See arch/i386/mm/init.c:one_highpage_init()

> Can we straighten out the terminology so it's less confusing please?
> 

That's what I'm aiming for. I admit it is confusing and I haven't
looked at all drivers or architectures, so if I'm missing a vital
clue then I wouldn't be too surprised ;)

Nick

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  8:38         ` Arjan van de Ven
@ 2005-08-09  9:31           ` Nick Piggin
  2005-08-09  9:49             ` Arjan van de Ven
  2005-08-09 10:24             ` Rafael J. Wysocki
  0 siblings, 2 replies; 91+ messages in thread
From: Nick Piggin @ 2005-08-09  9:31 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Russell King, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

Arjan van de Ven wrote:
> On Tue, 2005-08-09 at 08:08 +0100, Russell King wrote:

>>Can we straighten out the terminology so it's less confusing please?
>>
> 
> 
> and..... can we make a general page_is_ram() function that does what it
> says? on x86 it can go via the e820 table, other architectures can do
> whatever they need....
> 

That would be very helpful. That should cover the remaining (ab)users
of PageReserved.

It would probably be fastest to implement this with a page flag,
however if swsusp and ioremap are the only users then it shouldn't
be a problem to go through slower lookups (and this would remove the
need for the PageValidRAM flag that I had worried about earlier).

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  8:51     ` Benjamin Herrenschmidt
@ 2005-08-09  9:49       ` Nick Piggin
  2005-08-09 19:19         ` Daniel Phillips
  2005-08-09 19:22         ` Daniel Phillips
  2005-08-09 11:25       ` Hugh Dickins
  1 sibling, 2 replies; 91+ messages in thread
From: Nick Piggin @ 2005-08-09  9:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Daniel Phillips, linux-kernel, Linux Memory Management,
	Hugh Dickins, Linus Torvalds, Andrew Morton, Andrea Arcangeli

Benjamin Herrenschmidt wrote:

> 
> I have no problem keeping PG_reserved for that, and _ONLY_ for that.
> (though I'd rather see it renamed then). I'm just afraid by doing so,
> some drivers will jump in the gap and abuse it again...

Sure it would be renamed (better yet may be a slower page_is_valid()
that doesn't need to use a flag).

There is always the possibility for driver abuse, I guess... however
as it is now, the tree is basically past the critical mass of self
perpetuation (i.e. cut-n-paste). So getting rid of that should certainly
help get things cleaner.

> Also, we should
> make sure we kill the "trick" of refcounting only in one direction.
> Either we refcount both (but do nothing, or maybe just BUG_ON if the
> page is "reserved" -> not valid RAM), or we don't refcount at all.
> 

Yep, that's done. Actually having a BUG_ON PageReserved in the refcount
functions isn't a bad idea for the initial merge, and should help allay
my fears that I might have introduced refcount leaks on PageReserved
pages.
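
A minimal user-space sketch of that idea, assuming a much simplified
struct page. The names mirror the kernel's get_page()/put_page(), but the
struct layout and the BUG_ON() stand-in are illustrative only and are not
taken from the posted patches:

```c
#include <assert.h>
#include <stdbool.h>

/* User-space stand-in for the kernel's BUG_ON(): abort if cond is true. */
#define BUG_ON(cond) assert(!(cond))

/* Simplified model of struct page: one flag and a reference count. */
struct page {
	bool reserved;	/* stands in for PG_reserved */
	int count;	/* stands in for the page reference count */
};

/* Take a reference; reserved pages must never get here. */
void get_page(struct page *page)
{
	BUG_ON(page->reserved);
	page->count++;
}

/* Drop a reference; returns true when the caller dropped the last one. */
bool put_page(struct page *page)
{
	BUG_ON(page->reserved);
	BUG_ON(page->count <= 0);
	return --page->count == 0;
}
```

With both directions checked, a stray get or put on a reserved page trips
immediately instead of showing up later as a leak.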

> For things like Cell, We'll really end up needing struct page covering
> the SPUs for example. That is not valid RAM, shouldn't be refcounted,
> but we need to be able to have nopage() returning these etc...
>

In that case, remap_pfn_range should take care of it for you by
setting the VM_RESERVED flag on the vma.

Swsusp is the main "is valid ram" user I have in mind here. It
wants to know whether or not it should save and restore the
memory of a given `struct page`.

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  9:31           ` Nick Piggin
@ 2005-08-09  9:49             ` Arjan van de Ven
  2005-08-09  9:57               ` Nick Piggin
  2005-08-09 10:24             ` Rafael J. Wysocki
  1 sibling, 1 reply; 91+ messages in thread
From: Arjan van de Ven @ 2005-08-09  9:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Russell King, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

On Tue, 2005-08-09 at 19:31 +1000, Nick Piggin wrote:
> Arjan van de Ven wrote:
> > On Tue, 2005-08-09 at 08:08 +0100, Russell King wrote:
> 
> >>Can we straighten out the terminology so it's less confusing please?
> >>
> > 
> > 
> > and..... can we make a general page_is_ram() function that does what it
> > says? on x86 it can go via the e820 table, other architectures can do
> > whatever they need....
> > 
> 
> That would be very helpful. That should cover the remaining (ab)users
> of PageReserved.
> 
> It would probably be fastest to implement this with a page flag,
> however if swsusp and ioremap are the only users then it shouldn't
> be a problem to go through slower lookups (and this would remove the
> need for the PageValidRAM flag that I had worried about earlier).

if you want I have implementations of this for x86, x86_64 and iirc ia64
(not 100% sure about the latter). None of these use a page flag, but use
the same information the kernel uses during bootup to find ram.
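
A rough sketch of what such a flag-free lookup could look like, assuming a
hypothetical boot-time region table loosely modelled on an e820-style map.
The table contents, struct mem_region and the region types are made up for
illustration and are not the implementations mentioned above:

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12

/* Hypothetical region types, loosely modelled on firmware memory maps. */
enum region_type { REGION_RAM = 1, REGION_RESERVED = 2 };

struct mem_region {
	unsigned long long start;	/* first byte of the region */
	unsigned long long size;	/* length in bytes */
	enum region_type type;
};

/* Example boot-time map: 640K of RAM, a hole, then 127MB of RAM. */
static const struct mem_region boot_map[] = {
	{ 0x00000000ULL, 0x000a0000ULL, REGION_RAM },
	{ 0x000a0000ULL, 0x00060000ULL, REGION_RESERVED },
	{ 0x00100000ULL, 0x07f00000ULL, REGION_RAM },
};

/* Return true only if the whole page lies inside a RAM region. */
bool page_is_ram(unsigned long pfn)
{
	unsigned long long addr = (unsigned long long)pfn << PAGE_SHIFT;
	unsigned long long end = addr + (1ULL << PAGE_SHIFT);
	size_t i;

	for (i = 0; i < sizeof(boot_map) / sizeof(boot_map[0]); i++) {
		const struct mem_region *r = &boot_map[i];

		if (r->type != REGION_RAM)
			continue;
		if (addr >= r->start && end <= r->start + r->size)
			return true;
	}
	return false;
}
```

A linear walk like this is slower than testing a page flag, which is
presumably acceptable if only swsusp and ioremap call it.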



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  9:49             ` Arjan van de Ven
@ 2005-08-09  9:57               ` Nick Piggin
  0 siblings, 0 replies; 91+ messages in thread
From: Nick Piggin @ 2005-08-09  9:57 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Russell King, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

Arjan van de Ven wrote:
> On Tue, 2005-08-09 at 19:31 +1000, Nick Piggin wrote:
> 
>>Arjan van de Ven wrote:
>>

>>>and..... can we make a general page_is_ram() function that does what it
>>>says? on x86 it can go via the e820 table, other architectures can do
>>>whatever they need....
>>>
>>
>>That would be very helpful. That should cover the remaining (ab)users
>>of PageReserved.
>>
>>It would probably be fastest to implement this with a page flag,
>>however if swsusp and ioremap are the only users then it shouldn't
>>be a problem to go through slower lookups (and this would remove the
>>need for the PageValidRAM flag that I had worried about earlier).
> 
> 
> if you want I have implementations of this for x86, x86_64 and iirc ia64
> (not 100% sure about the latter). None of these use a page flag, but use
> the same information the kernel uses during bootup to find ram.
> 

It seems like a good idea to me, if the arch guys are up for it.
If you have a copy of the patch handy, sure send it over.

Thanks
Nick

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  9:31           ` Nick Piggin
  2005-08-09  9:49             ` Arjan van de Ven
@ 2005-08-09 10:24             ` Rafael J. Wysocki
  1 sibling, 0 replies; 91+ messages in thread
From: Rafael J. Wysocki @ 2005-08-09 10:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Arjan van de Ven, Russell King, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt, Pavel Machek

On Tuesday, 9 of August 2005 11:31, Nick Piggin wrote:
> Arjan van de Ven wrote:
> > On Tue, 2005-08-09 at 08:08 +0100, Russell King wrote:
> 
> >>Can we straighten out the terminology so it's less confusing please?
> >>
> > 
> > 
> > and..... can we make a general page_is_ram() function that does what it
> > says? on x86 it can go via the e820 table, other architectures can do
> > whatever they need....
> > 
> 
> That would be very helpful. That should cover the remaining (ab)users
> of PageReserved.
> 
> It would probably be fastest to implement this with a page flag,
> however if swsusp and ioremap are the only users then it shouldn't
> be a problem to go through slower lookups (and this would remove the
> need for the PageValidRAM flag that I had worried about earlier).

I think swsusp can be modified to use PageNosave only and everything
that is not to be touched by swsusp should be marked as no-save.
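
Roughly, and only as a sketch: the snapshot code would then need nothing
but the one flag. The struct and helper below are hypothetical stand-ins,
not swsusp code, and they assume that arch or boot code has already marked
holes and similar areas as no-save:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified page descriptor: only the one flag the snapshot code reads. */
struct page {
	bool nosave;	/* stands in for PG_nosave */
};

/*
 * Count the pages a snapshot would have to save: everything that has not
 * been marked no-save. Holes, firmware areas and so on are assumed to have
 * been marked by arch or boot code beforehand.
 */
size_t count_saveable_pages(const struct page *map, size_t nr_pages)
{
	size_t i, n = 0;

	for (i = 0; i < nr_pages; i++)
		if (!map[i].nosave)
			n++;
	return n;
}
```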

Greets,
Rafael


-- 
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
		-- Lewis Carroll "Alice's Adventures in Wonderland"

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  9:15         ` Hugh Dickins
@ 2005-08-09 10:27           ` Nick Piggin
  2005-08-09 11:15             ` Hugh Dickins
  2005-08-09 19:49           ` Roman Zippel
  1 sibling, 1 reply; 91+ messages in thread
From: Nick Piggin @ 2005-08-09 10:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Russell King, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

Hugh Dickins wrote:

> 
> You're right (though I imagine might sometimes be holes rather than RAM).
> 

Yep. These holes are what I have in mind, and random other things
like the !(bad_ppro && page_kills_ppro(pfn)) check.

[...]

> I think Nick is treating the "use" of PageReserved in ioremap much too
> reverentially.  Fine to leave its removal from there to a later stage,
> but why shouldn't that also be removed?
> 

Well, as far as I had been able to gather, ioremap is trying to
ensure it does indeed only hit one of these holes, and not valid
RAM. I thought the fact that it *won't* bail out when encountering
kernel text or remap_pfn_range'ed pages was only due to PG_reserved
being the proverbial jack of all trades, master of none.

I could be wrong here though.

But in either case: I agree that it is probably not a great loss
to remove the check, although considering it will be needed for
swsusp anyway...

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 10:27           ` Nick Piggin
@ 2005-08-09 11:15             ` Hugh Dickins
  2005-08-09 13:15               ` Nick Piggin
  2005-08-09 14:28               ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 91+ messages in thread
From: Hugh Dickins @ 2005-08-09 11:15 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Russell King, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

On Tue, 9 Aug 2005, Nick Piggin wrote:
> Hugh Dickins wrote:
> > I think Nick is treating the "use" of PageReserved in ioremap much too
> > reverentially.  Fine to leave its removal from there to a later stage,
> > but why shouldn't that also be removed?
> 
> Well, as far as I had been able to gather, ioremap is trying to
> ensure it does indeed only hit one of these holes, and not valid
> RAM.

Who can tell?  rmk's mail suggests it should work on some valid RAM.

ioremap is making a similar check to the one remap_pfn_range used
to make; but I see no good reason for it at all.  ioremap should be
allowed to map whatever the caller asked, just as memset is allowed
to set whatever the caller asked.  It's up to the caller to get it
right, not for the function to demand the added reassurance of some
mysterious page flag being set.

(But in what I said earlier about VM_RESERVED making sure the wrong pages
are not freed, I was confused and confusing ioremap with remap_pfn_range.)

> I thought the fact that it *won't* bail out when encountering
> kernel text or remap_pfn_range'ed pages was only due to PG_reserved
> being the proverbial jack of all trades, master of none.
> 
> I could be wrong here though.
> 
> But in either case: I agree that it is probably not a great loss
> to remove the check, although considering it will be needed for
> swsusp anyway...

swsusp (and I think crashdump has a similar need) is a very different
case: it's approaching memory from the zone/mem_map end, with no(?) idea
of how the different pages are used: needs to save all the info while
avoiding those areas which would give trouble.  I can well imagine it
needs either a page flag or a table lookup to decide that.

But ioremap and remap_pfn_range are coming from drivers which (we hope)
know what they're mapping these particular areas for.  If it's provable
that the meaning which swsusp needs is equally usable for a little sanity
check in ioremap, okay, but I'm sceptical.

Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  8:51     ` Benjamin Herrenschmidt
  2005-08-09  9:49       ` Nick Piggin
@ 2005-08-09 11:25       ` Hugh Dickins
  2005-08-09 14:31         ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 91+ messages in thread
From: Hugh Dickins @ 2005-08-09 11:25 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Nick Piggin, Daniel Phillips, linux-kernel,
	Linux Memory Management, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli

On Tue, 9 Aug 2005, Benjamin Herrenschmidt wrote:
> > 
> > What we don't have is something to indicate the page does not point
> > to valid ram.
> 
> I have no problem keeping PG_reserved for that, and _ONLY_ for that.

Yes, if a table won't suffice.

> (though i'd rather see it renamed then).

Definitely.

> I'm just afraid by doing so,
> some drivers will jump in the gap and abuse it again...

I don't think that was abuse, it was just playing by the silly rules
remap_pfn_range and ioremap demanded.

> Also, we should
> make sure we kill the "trick" of refcounting only in one direction.

Very hard to find anyone to disagree with you on that!

> Either we refcount both (but do nothing, or maybe just BUG_ON if the
> page is "reserved" -> not valid RAM), or we don't refcount at all.

We do what's most efficient for the core.  Which I think is refcount
both ways regardless, since these "page"s are exceptional, and the
majority really do need refcounting.

> For things like Cell, We'll really end up needing struct page covering
> the SPUs for example. That is not valid RAM, shouldn't be refcounted,

But you don't mind if they are refcounted, do you?
Just so long as they start out from 1 so never get freed.

> but we need to be able to have nopage() returning these etc...

You'll actually be needing nopage() on them?  That idea has come up
before, it's not out of the question (though I think wli suggested
we ought rather to change the nopage interface if so), but it's a
different topic from the current removal of PageReserved anyway.

Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 11:15             ` Hugh Dickins
@ 2005-08-09 13:15               ` Nick Piggin
  2005-08-09 13:26                 ` Arjan van de Ven
  2005-08-09 14:28               ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 91+ messages in thread
From: Nick Piggin @ 2005-08-09 13:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Russell King, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

Hugh Dickins wrote:
> On Tue, 9 Aug 2005, Nick Piggin wrote:

>>But in either case: I agree that it is probably not a great loss
>>to remove the check, although considering it will be needed for
>>swsusp anyway...
> 
> 
> swsusp (and I think crashdump has a similar need) is a very different
> case: it's approaching memory from the zone/mem_map end, with no(?) idea
> of how the different pages are used: needs to save all the info while
> avoiding those areas which would give trouble.  I can well imagine it
> needs either a page flag or a table lookup to decide that.
> 

Yep.

> But ioremap and remap_pfn_range are coming from drivers which (we hope)
> know what they're mapping these particular areas for.  If it's provable
> that the meaning which swsusp needs is equally usable for a little sanity
> check in ioremap, okay, but I'm sceptical.
> 

I understand what you mean, and I agree. Though, as far away from the
business end of the drivers as I am, I tend to get the feeling that
drivers need the most hand holding.

Anyway, I guess the way to understand the problem is finding the
reason why ioremap checks PageReserved, and whether or not ioremap
should be expected (or allowed) to remap physical RAM in use by
the kernel.

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 13:15               ` Nick Piggin
@ 2005-08-09 13:26                 ` Arjan van de Ven
  0 siblings, 0 replies; 91+ messages in thread
From: Arjan van de Ven @ 2005-08-09 13:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Russell King, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

On Tue, 2005-08-09 at 23:15 +1000, Nick Piggin wrote:

> I understand what you mean, and I agree. Though as far away from the
> business end of the drivers I am, I tend to get the feeling that
> drivers need the most hand holding.

they do. It's important to make driver APIs as fool proof as possible.

> 
> Anyway, I guess the way to understand the problem is finding the
> reason why ioremap checks PageReserved, and whether or not ioremap
> should be expected (or allowed) to remap physical RAM in use by
> the kernel.

I can't think of ANY valid reason for that, in fact, it'll break a lot
due to cache aliases etc etc, on various cpus if not even on x86



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 11:15             ` Hugh Dickins
  2005-08-09 13:15               ` Nick Piggin
@ 2005-08-09 14:28               ` Benjamin Herrenschmidt
  2005-08-09 14:47                 ` Hugh Dickins
  1 sibling, 1 reply; 91+ messages in thread
From: Benjamin Herrenschmidt @ 2005-08-09 14:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Russell King, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli

> Who can tell?  rmk's mail suggests it should work on some valid RAM.

Not really. If I understand Russell here, that RAM has been "put aside"
for use by fancy stuff and is de-facto out of control of the normal page
allocator and refcounting. In this case, I see no reason why it couldn't
be considered as MMIO and ioremap'able :)

> ioremap is making a similar check to the one remap_pfn_range used
> to make; but I see no good reason for it at all.  ioremap should be
> allowed to map whatever the caller asked, just as memset is allowed
> to set whatever the caller asked.

This is dodgy actually. memset can't be guaranteed to work on IOs or
other non-cacheable memory (including real RAM that has been mapped
non-cacheable, typically RAM that has been "set aside" for other uses as
described above, whether it's for AGP, or for some weird processor DMA
bounce buffers or whatever ..., that is RAM that is out of the normal
kernel control).

>   It's up to the caller to get it
> right, not for the function to demand the added reassurance of some
> mysterious page flag being set.
> 
> (But in what I said earlier about VM_RESERVE making sure wrong pages
> not freed, I was confused and confusing ioremap with remap_pfn_range.)
> 
> > I thought the fact that it *won't* bail out when encountering
> > kernel text or remap_pfn_range'ed pages was only due to PG_reserved
> > being the proverbial jack of all trades, master of none.
> > 
> > I could be wrong here though.
> > 
> > But in either case: I agree that it is probably not a great loss
> > to remove the check, although considering it will be needed for
> > swsusp anyway...
> 
> swsusp (and I think crashdump has a similar need) is a very different
> case: it's approaching memory from the zone/mem_map end, with no(?) idea
> of how the different pages are used: needs to save all the info while
> avoiding those areas which would give trouble.  I can well imagine it
> needs either a page flag or a table lookup to decide that.
> 
> But ioremap and remap_pfn_range are coming from drivers which (we hope)
> know what they're mapping these particular areas for.  If it's provable
> that the meaning which swsusp needs is equally usable for a little sanity
> check in ioremap, okay, but I'm sceptical.
> 
> Hugh


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 11:25       ` Hugh Dickins
@ 2005-08-09 14:31         ` Benjamin Herrenschmidt
  2005-08-09 14:50           ` Hugh Dickins
  0 siblings, 1 reply; 91+ messages in thread
From: Benjamin Herrenschmidt @ 2005-08-09 14:31 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Daniel Phillips, linux-kernel,
	Linux Memory Management, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli

> We do what's most efficient for the core.  Which I think is refcount
> both ways regardless, since these "page"s are exceptional, and the
> majority really do need refcounting.

Well, refcounting _might_ be useful for some usage of these, but we
simply must make sure that those pages are never returned back to the
pool when the refcount reaches 0, that's it.

> But you don't mind if they are refcounted, do you?
> Just so long as they start out from 1 so never get freed.

Well, a refcounting bug would let them be freed and kaboom ... That's
why a "PG_not_your_ram_dammit" bit would be useful. It could at least
BUG_ON when refcount reaches 0 :)

> You'll actually be needing nopage() on them? 

Yes.

> That idea has come up
> before, it's not out of the question (though I think wli suggested
> we ought rather to change the nopage interface if so), but it's a
> different topic from the current removal of PageReserved anyway.

It is a different topic indeed. Wli's proposal would be useful for us
here, but in the meantime, we can just create struct pages and rely on
sparsemem to have a not-too-horrible mem_map :)

Ben.



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  7:08       ` Russell King
                           ` (3 preceding siblings ...)
  2005-08-09  9:29         ` Nick Piggin
@ 2005-08-09 14:38         ` Martin J. Bligh
  2005-08-09 19:41           ` Russell King
  4 siblings, 1 reply; 91+ messages in thread
From: Martin J. Bligh @ 2005-08-09 14:38 UTC (permalink / raw)
  To: Russell King, Nick Piggin
  Cc: ncunningham, Daniel Phillips, Linux Kernel Mailing List,
	Linux Memory Management, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli, Benjamin Herrenschmidt

--Russell King <rmk+lkml@arm.linux.org.uk> wrote (on Tuesday, August 09, 2005 08:08:53 +0100):

> On Tue, Aug 09, 2005 at 02:59:53PM +1000, Nick Piggin wrote:
>> That would work for swsusp, but there are other users that want to
>> know if a struct page is valid ram (eg. ioremap), so in that case
>> swsusp would not be able to mess with the flag.
> 
> The usage of "valid ram" here is confusing - that's not what PageReserved
> is all about.  It's about valid RAM which is managed by method other
> than the usual page counting.  Non-reserved RAM is also valid RAM, but
> is managed by the kernel in the usual way.
> 
> The former is available for remap_pfn_range and ioremap, the latter is
> not.
> 
> On the other hand, the validity of an apparent RAM address can only be
> tested using its pfn with pfn_valid().
> 
> Can we straighten out the terminology so it's less confusing please?

pfn_valid() doesn't tell you it's RAM or not - it tells you whether you
have a backing struct page for that address. Could be an IO mapped device,
a small memory hole, whatever.
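
To make the distinction concrete, here is a toy model, assuming a flat
mem_map with one hole in it. The model_* names and the pfn constants are
invented for illustration and do not correspond to any real architecture:

```c
#include <stdbool.h>

/* Hypothetical flat layout: mem_map covers pfns [0, MAX_PFN). */
#define MAX_PFN		0x8000UL
/* Hypothetical hole (e.g. legacy I/O space) inside that range. */
#define HOLE_START_PFN	0xa0UL
#define HOLE_END_PFN	0x100UL

/* True if a struct page exists for this pfn; says nothing about RAM. */
bool model_pfn_valid(unsigned long pfn)
{
	return pfn < MAX_PFN;
}

/* True only if the pfn is covered by mem_map and is not in the hole. */
bool model_pfn_is_ram(unsigned long pfn)
{
	if (!model_pfn_valid(pfn))
		return false;
	return pfn < HOLE_START_PFN || pfn >= HOLE_END_PFN;
}
```

A pfn inside the hole passes the first test and fails the second, which is
exactly the case a "valid RAM" check has to catch.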

M.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 14:28               ` Benjamin Herrenschmidt
@ 2005-08-09 14:47                 ` Hugh Dickins
  0 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2005-08-09 14:47 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Nick Piggin, Russell King, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli

On Tue, 9 Aug 2005, Benjamin Herrenschmidt wrote:
> 
> > ioremap is making a similar check to the one remap_pfn_range used
> > to make; but I see no good reason for it at all.  ioremap should be
> > allowed to map whatever the caller asked, just as memset is allowed
> > to set whatever the caller asked.
> 
> This is dodgy actually. memset can't be guaranteed to work on IOs or
> other non-cacheable memory (including real RAM that has been mapped
> non-cacheable, typically RAM that has been "set aside" for other uses as
> > described above, whether it's for AGP, or for some weird processor DMA
> bounce buffers or whatever ..., that is RAM that is out of the normal
> kernel control).

That was my point: memset goes ahead without making funny little checks,
and works or not, so I don't see why ioremap needs to make these funny
little checks.  If the driver doesn't know what it's doing (not impossible,
I accept), what's the likelihood that PageReserved or not will save it?

Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 14:50           ` Hugh Dickins
@ 2005-08-09 14:49             ` Benjamin Herrenschmidt
  2005-08-09 15:36               ` Hugh Dickins
  0 siblings, 1 reply; 91+ messages in thread
From: Benjamin Herrenschmidt @ 2005-08-09 14:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Daniel Phillips, linux-kernel,
	Linux Memory Management, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli

On Tue, 2005-08-09 at 15:50 +0100, Hugh Dickins wrote:
> On Tue, 9 Aug 2005, Benjamin Herrenschmidt wrote:
> > 
> > > But you don't mind if they are refcounted, do you?
> > > Just so long as they start out from 1 so never get freed.
> > 
> > Well, a refcounting bug would let them be freed and kaboom ... That's
> > why a "PG_not_your_ram_dammit" bit would be useful. It could at least
> > BUG_ON when refcount reaches 0 :)
> 
> Okay, great, let's give every struct page two refcounts,
> so if one of them goes wrong, the other one will save us.

You are abusing here :)

 - We already have a refcount
 - We have a field where putting a flag isn't that much of a problem
 - It can be difficult to get page refcounting right when dealing with
   such things, really.

In that case, we basically have an _easy_ way to trigger a useful BUG()
in the page free path when it's a page that should never be returned to
the pool.

Since the "PG_not_in_ram" or whatever we call it flag might be used by
swsusp or others, I suppose it could be useful.

However, I agree that if the end result is to have drivers just change
"PG_reserved" to "PG_not_in_ram" and still be bogus, then we might just
go all the way & drop the flag completely, only relying on the VMA
flags.

Ben.



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 14:31         ` Benjamin Herrenschmidt
@ 2005-08-09 14:50           ` Hugh Dickins
  2005-08-09 14:49             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 91+ messages in thread
From: Hugh Dickins @ 2005-08-09 14:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Nick Piggin, Daniel Phillips, linux-kernel,
	Linux Memory Management, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli

On Tue, 9 Aug 2005, Benjamin Herrenschmidt wrote:
> 
> > But you don't mind if they are refcounted, do you?
> > Just so long as they start out from 1 so never get freed.
> 
> Well, a refcounting bug would let them be freed and kaboom ... That's
> why a "PG_not_your_ram_dammit" bit would be useful. It could at least
> BUG_ON when refcount reaches 0 :)

Okay, great, let's give every struct page two refcounts,
so if one of them goes wrong, the other one will save us.

Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 14:49             ` Benjamin Herrenschmidt
@ 2005-08-09 15:36               ` Hugh Dickins
  2005-08-09 21:27                 ` Daniel Phillips
  0 siblings, 1 reply; 91+ messages in thread
From: Hugh Dickins @ 2005-08-09 15:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Nick Piggin, Daniel Phillips, linux-kernel,
	Linux Memory Management, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli

On Tue, 9 Aug 2005, Benjamin Herrenschmidt wrote:
> On Tue, 2005-08-09 at 15:50 +0100, Hugh Dickins wrote:
> > On Tue, 9 Aug 2005, Benjamin Herrenschmidt wrote:
> > > 
> > > Well, a refcounting bug would let them be freed and kaboom ... That's
> > > why a "PG_not_your_ram_dammit" bit would be useful. It could at least
> > > BUG_ON when refcount reaches 0 :)
> > 
> > Okay, great, let's give every struct page two refcounts,
> > so if one of them goes wrong, the other one will save us.
> 
> You are abusing here :)

Yeah, I was: sorry!

>  - We already have a refcount
>  - We have a field where putting a flag isn't that much of a problem
>  - It can be difficult to get page refcounting right when dealing with
>    such things, really.

Probably easier to get the page refcounting right with these than with
most.  Getting refcounting wrong is always bad.

> In that case, we basically have an _easy_ way to trigger a useful BUG()
> in the page free path when it's a page that should never be returned to
> the pool.

As bad_page already does on various other flags (though it clears those,
whereas this one you'd prefer not to clear).   Hmm, okay, though I'm not
sure it's worth its own page flag if they're in short supply.

> Since the "PG_not_in_ram" or whatever we call it flag might be used by
> swsusp or others, I suppose it could be useful.

Any flag used elsewhere, which is incompatible with being freed, should
be checked for no cost in free_pages_ok/prep_new_page/bad_page, yes.
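
As a sketch of why that is free, assuming made-up flag bits (the real PG_*
numbering and the actual checks in the free path differ): the free path
already tests a mask of flags in one go, so one more bit in the mask adds
no extra branch.

```c
#include <stdbool.h>

/* Hypothetical flag bits; the real PG_* numbering differs. */
#define PG_LOCKED	(1UL << 0)
#define PG_DIRTY	(1UL << 1)
#define PG_NOT_RAM	(1UL << 2)	/* the proposed "never free me" bit */

/* Every flag that must be clear on a page entering the allocator. */
#define FLAGS_BAD_AT_FREE	(PG_LOCKED | PG_DIRTY | PG_NOT_RAM)

struct page {
	unsigned long flags;
	int count;
};

/*
 * One masked test covers all the flags, so adding PG_NOT_RAM to the mask
 * does not add a branch to the free path.
 */
bool page_ok_to_free(const struct page *page)
{
	return page->count == 0 && !(page->flags & FLAGS_BAD_AT_FREE);
}
```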

> However, I agree that if the end result is to have drivers just change
> "PG_reserved" to "PG_not_in_ram" and still be bogus, then we might just
> go all the way & drop the flag completely, only relying on the VMA
> flags.

Yes: if any driver ever has to manipulate it,
either the flag is misdefined or the driver is beyond the pale.

Hugh
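
A minimal user-space sketch of the free-path check discussed above, assuming
a single mask that collects every flag incompatible with being freed. All
names here (the sk_ prefix, SK_PG_NOT_RAM, the bit values) are hypothetical
and are not the kernel's definitions; the point is only that one mask test
covers any number of such flags at no extra cost per flag.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; not the kernel's real numbering. */
enum {
	SK_PG_LOCKED  = 1u << 0,
	SK_PG_DIRTY   = 1u << 1,
	SK_PG_NOT_RAM = 1u << 2,	/* stand-in for the proposed "not your RAM" bit */
};

/* Every flag that must never be set on a page that is being freed. */
#define SK_FLAGS_BAD_AT_FREE	(SK_PG_LOCKED | SK_PG_DIRTY | SK_PG_NOT_RAM)

struct sk_page {
	unsigned int flags;
	int refcount;
};

/*
 * Returns true if the page may go back to the pool.  A single AND against
 * the mask checks all forbidden flags at once, so adding one more flag to
 * the mask does not add another branch to the free path.
 */
bool sk_page_ok_to_free(const struct sk_page *page)
{
	return page->refcount == 0 &&
	       (page->flags & SK_FLAGS_BAD_AT_FREE) == 0;
}
```

A real free path would report the bad page rather than return a boolean;
the return value here just keeps the sketch easy to exercise.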

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  0:15   ` Nick Piggin
  2005-08-09  8:51     ` Benjamin Herrenschmidt
@ 2005-08-09 19:14     ` Daniel Phillips
  2005-08-09 20:17       ` Hugh Dickins
  1 sibling, 1 reply; 91+ messages in thread
From: Daniel Phillips @ 2005-08-09 19:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

On Tuesday 09 August 2005 10:15, Nick Piggin wrote:
> Daniel Phillips wrote:
> > Why don't you pass the vma in zap_details?  For that matter, why are addr
> > and end still passed down the zap chain when zap_details appears to
> > duplicate that information?  OK, it is because zap_details is NULL in
> > about twice as many places as it carries data.  But since the details
> > parameter is already there, would it not make sense to press it into
> > service to slim down those parameter lists a little?
>
> Possibly. I initially did it that way, but it ended up fattening
> paths that don't use details.

It should not, it only affects, hmm, less than 10 places, each at the 
beginning of a massive call chain, e.g., in madvise_dontneed:

-	zap_page_range(vma, start, end - start, NULL);
+	zap_page_range(start, end - start, &(struct zap){ .vma = vma });

> And this way is less intrusive.

Nearly the same I think, and makes forward progress in controlling this 
middle-aged belly roll of an internal API.

Regards,

Daniel
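
The calling convention proposed above relies on a C99 compound literal with
a designated initializer. A minimal stand-alone sketch follows, assuming
made-up types (sk_vma, sk_zap) and a made-up "reserved" bit; it is not the
kernel's zap_page_range(), only an illustration of the idiom.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel structures carry many more fields. */
struct sk_vma {
	unsigned long vm_flags;
};

struct sk_zap {
	struct sk_vma *vma;		/* always supplied in this variant */
	unsigned long first_index;	/* optional detail, zero when unused */
	unsigned long last_index;	/* optional detail, zero when unused */
};

/* Callee: reads the vma out of the struct instead of a separate argument. */
unsigned long sk_zap_range(unsigned long start, unsigned long size,
			   const struct sk_zap *zap)
{
	(void)start;			/* unused in this toy version */
	if (zap->vma->vm_flags & 1ul)	/* bit 0 plays the "reserved" flag here */
		return 0;
	return size;			/* number of bytes "zapped" */
}

/*
 * Caller: the compound literal builds the struct in place.  Members that
 * are not named in the initializer are zero-initialized.
 */
unsigned long sk_madvise_dontneed(struct sk_vma *vma,
				  unsigned long start, unsigned long end)
{
	return sk_zap_range(start, end - start, &(struct sk_zap){ .vma = vma });
}
```

The cost side of the trade-off is visible too: callers that used to pass
NULL now have to build a struct, and the callee can no longer treat a NULL
pointer as "no details".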

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  9:49       ` Nick Piggin
@ 2005-08-09 19:19         ` Daniel Phillips
  2005-08-09 19:22         ` Daniel Phillips
  1 sibling, 0 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-09 19:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Benjamin Herrenschmidt, linux-kernel, Linux Memory Management,
	Hugh Dickins, Linus Torvalds, Andrew Morton, Andrea Arcangeli

On Tuesday 09 August 2005 19:49, Nick Piggin wrote:
> Benjamin Herrenschmidt wrote:
> > I have no problem keeping PG_reserved for that, and _ONLY_ for that.
> > (though i'd rather see it renamed then). I'm just afraid by doing so,
> > some drivers will jump in the gap and abuse it again...
>
> Sure it would be renamed (better yet may be a slower page_is_valid()
> that doesn't need to use a flag).

Right!  This is the correct time to wrap all remaining users (that use the 
newly-mandated valid page sense) in an inline or macro.  And this patch set 
should change the flag name, because it quietly changes the rules.  I think 
you need a 3/3 that drops the other shoe.

Regards,

Daniel
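
A minimal sketch of the wrapping suggested above, assuming the remaining
meaning of the flag is "this struct page does not describe valid RAM". The
names sk_page_is_valid_ram() and sk_swsusp_should_save() are hypothetical;
bit number 10 is taken from the page-flags.h hunk quoted later in this
thread. Once callers go through the wrapper, the flag can be renamed, or
replaced by a slower lookup, without touching them.

```c
#include <stdbool.h>

#define SK_PG_RESERVED	10	/* bit number as in the page-flags.h hunk in this thread */

struct sk_page {
	unsigned long flags;
};

/* Callers ask the question they mean; the flag stays an implementation detail. */
bool sk_page_is_valid_ram(const struct sk_page *page)
{
	return (page->flags & (1ul << SK_PG_RESERVED)) == 0;
}

/* Example user: a suspend path deciding whether to save a page's contents. */
bool sk_swsusp_should_save(const struct sk_page *page)
{
	return sk_page_is_valid_ram(page);
}
```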

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  9:49       ` Nick Piggin
  2005-08-09 19:19         ` Daniel Phillips
@ 2005-08-09 19:22         ` Daniel Phillips
  2005-08-10 21:50           ` Pavel Machek
  1 sibling, 1 reply; 91+ messages in thread
From: Daniel Phillips @ 2005-08-09 19:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Benjamin Herrenschmidt, linux-kernel, Linux Memory Management,
	Hugh Dickins, Linus Torvalds, Andrew Morton, Andrea Arcangeli

On Tuesday 09 August 2005 19:49, Nick Piggin wrote:
> Swsusp is the main "is valid ram" user I have in mind here. It
> wants to know whether or not it should save and restore the
> memory of a given `struct page`.

Why can't it follow the rmap chain?

Regards,

Daniel

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  9:29         ` Nick Piggin
@ 2005-08-09 19:40           ` Russell King
  0 siblings, 0 replies; 91+ messages in thread
From: Russell King @ 2005-08-09 19:40 UTC (permalink / raw)
  To: Nick Piggin
  Cc: ncunningham, Daniel Phillips, Linux Kernel Mailing List,
	Linux Memory Management, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli, Benjamin Herrenschmidt

On Tue, Aug 09, 2005 at 07:29:30PM +1000, Nick Piggin wrote:
> Russell King wrote:
> > The usage of "valid ram" here is confusing - that's not what PageReserved
> > is all about.  It's about valid RAM which is managed by method other
> > than the usual page counting.  Non-reserved RAM is also valid RAM, but
> > is managed by the kernel in the usual way.
> 
> Well that is one usage of the PageReserved flag. That one tends
> to be easily covered by VM_RESERVED (ie. it is no longer used that
> way after the patches).
> 
> The remaining problem is, in fact, these "other" uses of PageReserved.
> One usage definitely appears to be "is this page valid RAM?".

Hmm, that sounds like an architecture specific extension above the
basic requirements.

> > The former is available for remap_pfn_range and ioremap, the latter is
> > not.
> 
> I thought ioremap was attempting to avoid remapping physical
> RAM with that check. All drivers I have looked at which allocate
> physical memory then SetPageReserved the pages use remap_pfn_range
> but I admit that's not a huge number (that I have looked at).

They do this because:

1. they want to control when this RAM is freed.
2. remap_pfn_range refuses to map RAM that isn't marked reserved.

To put it another way, they fiddle with the reserved bit because
that's what the current interfaces force upon them.  I would
dearly like that to go away though.

> > On the other hand, the validity of an apparent RAM address can only be
> > tested using its pfn with pfn_valid().
> 
> I'm fairly sure that's not the case on i386 at least. I think
> pfn_valid will be true if the pfn points to a struct page.
> See arch/i386/mm/init.c:one_highpage_init()

This sounds like i386 is doing something which is a superset of the
base requirements, which is an architecture specific extension.  No
problem with that, but that's i386 folk's problem. 8)

Ok, but I still disagree with replacing something called reserved
with something which leads one to believe that it's intended for
checking whether a struct page is "valid" RAM or not when there's
other interfaces which are supposed to be used for that.

I wonder if we can optimise out the useless "valid" RAM checks
on architectures which don't require this insanity...

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 14:38         ` Martin J. Bligh
@ 2005-08-09 19:41           ` Russell King
  2005-08-09 20:51             ` Linus Torvalds
                               ` (2 more replies)
  0 siblings, 3 replies; 91+ messages in thread
From: Russell King @ 2005-08-09 19:41 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Nick Piggin, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

On Tue, Aug 09, 2005 at 07:38:52AM -0700, Martin J. Bligh wrote:
> pfn_valid() doesn't tell you it's RAM or not - it tells you whether you
> have a backing struct page for that address. Could be an IO mapped device,
> a small memory hole, whatever.

The only things which have a struct page is RAM.  Nothing else does.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09  9:15         ` Hugh Dickins
  2005-08-09 10:27           ` Nick Piggin
@ 2005-08-09 19:49           ` Roman Zippel
  1 sibling, 0 replies; 91+ messages in thread
From: Roman Zippel @ 2005-08-09 19:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Russell King, Nick Piggin, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

Hi,

On Tue, 9 Aug 2005, Hugh Dickins wrote:

> PageReserved is about those pages which are managed by PageReserved.
> But quite what it means is unclear, one of the reasons to eliminate it.
> (Why is kernel text PageReserved?)

Short answer: /dev/mem
(Amazing that nobody mentioned it so far...)

To understand PageReserved it probably helps to know its history. One of 
the first users of this was X so it can map the video memory. This flag 
was the only way to distinguish which pages can be mapped this way, as 
remap_page_range() has no idea who owns the page. The vm also used this 
flag to skip these pages.

Later this was also used by drivers to map pages into user space using 
remap_page_range() (as alternative to the nopage handler). Only later came 
VM_RESERVED, IIRC it was first an optimization to avoid scanning these 
mappings.

So the actual meaning of this flag is that if it's set the page structure 
must not be changed and all fields must be treated as reserved. Only the 
owner of the page who did set this flag can do whatever he wants with it.

bye, Roman

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 19:14     ` Daniel Phillips
@ 2005-08-09 20:17       ` Hugh Dickins
  2005-08-09 20:52         ` Daniel Phillips
  0 siblings, 1 reply; 91+ messages in thread
From: Hugh Dickins @ 2005-08-09 20:17 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Nick Piggin, linux-kernel, Linux Memory Management,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

On Wed, 10 Aug 2005, Daniel Phillips wrote:
> On Tuesday 09 August 2005 10:15, Nick Piggin wrote:
> > Daniel Phillips wrote:
> > > Why don't you pass the vma in zap_details?
> >
> > Possibly. I initially did it that way, but it ended up fattening
> > paths that don't use details.
> 
> It should not, it only affects, hmm, less than 10 places, each at the 
> beginning of a massive call chain, e.g., in madvise_dontneed:
> 
> -	zap_page_range(vma, start, end - start, NULL);
> +	zap_page_range(start, end - start, &(struct zap){ .vma = vma });
> 
> > And this way is less intrusive.
> 
> Nearly the same I think, and makes forward progress in controlling this 
> middle-aged belly roll of an internal API.

I much prefer how Nick has it, with vma passed separately as a regular
argument.  details is for packaging up some details only required in
unlikely circumstances, normally it's NULL and not filled in at all.

You can argue that (vma->vm_flags & VM_RESERVED) is precisely that
kind of detail.  But personally I find it rather odd that vma isn't
an explicit argument to zap_pte_range already - I find it very useful
when trying to shed light on the rmap.c:BUG, for example.

There might be a case for packaging repeated arguments into structures
(though several of these levels are inlined anyway), but that's some
other exercise entirely, shouldn't get in the way of removing Reserved.

Hugh

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 19:41           ` Russell King
@ 2005-08-09 20:51             ` Linus Torvalds
  2005-08-09 21:16             ` Martin J. Bligh
  2005-08-10  9:27             ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 91+ messages in thread
From: Linus Torvalds @ 2005-08-09 20:51 UTC (permalink / raw)
  To: Russell King
  Cc: Martin J. Bligh, Nick Piggin, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management, Hugh Dickins,
	Andrew Morton, Andrea Arcangeli, Benjamin Herrenschmidt


On Tue, 9 Aug 2005, Russell King wrote:

> On Tue, Aug 09, 2005 at 07:38:52AM -0700, Martin J. Bligh wrote:
> > pfn_valid() doesn't tell you it's RAM or not - it tells you whether you
> > have a backing struct page for that address. Could be an IO mapped device,
> > a small memory hole, whatever.
> 
> The only things which have a struct page is RAM.  Nothing else does.

That's not true.

We have "struct page" show up for the ISA legacy MMIO region too, for 
example.

		Linus

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 20:17       ` Hugh Dickins
@ 2005-08-09 20:52         ` Daniel Phillips
  0 siblings, 0 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-09 20:52 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, linux-kernel, Linux Memory Management,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

On Wednesday 10 August 2005 06:17, Hugh Dickins wrote:
> There might be a case for packaging repeated arguments into structures
> (though several of these levels are inlined anyway), but that's some
> other exercise entirely, shouldn't get in the way of removing Reserved.

Agreed, an entirely separate question that I'd like to return to in time.  The 
existing herd of page table walkers is unnecessarily repetitious.

Regards,

Daniel

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 19:41           ` Russell King
  2005-08-09 20:51             ` Linus Torvalds
@ 2005-08-09 21:16             ` Martin J. Bligh
  2005-08-09 21:51               ` Martin J. Bligh
  2005-08-10  9:27             ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 91+ messages in thread
From: Martin J. Bligh @ 2005-08-09 21:16 UTC (permalink / raw)
  To: Russell King
  Cc: Nick Piggin, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt


--On Tuesday, August 09, 2005 20:41:00 +0100 Russell King <rmk+lkml@arm.linux.org.uk> wrote:

> On Tue, Aug 09, 2005 at 07:38:52AM -0700, Martin J. Bligh wrote:
>> pfn_valid() doesn't tell you it's RAM or not - it tells you whether you
>> have a backing struct page for that address. Could be an IO mapped device,
>> a small memory hole, whatever.
> 
> The only things which have a struct page is RAM.  Nothing else does.

That's not true at all. Every physical address covered by the machine
that we may need to access, plus every small hole we didn't use 
discontigmem to exclude has a backing struct page. See e820 maps.

Unless you're speaking only with respect to ARM, in which case, I'll
bow to your knowledge, but it's certainly not true in general ...

M.


* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 15:36               ` Hugh Dickins
@ 2005-08-09 21:27                 ` Daniel Phillips
  0 siblings, 0 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-09 21:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Benjamin Herrenschmidt, Nick Piggin, linux-kernel,
	Linux Memory Management, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli

On Wednesday 10 August 2005 01:36, Hugh Dickins wrote:
> On Tue, 9 Aug 2005, Benjamin Herrenschmidt wrote:
> >  - We already have a refcount
> >  - We have a field where putting a flag isn't that much of a problem
> >  - It can be difficult to get page refcounting right when dealing with
> >    such things, really.
>
> Probably easier to get the page refcounting right with these than with
> most.  Getting refcounting wrong is always bad.

He seems to be arguing for a new debug option.

> > In that case, we basically have an _easy_ way to trigger a useful BUG()
> > in the page free path when it's a page that should never be returned to
> > the pool.
>
> As bad_page already does on various other flags (though it clears those,
> whereas this one you'd prefer not to clear).   Hmm, okay, though I'm not
> sure it's worth its own page flag if they're in short supply.

Nineteen out of 32 officially spoken for so far, with some out of tree patches 
regarding the remainder with desirous eyes no doubt.  I think that qualifies 
as short supply.  But it is not just that, it is the extra cost of 
understanding and auditing the features implied by the flags, particularly 
bogus features.

Regards,

Daniel

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 21:16             ` Martin J. Bligh
@ 2005-08-09 21:51               ` Martin J. Bligh
  0 siblings, 0 replies; 91+ messages in thread
From: Martin J. Bligh @ 2005-08-09 21:51 UTC (permalink / raw)
  To: Russell King
  Cc: Nick Piggin, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Benjamin Herrenschmidt

>> On Tue, Aug 09, 2005 at 07:38:52AM -0700, Martin J. Bligh wrote:
>>> pfn_valid() doesn't tell you it's RAM or not - it tells you whether you
>>> have a backing struct page for that address. Could be an IO mapped device,
>>> a small memory hole, whatever.
>> 
>> The only things which have a struct page is RAM.  Nothing else does.
> 
> That's not true at all. Every physical address covered by the machine
> that we may need to access, plus every small hole we didn't use 
> discontigmem to exclude has a backing struct page. See e820 maps.

OK, on second thoughts, that's not quite true. Not every phys address
will (e.g. a PCI window), but it's certainly not just RAM pages.

M.
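
A toy model of the distinction argued in this sub-thread: "has a struct
page" and "is RAM" are separate questions. The memory map below is invented
for illustration (4K pages, with the 640K-1M ISA legacy window at pfns
0xa0-0xff); sk_pfn_valid() and sk_pfn_is_ram() are hypothetical helpers, not
the kernel's pfn_valid() or any arch code.

```c
#include <stdbool.h>
#include <stddef.h>

enum sk_region_type { SK_RAM, SK_LEGACY_MMIO, SK_HOLE };

struct sk_region {
	unsigned long start_pfn;	/* inclusive */
	unsigned long end_pfn;		/* exclusive */
	enum sk_region_type type;
};

/* Invented map: every pfn below SK_MAX_PFN has a backing struct page. */
static const struct sk_region sk_map[] = {
	{ 0x000, 0x0a0,  SK_RAM },
	{ 0x0a0, 0x100,  SK_LEGACY_MMIO },	/* ISA window, 640K-1M */
	{ 0x100, 0x800,  SK_RAM },
	{ 0x800, 0x810,  SK_HOLE },		/* small hole not excluded from the map */
	{ 0x810, 0x1000, SK_RAM },
};

#define SK_MAX_PFN	0x1000ul

/* "Is there a struct page for this pfn?"  Says nothing about what backs it. */
bool sk_pfn_valid(unsigned long pfn)
{
	return pfn < SK_MAX_PFN;
}

/* "Is this pfn ordinary RAM?"  Needs the map, not just the range check. */
bool sk_pfn_is_ram(unsigned long pfn)
{
	for (size_t i = 0; i < sizeof(sk_map) / sizeof(sk_map[0]); i++) {
		if (pfn >= sk_map[i].start_pfn && pfn < sk_map[i].end_pfn)
			return sk_map[i].type == SK_RAM;
	}
	return false;
}
```

In this model pfn 0xb8 is valid but not RAM, while a pfn above SK_MAX_PFN
(say, in a PCI window) is neither.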


* [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-08 21:54     ` Andrew Morton
@ 2005-08-09 23:23       ` Daniel Phillips
  2005-08-10  7:48         ` Hugh Dickins
  2005-08-10 22:12       ` Daniel Phillips
  2005-08-11  9:26       ` David Howells
  2 siblings, 1 reply; 91+ messages in thread
From: Daniel Phillips @ 2005-08-09 23:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

On Tuesday 09 August 2005 07:54, Andrew Morton wrote:
> Daniel Phillips <phillips@arcor.de> wrote:
> > > Suggestion for your next act:
> >
> > ...kill PG_checked please :)  Or at least keep it from spreading.
>
> It already spread - ext3 is using it and I think reiser4.  I thought I had
> a patch to rename it to PG_misc1 or somesuch, but no.

How about this one?

This filesystem-specific flag needs to be prevented from escaping into other
subsystems that might interact, such as VM.  The current usage is exclusively
for directories, except for Reiser4, which uses it for journalling.

Signed-off-by: Daniel Phillips <phillips@istop.com>

diff -up --recursive 2.6.13-rc5-mm1.clean/fs/afs/dir.c 2.6.13-rc5-mm1/fs/afs/dir.c
--- 2.6.13-rc5-mm1.clean/fs/afs/dir.c	2005-06-17 15:48:29.000000000 -0400
+++ 2.6.13-rc5-mm1/fs/afs/dir.c	2005-08-09 18:59:49.000000000 -0400
@@ -155,11 +155,11 @@ static inline void afs_dir_check_page(st
 		}
 	}
 
-	SetPageChecked(page);
+	SetPageMiscFS(page);
 	return;
 
  error:
-	SetPageChecked(page);
+	SetPageMiscFS(page);
 	SetPageError(page);
 
 } /* end afs_dir_check_page() */
@@ -193,7 +193,7 @@ static struct page *afs_dir_get_page(str
 		kmap(page);
 		if (!PageUptodate(page))
 			goto fail;
-		if (!PageChecked(page))
+		if (!PageMiscFS(page))
 			afs_dir_check_page(dir, page);
 		if (PageError(page))
 			goto fail;
diff -up --recursive 2.6.13-rc5-mm1.clean/fs/ext2/dir.c 2.6.13-rc5-mm1/fs/ext2/dir.c
--- 2.6.13-rc5-mm1.clean/fs/ext2/dir.c	2005-06-17 15:48:29.000000000 -0400
+++ 2.6.13-rc5-mm1/fs/ext2/dir.c	2005-08-09 18:59:51.000000000 -0400
@@ -112,7 +112,7 @@ static void ext2_check_page(struct page 
 	if (offs != limit)
 		goto Eend;
 out:
-	SetPageChecked(page);
+	SetPageMiscFS(page);
 	return;
 
 	/* Too bad, we had an error */
@@ -152,7 +152,7 @@ Eend:
 		dir->i_ino, (page->index<<PAGE_CACHE_SHIFT)+offs,
 		(unsigned long) le32_to_cpu(p->inode));
 fail:
-	SetPageChecked(page);
+	SetPageMiscFS(page);
 	SetPageError(page);
 }
 
@@ -166,7 +166,7 @@ static struct page * ext2_get_page(struc
 		kmap(page);
 		if (!PageUptodate(page))
 			goto fail;
-		if (!PageChecked(page))
+		if (!PageMiscFS(page))
 			ext2_check_page(page);
 		if (PageError(page))
 			goto fail;
diff -up --recursive 2.6.13-rc5-mm1.clean/fs/ext3/inode.c 2.6.13-rc5-mm1/fs/ext3/inode.c
--- 2.6.13-rc5-mm1.clean/fs/ext3/inode.c	2005-08-09 18:23:30.000000000 -0400
+++ 2.6.13-rc5-mm1/fs/ext3/inode.c	2005-08-09 18:59:53.000000000 -0400
@@ -1369,12 +1369,12 @@ static int ext3_journalled_writepage(str
 		goto no_write;
 	}
 
-	if (!page_has_buffers(page) || PageChecked(page)) {
+	if (!page_has_buffers(page) || PageMiscFS(page)) {
 		/*
 		 * It's mmapped pagecache.  Add buffers and journal it.  There
 		 * doesn't seem much point in redirtying the page here.
 		 */
-		ClearPageChecked(page);
+		ClearPageMiscFS(page);
 		ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE,
 					ext3_get_block);
 		if (ret != 0)
@@ -1429,7 +1429,7 @@ static int ext3_invalidatepage(struct pa
 	 * If it's a full truncate we just forget about the pending dirtying
 	 */
 	if (offset == 0)
-		ClearPageChecked(page);
+		ClearPageMiscFS(page);
 
 	return journal_invalidatepage(journal, page, offset);
 }
@@ -1438,7 +1438,7 @@ static int ext3_releasepage(struct page 
 {
 	journal_t *journal = EXT3_JOURNAL(page->mapping->host);
 
-	WARN_ON(PageChecked(page));
+	WARN_ON(PageMiscFS(page));
 	if (!page_has_buffers(page))
 		return 0;
 	return journal_try_to_free_buffers(journal, page, wait);
@@ -1535,7 +1535,7 @@ out:
  */
 static int ext3_journalled_set_page_dirty(struct page *page)
 {
-	SetPageChecked(page);
+	SetPageMiscFS(page);
 	return __set_page_dirty_nobuffers(page);
 }
 
diff -up --recursive 2.6.13-rc5-mm1.clean/fs/freevxfs/vxfs_subr.c 2.6.13-rc5-mm1/fs/freevxfs/vxfs_subr.c
--- 2.6.13-rc5-mm1.clean/fs/freevxfs/vxfs_subr.c	2005-08-09 18:23:11.000000000 -0400
+++ 2.6.13-rc5-mm1/fs/freevxfs/vxfs_subr.c	2005-08-09 18:59:54.000000000 -0400
@@ -79,7 +79,7 @@ vxfs_get_page(struct address_space *mapp
 		kmap(pp);
 		if (!PageUptodate(pp))
 			goto fail;
-		/** if (!PageChecked(pp)) **/
+		/** if (!PageMiscFS(pp)) **/
 			/** vxfs_check_page(pp); **/
 		if (PageError(pp))
 			goto fail;
diff -up --recursive 2.6.13-rc5-mm1.clean/fs/reiser4/page_cache.c 2.6.13-rc5-mm1/fs/reiser4/page_cache.c
--- 2.6.13-rc5-mm1.clean/fs/reiser4/page_cache.c	2005-08-09 18:23:30.000000000 -0400
+++ 2.6.13-rc5-mm1/fs/reiser4/page_cache.c	2005-08-09 18:59:58.000000000 -0400
@@ -754,7 +754,7 @@ print_page(const char *prefix, struct pa
 	       page_flag_name(page, PG_lru),
 	       page_flag_name(page, PG_slab),
 
-	       page_flag_name(page, PG_checked),
+	       page_flag_name(page, PG_miscfs),
 	       page_flag_name(page, PG_reserved),
 	       page_flag_name(page, PG_private), page_flag_name(page, PG_writeback), page_flag_name(page, PG_nosave));
 	if (jprivate(page) != NULL) {
diff -up --recursive 2.6.13-rc5-mm1.clean/fs/reiserfs/inode.c 2.6.13-rc5-mm1/fs/reiserfs/inode.c
--- 2.6.13-rc5-mm1.clean/fs/reiserfs/inode.c	2005-08-09 18:23:31.000000000 -0400
+++ 2.6.13-rc5-mm1/fs/reiserfs/inode.c	2005-08-09 19:00:02.000000000 -0400
@@ -2347,7 +2347,7 @@ static int reiserfs_write_full_page(stru
 	struct buffer_head *head, *bh;
 	int partial = 0;
 	int nr = 0;
-	int checked = PageChecked(page);
+	int checked = PageMiscFS(page);
 	struct reiserfs_transaction_handle th;
 	struct super_block *s = inode->i_sb;
 	int bh_per_page = PAGE_CACHE_SIZE / s->s_blocksize;
@@ -2409,7 +2409,7 @@ static int reiserfs_write_full_page(stru
 	 * blocks we're going to log
 	 */
 	if (checked) {
-		ClearPageChecked(page);
+		ClearPageMiscFS(page);
 		reiserfs_write_lock(s);
 		error = journal_begin(&th, s, bh_per_page + 1);
 		if (error) {
@@ -2790,7 +2790,7 @@ static int reiserfs_invalidatepage(struc
 	BUG_ON(!PageLocked(page));
 
 	if (offset == 0)
-		ClearPageChecked(page);
+		ClearPageMiscFS(page);
 
 	if (!page_has_buffers(page))
 		goto out;
@@ -2829,7 +2829,7 @@ static int reiserfs_set_page_dirty(struc
 {
 	struct inode *inode = page->mapping->host;
 	if (reiserfs_file_data_log(inode)) {
-		SetPageChecked(page);
+		SetPageMiscFS(page);
 		return __set_page_dirty_nobuffers(page);
 	}
 	return __set_page_dirty_buffers(page);
@@ -2852,7 +2852,7 @@ static int reiserfs_releasepage(struct p
 	struct buffer_head *bh;
 	int ret = 1;
 
-	WARN_ON(PageChecked(page));
+	WARN_ON(PageMiscFS(page));
 	spin_lock(&j->j_dirty_buffers_lock);
 	head = page_buffers(page);
 	bh = head;
diff -up --recursive 2.6.13-rc5-mm1.clean/include/linux/page-flags.h 2.6.13-rc5-mm1/include/linux/page-flags.h
--- 2.6.13-rc5-mm1.clean/include/linux/page-flags.h	2005-08-09 18:23:31.000000000 -0400
+++ 2.6.13-rc5-mm1/include/linux/page-flags.h	2005-08-09 18:59:57.000000000 -0400
@@ -61,7 +61,7 @@
 #define PG_active		 6
 #define PG_slab			 7	/* slab debug (Suparna wants this) */
 
-#define PG_checked		 8	/* kill me in 2.5.<early>. */
+#define PG_miscfs		 8	/* kill me in 2.5.<early>. */
 #define PG_fs_misc		 8
 #define PG_arch_1		 9
 #define PG_reserved		10
@@ -227,9 +227,9 @@ extern void __mod_page_state(unsigned lo
 #define PageHighMem(page)	0 /* needed to optimize away at compile time */
 #endif
 
-#define PageChecked(page)	test_bit(PG_checked, &(page)->flags)
-#define SetPageChecked(page)	set_bit(PG_checked, &(page)->flags)
-#define ClearPageChecked(page)	clear_bit(PG_checked, &(page)->flags)
+#define PageMiscFS(page)	test_bit(PG_miscfs, &(page)->flags)
+#define SetPageMiscFS(page)	set_bit(PG_miscfs, &(page)->flags)
+#define ClearPageMiscFS(page)	clear_bit(PG_miscfs, &(page)->flags)
 
 #define PageReserved(page)	test_bit(PG_reserved, &(page)->flags)
 #define SetPageReserved(page)	set_bit(PG_reserved, &(page)->flags)
diff -up --recursive 2.6.13-rc5-mm1.clean/mm/page_alloc.c 2.6.13-rc5-mm1/mm/page_alloc.c
--- 2.6.13-rc5-mm1.clean/mm/page_alloc.c	2005-08-09 18:23:31.000000000 -0400
+++ 2.6.13-rc5-mm1/mm/page_alloc.c	2005-08-09 18:59:48.000000000 -0400
@@ -458,7 +458,7 @@ static void prep_new_page(struct page *p
 
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
 			1 << PG_referenced | 1 << PG_arch_1 |
-			1 << PG_checked | 1 << PG_mappedtodisk);
+			1 << PG_miscfs | 1 << PG_mappedtodisk);
 	page->private = 0;
 	set_page_refs(page, order);
 	kernel_map_pages(page, 1 << order, 1);

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-09 23:23       ` [RFC][PATCH] Rename PageChecked as PageMiscFS Daniel Phillips
@ 2005-08-10  7:48         ` Hugh Dickins
  2005-08-10  8:06           ` Daniel Phillips
  0 siblings, 1 reply; 91+ messages in thread
From: Hugh Dickins @ 2005-08-10  7:48 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrew Morton, linux-kernel, linux-mm

On Wed, 10 Aug 2005, Daniel Phillips wrote:
> --- 2.6.13-rc5-mm1.clean/include/linux/page-flags.h	2005-08-09 18:23:31.000000000 -0400
> +++ 2.6.13-rc5-mm1/include/linux/page-flags.h	2005-08-09 18:59:57.000000000 -0400
> @@ -61,7 +61,7 @@
>  #define PG_active		 6
>  #define PG_slab			 7	/* slab debug (Suparna wants this) */
>  
> -#define PG_checked		 8	/* kill me in 2.5.<early>. */
> +#define PG_miscfs		 8	/* kill me in 2.5.<early>. */
>  #define PG_fs_misc		 8

And all those PageMiscFS macros you're adding to the PageFsMisc ones:
doesn't look like progress to me ;)

Hugh
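
A small sketch of the problem pointed out above, assuming simplified,
non-atomic accessors (the kernel macros in the patch use test_bit/set_bit;
these Sk* macros are hypothetical stand-ins). Two names defined to the same
bit number are aliases, so a bit set through one set of macros is observed
and cleared through the other.

```c
#include <stdbool.h>

#define SK_PG_MISCFS	8
#define SK_PG_FS_MISC	8	/* same bit number: an alias, not a second flag */

struct sk_page {
	unsigned long flags;
};

/* Simplified, non-atomic accessors for illustration only. */
#define SkPageMiscFS(page)	((((page)->flags) >> SK_PG_MISCFS) & 1ul)
#define SkSetPageMiscFS(page)	((page)->flags |= 1ul << SK_PG_MISCFS)
#define SkPageFsMisc(page)	((((page)->flags) >> SK_PG_FS_MISC) & 1ul)
#define SkClearPageFsMisc(page)	((page)->flags &= ~(1ul << SK_PG_FS_MISC))

/* Returns true when a bit set under one name is visible under the other. */
bool sk_names_alias(void)
{
	struct sk_page page = { 0 };

	SkSetPageMiscFS(&page);
	return SkPageFsMisc(&page) == 1ul;
}
```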

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-10  7:48         ` Hugh Dickins
@ 2005-08-10  8:06           ` Daniel Phillips
  0 siblings, 0 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-10  8:06 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm

On Wednesday 10 August 2005 17:48, Hugh Dickins wrote:
> On Wed, 10 Aug 2005, Daniel Phillips wrote:
> > --- 2.6.13-rc5-mm1.clean/include/linux/page-flags.h	2005-08-09 18:23:31.000000000 -0400
> > +++ 2.6.13-rc5-mm1/include/linux/page-flags.h	2005-08-09 18:59:57.000000000 -0400
> > @@ -61,7 +61,7 @@
> >  #define PG_active		 6
> >  #define PG_slab			 7	/* slab debug (Suparna wants this) */
> >
> > -#define PG_checked		 8	/* kill me in 2.5.<early>. */
> > +#define PG_miscfs		 8	/* kill me in 2.5.<early>. */
> >  #define PG_fs_misc		 8
>
> And all those PageMiscFS macros you're adding to the PageFsMisc ones:
> doesn't look like progress to me ;)

Heh, it looks like part of a patch did creep into Andrew's tree already.  I'll 
fix it on the morrow.

Regards,

Daniel

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 19:41           ` Russell King
  2005-08-09 20:51             ` Linus Torvalds
  2005-08-09 21:16             ` Martin J. Bligh
@ 2005-08-10  9:27             ` Benjamin Herrenschmidt
  2005-08-11  9:09               ` Nick Piggin
  2 siblings, 1 reply; 91+ messages in thread
From: Benjamin Herrenschmidt @ 2005-08-10  9:27 UTC (permalink / raw)
  To: Russell King
  Cc: Martin J. Bligh, Nick Piggin, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli

On Tue, 2005-08-09 at 20:41 +0100, Russell King wrote:
> On Tue, Aug 09, 2005 at 07:38:52AM -0700, Martin J. Bligh wrote:
> > pfn_valid() doesn't tell you it's RAM or not - it tells you whether you
> > have a backing struct page for that address. Could be an IO mapped device,
> > a small memory hole, whatever.
> 
> The only thing which has a struct page is RAM.  Nothing else does.

Well, not anymore :)

With sparsemem, you can cheat now and have struct page for non-RAM, and
this is actually useful. I want some IO space to be "context switchable"
and thus map it with nopage() functionality, etc...

Ben.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-08 21:24   ` Daniel Phillips
  2005-08-08 21:54     ` Andrew Morton
@ 2005-08-10 13:13     ` David Howells
  2005-08-10 13:34       ` Daniel Phillips
  2005-08-10 14:27       ` David Howells
  1 sibling, 2 replies; 91+ messages in thread
From: David Howells @ 2005-08-10 13:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Daniel Phillips, nickpiggin, linux-kernel, linux-mm, hugh,
	torvalds, andrea, benh

Andrew Morton <akpm@osdl.org> wrote:

> > ...kill PG_checked please :)  Or at least keep it from spreading.
> > 
> 
> It already spread - ext3 is using it and I think reiser4.  I thought I had
> a patch to rename it to PG_misc1 or somesuch, but no.  Its mandate becomes
> "filesystem-specific page flag".

You're carrying a patch to stick a flag called PG_fs_misc, but that has the
same value as PG_checked. An extra page flag beyond PG_uptodate, PG_lock and
PG_writeback is required to make readpage through the cache non-synchronous.
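
To make the collision concrete, here is a small user-space sketch. It assumes
nothing beyond the two #defines quoted in this thread; struct page_model and
the model_*() helpers are made-up stand-ins, not kernel API, and unlike the
kernel's set_bit()/test_bit() they are not atomic.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Both names map to bit 8, as in the -mm page-flags.h quoted in this thread. */
#define PG_checked	8
#define PG_fs_misc	8

/* Hypothetical stand-in for struct page; only the flags word matters here. */
struct page_model {
	unsigned long flags;
};

/* Non-atomic stand-in for the kernel's set_bit(). */
void model_set_bit(int nr, unsigned long *word)
{
	assert(nr >= 0 && nr < (int)(sizeof(*word) * CHAR_BIT));
	*word |= 1UL << nr;
}

/* Non-atomic stand-in for the kernel's test_bit(). */
bool model_test_bit(int nr, const unsigned long *word)
{
	assert(nr >= 0 && nr < (int)(sizeof(*word) * CHAR_BIT));
	return (*word >> nr) & 1UL;
}

/*
 * One user marks a page "checked"; a second user, looking at the same page
 * through the PG_fs_misc name, then sees its own flag as already set.
 */
bool names_collide(void)
{
	struct page_model page = { 0 };

	model_set_bit(PG_checked, &page.flags);
	return model_test_bit(PG_fs_misc, &page.flags);
}
```

Calling names_collide() returns true, which is the whole problem: the two
names cannot be used independently on the same page.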

David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-10 13:13     ` [RFC][patch 0/2] mm: remove PageReserved David Howells
@ 2005-08-10 13:34       ` Daniel Phillips
  2005-08-10 14:27       ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-10 13:34 UTC (permalink / raw)
  To: David Howells; +Cc: Andrew Morton, linux-kernel, linux-mm, hugh

On Wednesday 10 August 2005 23:13, David Howells wrote:
> Andrew Morton <akpm@osdl.org> wrote:
> > > ...kill PG_checked please :)  Or at least keep it from spreading.
> >
> > It already spread - ext3 is using it and I think reiser4.  I thought I
> > had a patch to rename it to PG_misc1 or somesuch, but no.  Its mandate
> > becomes "filesystem-specific page flag".
>
> You're carrying a patch to stick a flag called PG_fs_misc, but that has the
> same value as PG_checked. An extra page flag beyond PG_uptodate, PG_lock
> and PG_writeback is required to make readpage through the cache
> non-synchronous.

David,

Interesting, have you got a pointer to a full explanation?  Is this about aio?

Regards,

Daniel

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-10 13:13     ` [RFC][patch 0/2] mm: remove PageReserved David Howells
  2005-08-10 13:34       ` Daniel Phillips
@ 2005-08-10 14:27       ` David Howells
  2005-08-10 23:19         ` Daniel Phillips
  2005-08-11 10:49         ` David Howells
  1 sibling, 2 replies; 91+ messages in thread
From: David Howells @ 2005-08-10 14:27 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Howells, Andrew Morton, linux-kernel, linux-mm, hugh

Daniel Phillips <phillips@arcor.de> wrote:

> > An extra page flag beyond PG_uptodate, PG_lock and PG_writeback is
> > required to make readpage through the cache non-synchronous.

Sorry, I meant to say "filesystem cache": FS-Cache/CacheFS.

> Interesting, have you got a pointer to a full explanation?  Is this about aio?

No, it's nothing to do with AIO. This is to do with using local disk to cache
network filesystems and other relatively slow devices.

What happens is this:

 (1) readpage() is issued against NFS (for example).

 (2) NFS consults the local cache, and finds the page isn't available there.

 (3) NFS reads the page from the server.

 (4) NFS sets PG_fs_misc and tells the cache to store the page.

 (5) NFS sets PG_uptodate and unlocks the page.

Some time later, the cache finishes writing the page to disk:

 (6) The cache calls NFS to say that it's finished writing the page.

 (7) NFS calls end_page_fs_misc() - which clears PG_fs_misc - to indicate to
     any waiters that the page can now be written to.

Now: any PTEs set up to point to this page start life read-only. If they're
part of a shared-writable mapping, then the MMU will generate a WP fault when
someone attempts to write to the page through that mapping:

 (a) do_wp_page() gets called.

 (b) do_wp_page() sees that the page's host has registered an interest in
     knowing that the page is becoming writable:

	vm_operations_struct::page_mkwrite()

 (c) do_wp_page() calls out to the filesystem.

 (d) NFS sees the page is wanting to become writable and waits for the
     PG_fs_misc flag to become cleared.

 (e) NFS returns to the caller and things proceed as normal.

Doing this permits the cache state to be more predictable in the event of
power loss because we know that userspace won't have scribbled on this page
whilst the cache was trying to write it to disk.
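
The sequence above can be modelled in user space roughly as follows. This is
a single-threaded sketch under stated assumptions: where the real code would
sleep until the flag clears, the model only reports whether the write may
proceed. The names netfs_read_done(), cache_write_done() and
mkwrite_may_proceed() are invented for illustration, and the bit numbers other
than PG_fs_misc are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit numbers for the model; only PG_fs_misc == 8 comes from the thread. */
enum { PG_locked = 0, PG_uptodate = 1, PG_fs_misc = 8 };

/* Hypothetical stand-in for struct page. */
struct page_model {
	unsigned long flags;
};

/* Steps (4) and (5): the network read has completed. */
void netfs_read_done(struct page_model *page)
{
	page->flags |= 1UL << PG_fs_misc;	/* (4) cache write now in flight */
	page->flags |= 1UL << PG_uptodate;	/* (5) contents are valid */
	page->flags &= ~(1UL << PG_locked);	/* (5) unlock the page */
}

/* Steps (6) and (7): the cache reports that its write has finished. */
void cache_write_done(struct page_model *page)
{
	/* Mirrors the BUG() in end_page_fs_misc(): the flag must be set. */
	assert(page->flags & (1UL << PG_fs_misc));
	page->flags &= ~(1UL << PG_fs_misc);
}

/* Step (d): a writer may proceed only once the cache write is complete. */
bool mkwrite_may_proceed(const struct page_model *page)
{
	return !(page->flags & (1UL << PG_fs_misc));
}
```

Between netfs_read_done() and cache_write_done() the page is readable but
mkwrite_may_proceed() returns false, which is the window in which userspace
must not scribble on the page.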

David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-09 19:22         ` Daniel Phillips
@ 2005-08-10 21:50           ` Pavel Machek
  2005-08-10 21:56             ` Martin J. Bligh
  2005-08-11 10:26             ` Rafael J. Wysocki
  0 siblings, 2 replies; 91+ messages in thread
From: Pavel Machek @ 2005-08-10 21:50 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Nick Piggin, Benjamin Herrenschmidt, linux-kernel,
	Linux Memory Management, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli

Hi!

> > Swsusp is the main "is valid ram" user I have in mind here. It
> > wants to know whether or not it should save and restore the
> > memory of a given `struct page`.
> 
> Why can't it follow the rmap chain?

It is walking physical memory, not memory management chains. I need
something like:

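/* Return 1 if the page at this zone-relative pfn must be saved in the image. */
static int saveable(struct zone * zone, unsigned long * zone_pfn)
{
        unsigned long pfn = *zone_pfn + zone->zone_start_pfn;
        struct page * page;

        /* No struct page behind this pfn: nothing to look at. */
        if (!pfn_valid(pfn))
                return 0;

        page = pfn_to_page(pfn);
        /* A page should never be both reserved and marked nosave. */
        BUG_ON(PageReserved(page) && PageNosave(page));
        if (PageNosave(page))
                return 0;
        /* Reserved pages are skipped only if they lie in the nosave region. */
        if (PageReserved(page) && pfn_is_nosave(pfn)) {
                pr_debug("[nosave pfn 0x%lx]", pfn);
                return 0;
        }
        /* Pages marked nosave-free (free pages) need not be saved. */
        if (PageNosaveFree(page))
                return 0;

        return 1;
}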
								Pavel
-- 
if you have sharp zaurus hardware you don't need... you know my address

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-10 21:50           ` Pavel Machek
@ 2005-08-10 21:56             ` Martin J. Bligh
  2005-08-11 10:36               ` Rafael J. Wysocki
  2005-08-11 10:26             ` Rafael J. Wysocki
  1 sibling, 1 reply; 91+ messages in thread
From: Martin J. Bligh @ 2005-08-10 21:56 UTC (permalink / raw)
  To: Pavel Machek, Daniel Phillips
  Cc: Nick Piggin, Benjamin Herrenschmidt, linux-kernel,
	Linux Memory Management, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli

--On Wednesday, August 10, 2005 23:50:22 +0200 Pavel Machek <pavel@suse.cz> wrote:

> Hi!
> 
>> > Swsusp is the main "is valid ram" user I have in mind here. It
>> > wants to know whether or not it should save and restore the
>> > memory of a given `struct page`.
>> 
>> Why can't it follow the rmap chain?
> 
> It is walking physical memory, not memory management chains. I need
> something like:

Can you not use page_is_ram(pfn) ?
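
For illustration, a page_is_ram()-style predicate could drive the walk roughly
like this. It is a user-space sketch only: the RAM map, page_is_ram_model()
and count_saveable() are hypothetical, and a real implementation would still
have to honour the nosave and free-page markings that Pavel's function tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Half-open pfn range [start_pfn, end_pfn) that is backed by RAM. */
struct ram_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/* Hypothetical firmware memory map: two RAM areas with a hole between. */
static const struct ram_range ram_map[] = {
	{ 0, 160 },
	{ 256, 1024 },
};

/* Model of a portable page_is_ram(): true if the pfn lies in a RAM range. */
bool page_is_ram_model(unsigned long pfn)
{
	size_t i;

	for (i = 0; i < sizeof(ram_map) / sizeof(ram_map[0]); i++)
		if (pfn >= ram_map[i].start_pfn && pfn < ram_map[i].end_pfn)
			return true;
	return false;
}

/* Walk physical pfns in [start_pfn, end_pfn) and count the ones to save. */
unsigned long count_saveable(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long pfn, count = 0;

	assert(start_pfn <= end_pfn);
	for (pfn = start_pfn; pfn < end_pfn; pfn++)
		if (page_is_ram_model(pfn))
			count++;
	return count;
}
```

The point of the sketch is that the decision depends only on the pfn, so no
page flag is consumed to answer "is this RAM?".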

M.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-08 21:54     ` Andrew Morton
  2005-08-09 23:23       ` [RFC][PATCH] Rename PageChecked as PageMiscFS Daniel Phillips
@ 2005-08-10 22:12       ` Daniel Phillips
  2005-08-10 22:23         ` Daniel Phillips
  2005-08-11  9:31         ` David Howells
  2005-08-11  9:26       ` David Howells
  2 siblings, 2 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-10 22:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Hugh Dickins, David Howells

This filesystem-specific flag needs to be prevented from escaping into other
subsystems that might interact, such as VM.  The current usage is mainly
for directories, except for Reiser4, which uses it for journalling, and NFS,
which presses it into service in a network cache coherency role.

Also resolves the collision between two different names for the same flag bit.

Signed-off-by: Daniel Phillips <phillips@istop.com>

diff -up --recursive 2.6.13-rc5-mm1.clean/fs/afs/dir.c 2.6.13-rc5-mm1/fs/afs/dir.c
--- 2.6.13-rc5-mm1.clean/fs/afs/dir.c	2005-06-17 15:48:29.000000000 -0400
+++ 2.6.13-rc5-mm1/fs/afs/dir.c	2005-08-09 18:59:49.000000000 -0400
@@ -155,11 +155,11 @@ static inline void afs_dir_check_page(st
 		}
 	}
 
-	SetPageChecked(page);
+	SetPageMiscFS(page);
 	return;
 
  error:
-	SetPageChecked(page);
+	SetPageMiscFS(page);
 	SetPageError(page);
 
 } /* end afs_dir_check_page() */
@@ -193,7 +193,7 @@ static struct page *afs_dir_get_page(str
 		kmap(page);
 		if (!PageUptodate(page))
 			goto fail;
-		if (!PageChecked(page))
+		if (!PageMiscFS(page))
 			afs_dir_check_page(dir, page);
 		if (PageError(page))
 			goto fail;
diff -up --recursive 2.6.13-rc5-mm1.clean/fs/ext2/dir.c 2.6.13-rc5-mm1/fs/ext2/dir.c
--- 2.6.13-rc5-mm1.clean/fs/ext2/dir.c	2005-06-17 15:48:29.000000000 -0400
+++ 2.6.13-rc5-mm1/fs/ext2/dir.c	2005-08-09 18:59:51.000000000 -0400
@@ -112,7 +112,7 @@ static void ext2_check_page(struct page 
 	if (offs != limit)
 		goto Eend;
 out:
-	SetPageChecked(page);
+	SetPageMiscFS(page);
 	return;
 
 	/* Too bad, we had an error */
@@ -152,7 +152,7 @@ Eend:
 		dir->i_ino, (page->index<<PAGE_CACHE_SHIFT)+offs,
 		(unsigned long) le32_to_cpu(p->inode));
 fail:
-	SetPageChecked(page);
+	SetPageMiscFS(page);
 	SetPageError(page);
 }
 
@@ -166,7 +166,7 @@ static struct page * ext2_get_page(struc
 		kmap(page);
 		if (!PageUptodate(page))
 			goto fail;
-		if (!PageChecked(page))
+		if (!PageMiscFS(page))
 			ext2_check_page(page);
 		if (PageError(page))
 			goto fail;
diff -up --recursive 2.6.13-rc5-mm1.clean/fs/ext3/inode.c 2.6.13-rc5-mm1/fs/ext3/inode.c
--- 2.6.13-rc5-mm1.clean/fs/ext3/inode.c	2005-08-09 18:23:30.000000000 -0400
+++ 2.6.13-rc5-mm1/fs/ext3/inode.c	2005-08-09 18:59:53.000000000 -0400
@@ -1369,12 +1369,12 @@ static int ext3_journalled_writepage(str
 		goto no_write;
 	}
 
-	if (!page_has_buffers(page) || PageChecked(page)) {
+	if (!page_has_buffers(page) || PageMiscFS(page)) {
 		/*
 		 * It's mmapped pagecache.  Add buffers and journal it.  There
 		 * doesn't seem much point in redirtying the page here.
 		 */
-		ClearPageChecked(page);
+		ClearPageMiscFS(page);
 		ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE,
 					ext3_get_block);
 		if (ret != 0)
@@ -1429,7 +1429,7 @@ static int ext3_invalidatepage(struct pa
 	 * If it's a full truncate we just forget about the pending dirtying
 	 */
 	if (offset == 0)
-		ClearPageChecked(page);
+		ClearPageMiscFS(page);
 
 	return journal_invalidatepage(journal, page, offset);
 }
@@ -1438,7 +1438,7 @@ static int ext3_releasepage(struct page 
 {
 	journal_t *journal = EXT3_JOURNAL(page->mapping->host);
 
-	WARN_ON(PageChecked(page));
+	WARN_ON(PageMiscFS(page));
 	if (!page_has_buffers(page))
 		return 0;
 	return journal_try_to_free_buffers(journal, page, wait);
@@ -1535,7 +1535,7 @@ out:
  */
 static int ext3_journalled_set_page_dirty(struct page *page)
 {
-	SetPageChecked(page);
+	SetPageMiscFS(page);
 	return __set_page_dirty_nobuffers(page);
 }
 
diff -up --recursive 2.6.13-rc5-mm1.clean/fs/freevxfs/vxfs_subr.c 2.6.13-rc5-mm1/fs/freevxfs/vxfs_subr.c
--- 2.6.13-rc5-mm1.clean/fs/freevxfs/vxfs_subr.c	2005-08-09 18:23:11.000000000 -0400
+++ 2.6.13-rc5-mm1/fs/freevxfs/vxfs_subr.c	2005-08-09 18:59:54.000000000 -0400
@@ -79,7 +79,7 @@ vxfs_get_page(struct address_space *mapp
 		kmap(pp);
 		if (!PageUptodate(pp))
 			goto fail;
-		/** if (!PageChecked(pp)) **/
+		/** if (!PageMiscFS(pp)) **/
 			/** vxfs_check_page(pp); **/
 		if (PageError(pp))
 			goto fail;
diff -up --recursive 2.6.13-rc5-mm1.clean/fs/reiser4/page_cache.c 2.6.13-rc5-mm1/fs/reiser4/page_cache.c
--- 2.6.13-rc5-mm1.clean/fs/reiser4/page_cache.c	2005-08-09 18:23:30.000000000 -0400
+++ 2.6.13-rc5-mm1/fs/reiser4/page_cache.c	2005-08-09 18:59:58.000000000 -0400
@@ -754,7 +754,7 @@ print_page(const char *prefix, struct pa
 	       page_flag_name(page, PG_lru),
 	       page_flag_name(page, PG_slab),
 
-	       page_flag_name(page, PG_checked),
+	       page_flag_name(page, PG_miscfs),
 	       page_flag_name(page, PG_reserved),
 	       page_flag_name(page, PG_private), page_flag_name(page, PG_writeback), page_flag_name(page, PG_nosave));
 	if (jprivate(page) != NULL) {
diff -up --recursive 2.6.13-rc5-mm1.clean/fs/reiserfs/inode.c 2.6.13-rc5-mm1/fs/reiserfs/inode.c
--- 2.6.13-rc5-mm1.clean/fs/reiserfs/inode.c	2005-08-09 18:23:31.000000000 -0400
+++ 2.6.13-rc5-mm1/fs/reiserfs/inode.c	2005-08-09 19:00:02.000000000 -0400
@@ -2347,7 +2347,7 @@ static int reiserfs_write_full_page(stru
 	struct buffer_head *head, *bh;
 	int partial = 0;
 	int nr = 0;
-	int checked = PageChecked(page);
+	int checked = PageMiscFS(page);
 	struct reiserfs_transaction_handle th;
 	struct super_block *s = inode->i_sb;
 	int bh_per_page = PAGE_CACHE_SIZE / s->s_blocksize;
@@ -2409,7 +2409,7 @@ static int reiserfs_write_full_page(stru
 	 * blocks we're going to log
 	 */
 	if (checked) {
-		ClearPageChecked(page);
+		ClearPageMiscFS(page);
 		reiserfs_write_lock(s);
 		error = journal_begin(&th, s, bh_per_page + 1);
 		if (error) {
@@ -2790,7 +2790,7 @@ static int reiserfs_invalidatepage(struc
 	BUG_ON(!PageLocked(page));
 
 	if (offset == 0)
-		ClearPageChecked(page);
+		ClearPageMiscFS(page);
 
 	if (!page_has_buffers(page))
 		goto out;
@@ -2829,7 +2829,7 @@ static int reiserfs_set_page_dirty(struc
 {
 	struct inode *inode = page->mapping->host;
 	if (reiserfs_file_data_log(inode)) {
-		SetPageChecked(page);
+		SetPageMiscFS(page);
 		return __set_page_dirty_nobuffers(page);
 	}
 	return __set_page_dirty_buffers(page);
@@ -2852,7 +2852,7 @@ static int reiserfs_releasepage(struct p
 	struct buffer_head *bh;
 	int ret = 1;
 
-	WARN_ON(PageChecked(page));
+	WARN_ON(PageMiscFS(page));
 	spin_lock(&j->j_dirty_buffers_lock);
 	head = page_buffers(page);
 	bh = head;
diff -up --recursive 2.6.13-rc5-mm1.clean/include/linux/page-flags.h 2.6.13-rc5-mm1/include/linux/page-flags.h
--- 2.6.13-rc5-mm1.clean/include/linux/page-flags.h	2005-08-09 18:23:31.000000000 -0400
+++ 2.6.13-rc5-mm1/include/linux/page-flags.h	2005-08-10 17:41:32.000000000 -0400
@@ -61,8 +61,7 @@
 #define PG_active		 6
 #define PG_slab			 7	/* slab debug (Suparna wants this) */
 
-#define PG_checked		 8	/* kill me in 2.5.<early>. */
-#define PG_fs_misc		 8
+#define PG_fs_misc		 8	/* don't let me spread */
 #define PG_arch_1		 9
 #define PG_reserved		10
 #define PG_private		11	/* Has something at ->private */
@@ -227,9 +226,11 @@ extern void __mod_page_state(unsigned lo
 #define PageHighMem(page)	0 /* needed to optimize away at compile time */
 #endif
 
-#define PageChecked(page)	test_bit(PG_checked, &(page)->flags)
-#define SetPageChecked(page)	set_bit(PG_checked, &(page)->flags)
-#define ClearPageChecked(page)	clear_bit(PG_checked, &(page)->flags)
+#define PageMiscFS(page)	test_bit(PG_fs_misc, &(page)->flags)
+#define SetPageMiscFS(page)	set_bit(PG_fs_misc, &(page)->flags)
+#define ClearPageMiscFS(page)	clear_bit(PG_fs_misc, &(page)->flags)
+#define TestSetPageMiscFS(page)		test_and_set_bit(PG_fs_misc, &(page)->flags)
+#define TestClearPageMiscFS(page)	test_and_clear_bit(PG_fs_misc, &(page)->flags)
 
 #define PageReserved(page)	test_bit(PG_reserved, &(page)->flags)
 #define SetPageReserved(page)	set_bit(PG_reserved, &(page)->flags)
@@ -313,7 +314,7 @@ extern void __mod_page_state(unsigned lo
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
-struct page;	/* forward declaration */
+struct page; /* What am I doing in this file? */
 
 int test_clear_page_dirty(struct page *page);
 int test_clear_page_writeback(struct page *page);
@@ -329,13 +330,4 @@ static inline void set_page_writeback(st
 	test_set_page_writeback(page);
 }
 
-/*
- * Filesystem-specific page bit testing
- */
-#define PageFsMisc(page)		test_bit(PG_fs_misc, &(page)->flags)
-#define SetPageFsMisc(page)		set_bit(PG_fs_misc, &(page)->flags)
-#define TestSetPageFsMisc(page)		test_and_set_bit(PG_fs_misc, &(page)->flags)
-#define ClearPageFsMisc(page)		clear_bit(PG_fs_misc, &(page)->flags)
-#define TestClearPageFsMisc(page)	test_and_clear_bit(PG_fs_misc, &(page)->flags)
-
 #endif	/* PAGE_FLAGS_H */
diff -up --recursive 2.6.13-rc5-mm1.clean/include/linux/pagemap.h 2.6.13-rc5-mm1/include/linux/pagemap.h
--- 2.6.13-rc5-mm1.clean/include/linux/pagemap.h	2005-08-09 18:23:31.000000000 -0400
+++ 2.6.13-rc5-mm1/include/linux/pagemap.h	2005-08-10 17:18:18.000000000 -0400
@@ -204,7 +204,7 @@ extern void end_page_writeback(struct pa
  */
 static inline void wait_on_page_fs_misc(struct page *page)
 {
-	if (PageFsMisc(page))
+	if (PageMiscFS(page))
 		wait_on_page_bit(page, PG_fs_misc);
 }
 
diff -up --recursive 2.6.13-rc5-mm1.clean/mm/filemap.c 2.6.13-rc5-mm1/mm/filemap.c
--- 2.6.13-rc5-mm1.clean/mm/filemap.c	2005-08-09 18:23:31.000000000 -0400
+++ 2.6.13-rc5-mm1/mm/filemap.c	2005-08-10 17:15:45.000000000 -0400
@@ -509,7 +509,7 @@ EXPORT_SYMBOL(__lock_page);
 void fastcall end_page_fs_misc(struct page *page)
 {
 	smp_mb__before_clear_bit();
-	if (!TestClearPageFsMisc(page))
+	if (!TestClearPageMiscFS(page))
 		BUG();
 	smp_mb__after_clear_bit();
 	__wake_up_bit(page_waitqueue(page), &page->flags, PG_fs_misc);
diff -up --recursive 2.6.13-rc5-mm1.clean/mm/page_alloc.c 2.6.13-rc5-mm1/mm/page_alloc.c
--- 2.6.13-rc5-mm1.clean/mm/page_alloc.c	2005-08-09 18:23:31.000000000 -0400
+++ 2.6.13-rc5-mm1/mm/page_alloc.c	2005-08-10 17:19:30.000000000 -0400
@@ -458,7 +458,7 @@ static void prep_new_page(struct page *p
 
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
 			1 << PG_referenced | 1 << PG_arch_1 |
-			1 << PG_checked | 1 << PG_mappedtodisk);
+			1 << PG_fs_misc | 1 << PG_mappedtodisk);
 	page->private = 0;
 	set_page_refs(page, order);
 	kernel_map_pages(page, 1 << order, 1);

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-10 22:12       ` Daniel Phillips
@ 2005-08-10 22:23         ` Daniel Phillips
  2005-08-10 22:34           ` Trond Myklebust
                             ` (2 more replies)
  2005-08-11  9:31         ` David Howells
  1 sibling, 3 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-10 22:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Hugh Dickins, David Howells

Note: I have not fully audited the NFS-related colliding use of page flags bit
8, to verify that it really does not escape into VFS or MM from NFS, in fact
I have misgivings about end_page_fs_misc which uses this flag but has no 
in-tree users to show how it is used and, hmm, isn't even _GPL.  What is up?

And note the wrongness tacked onto the end of page-flags.h.  I didn't do it!

Regards,

Daniel

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-10 22:23         ` Daniel Phillips
@ 2005-08-10 22:34           ` Trond Myklebust
  2005-08-10 22:57             ` Daniel Phillips
  2005-08-10 23:42           ` Adrian Bunk
  2005-08-11  9:46           ` David Howells
  2 siblings, 1 reply; 91+ messages in thread
From: Trond Myklebust @ 2005-08-10 22:34 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Andrew Morton, linux-kernel, linux-mm, Hugh Dickins, David Howells

On Thu, 11.08.2005 at 08:23 (+1000), Daniel Phillips wrote:
> Note: I have not fully audited the NFS-related colliding use of page flags bit 
> 8, to verify that it really does not escape into VFS or MM from NFS, in fact 
> I have misgivings about end_page_fs_misc which uses this flag but has no 
> in-tree users to show how it is used and, hmm, isn't even _GPL.  What is up?

What "NFS-related colliding use of page flags bit 8"?

Cheers,
  Trond


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-10 22:34           ` Trond Myklebust
@ 2005-08-10 22:57             ` Daniel Phillips
  2005-08-10 23:23               ` Trond Myklebust
  2005-08-11  9:42               ` David Howells
  0 siblings, 2 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-10 22:57 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew Morton, linux-kernel, linux-mm, Hugh Dickins, David Howells

Hi Trond,

On Thursday 11 August 2005 08:34, Trond Myklebust wrote:
> On Thu, 11.08.2005 at 08:23 (+1000), Daniel Phillips wrote:
> > Note: I have not fully audited the NFS-related colliding use of page
> > flags bit 8, to verify that it really does not escape into VFS or MM from
> > NFS, in fact I have misgivings about end_page_fs_misc which uses this
> > flag but has no in-tree users to show how it is used and, hmm, isn't even
> > _GPL.  What is up?
>
> What "NFS-related colliding use of page flags bit 8"?

As explained to me:

http://marc.theaimsgroup.com/?l=linux-kernel&m=112368417412580&w=2

Regards,

Daniel

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-10 14:27       ` David Howells
@ 2005-08-10 23:19         ` Daniel Phillips
  2005-08-11 10:49         ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-10 23:19 UTC (permalink / raw)
  To: David Howells; +Cc: Andrew Morton, linux-kernel, linux-mm, hugh

On Thursday 11 August 2005 00:27, David Howells wrote:
> What happens is this:
>
>  (1) readpage() is issued against NFS (for example).
>
>  (2) NFS consults the local cache, and finds the page isn't available
> there.
>
>  (3) NFS reads the page from the server.
>
>  (4) NFS sets PG_fs_misc and tells the cache to store the page.
>
>  (5) NFS sets PG_uptodate and unlocks the page.
>
> Some time later, the cache finishes writing the page to disk:
>
>  (6) The cache calls NFS to say that it's finished writing the page.
>
>  (7) NFS calls end_page_fs_misc() - which clears PG_fs_misc - to indicate
> to any waiters that the page can now be written to.
>
> Now: any PTEs set up to point to this page start life read-only. If they're
> part of a shared-writable mapping, then the MMU will generate a WP fault
> when someone attempts to write to the page through that mapping:
>
>  (a) do_wp_page() gets called.
>
>  (b) do_wp_page() sees that the page's host has registered an interest in
>      knowing that the page is becoming writable:
>
> 	vm_operations_struct::page_mkwrite()
>
>  (c) do_wp_page() calls out to the filesystem.
>
>  (d) NFS sees the page is wanting to become writable and waits for the
>      PG_fs_misc flag to become cleared.
>
>  (e) NFS returns to the caller and things proceed as normal.
>
> Doing this permits the cache state to be more predictable in the event of
> power loss because we know that userspace won't have scribbled on this page
> whilst the cache was trying to write it to disk.

Hi David,

To be honest I'm having some trouble following this through logically.  I'll 
read through a few more times and see if that fixes the problem.  This seems 
cluster-related, so I have an interest.

Who is using this interface?

Regards,

Daniel

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-10 22:57             ` Daniel Phillips
@ 2005-08-10 23:23               ` Trond Myklebust
  2005-08-11  9:42               ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Trond Myklebust @ 2005-08-10 23:23 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Andrew Morton, linux-kernel, linux-mm, Hugh Dickins, David Howells

On Thu, 11.08.2005 at 08:57 (+1000), Daniel Phillips wrote:
> > What "NFS-related colliding use of page flags bit 8"?
> 
> As explained to me:
> 
> http://marc.theaimsgroup.com/?l=linux-kernel&m=112368417412580&w=2

Oh. You are talking about CacheFS? That hasn't been declared "ready to
merge" yet.

That said, is it really safe to use any flags other than
PG_lock/PG_writeback there, David? I can't see that you want to allow
other tasks to modify or free the page while you are writing it to the
local cache.

Cheers,
  Trond


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-10 22:23         ` Daniel Phillips
  2005-08-10 22:34           ` Trond Myklebust
@ 2005-08-10 23:42           ` Adrian Bunk
  2005-08-11  9:46           ` David Howells
  2 siblings, 0 replies; 91+ messages in thread
From: Adrian Bunk @ 2005-08-10 23:42 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Andrew Morton, linux-kernel, linux-mm, Hugh Dickins, David Howells

On Thu, Aug 11, 2005 at 08:23:53AM +1000, Daniel Phillips wrote:
> Note: I have not fully audited the NFS-related colliding use of page flags bit 
> 8, to verify that it really does not escape into VFS or MM from NFS, in fact 
> I have misgivings about end_page_fs_misc which uses this flag but has no 
> in-tree users to show how it is used and, hmm, isn't even _GPL.  What is up?
> 
> And note the wrongness tacked onto the end of page-flags.h.  I didn't do it!

This is provide-a-filesystem-specific-syncable-page-bit.patch, and it's 
only in -mm.

Since this was done only for CacheFS, and Andrew dropped CacheFS from 
-mm he could drop this patch as well.

> Regards,
> 
> Daniel

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-10  9:27             ` Benjamin Herrenschmidt
@ 2005-08-11  9:09               ` Nick Piggin
  0 siblings, 0 replies; 91+ messages in thread
From: Nick Piggin @ 2005-08-11  9:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Russell King, Martin J. Bligh, ncunningham, Daniel Phillips,
	Linux Kernel Mailing List, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli

Benjamin Herrenschmidt wrote:
> On Tue, 2005-08-09 at 20:41 +0100, Russell King wrote:
> 
>>On Tue, Aug 09, 2005 at 07:38:52AM -0700, Martin J. Bligh wrote:
>>
>>>pfn_valid() doesn't tell you it's RAM or not - it tells you whether you
>>>have a backing struct page for that address. Could be an IO mapped device,
>>>a small memory hole, whatever.
>>
>>The only thing which has a struct page is RAM.  Nothing else does.
> 
> 
> Well, not anymore :)
> 

Well thanks everyone for the discussion and input. If I have missed
answering a question, please just mail me privately to let me know.

I guess that despite some architecture implementation differences,
everyone will be happy to see PageReserved go from core code. So
I will send Andrew the patches.

After that, we have a few options to move forward with completely
getting rid of the flag from the other funny places it has cropped
up. A portable page_is_ram() sounds like the best way to go, as it
would not use up a page flag.

As far as ioremap goes - I would rather completely disallow it from
remapping physical pages and enforce that where possible (eg. with
page_is_ram()).
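
A rough user-space sketch of such a check follows. Everything in it is an
assumption made for illustration: the RAM map, the 4K page size and the name
ioremap_allowed_model() are hypothetical, and the real enforcement would live
in each architecture's ioremap code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assume 4K pages for the model */

/* Half-open pfn range [start_pfn, end_pfn) that is backed by RAM. */
struct ram_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/* Hypothetical memory map: two RAM areas with an I/O hole between them. */
static const struct ram_range ram_map[] = {
	{ 0, 160 },
	{ 256, 1024 },
};

/*
 * Allow the mapping only if no page in [phys_addr, phys_addr + size)
 * overlaps a RAM range.
 */
bool ioremap_allowed_model(unsigned long phys_addr, unsigned long size)
{
	unsigned long first_pfn, last_pfn;
	size_t i;

	assert(size > 0);
	first_pfn = phys_addr >> MODEL_PAGE_SHIFT;
	last_pfn = (phys_addr + size - 1) >> MODEL_PAGE_SHIFT;

	for (i = 0; i < sizeof(ram_map) / sizeof(ram_map[0]); i++)
		if (first_pfn < ram_map[i].end_pfn &&
		    last_pfn >= ram_map[i].start_pfn)
			return false;	/* request touches RAM: refuse */
	return true;
}
```

With this map, a request for the hole at 0xA0000 (128K long) is allowed, while
any request that touches even one RAM page is refused.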

However, these issues (page_is_ram, swsusp, ioremap) need not be
tackled right now. I will bring them up on the lists some time after
the core mm/ is working nicely without PageReserved.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-08 21:54     ` Andrew Morton
  2005-08-09 23:23       ` [RFC][PATCH] Rename PageChecked as PageMiscFS Daniel Phillips
  2005-08-10 22:12       ` Daniel Phillips
@ 2005-08-11  9:26       ` David Howells
  2005-08-12  3:29         ` Daniel Phillips
  2005-08-12 12:41         ` David Howells
  2 siblings, 2 replies; 91+ messages in thread
From: David Howells @ 2005-08-11  9:26 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Andrew Morton, linux-kernel, linux-mm, Hugh Dickins, David Howells

Daniel Phillips <phillips@arcor.de> wrote:

> 
> This filesystem-specific flag needs to be prevented from escaping into other
> subsystems that might interact, such as VM.  The current usage is mainly
> for directories, except for Reiser4, which uses it for journalling
> ..
> +	SetPageMiscFS(page);

Can you please retain the *PageFsMisc names I've been using in my stuff?

In my opinion putting the "Fs" bit first gives a clearer indication that this
is a bit exclusively for the use of filesystems in general.

> +#define PG_fs_misc		 8	/* don't let me spread */

Should perhaps be:

  +#define PG_fs_misc		 8	/* for internal filesystem use only */

> and NFS, which presses it into service in a network cache coherency role.

The patches to make the AFS filesystem use it were removed, pending a release
of updated filesystem caching patches.

The NFS filesystem patches that use it haven't yet found their way into
Andrew's tree, but are also being held pending FS-Cache being updated.

If you wish, I will send the FS-Cache patch, the AFS patch and the NFS patch
to Andrew so that you can see. CacheFS needs more work, however, before that
can be re-released.

David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-10 22:12       ` Daniel Phillips
  2005-08-10 22:23         ` Daniel Phillips
@ 2005-08-11  9:31         ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: David Howells @ 2005-08-11  9:31 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Andrew Morton, linux-kernel, linux-mm, Hugh Dickins, David Howells

Daniel Phillips <phillips@arcor.de> wrote:

> Note: I have not fully audited the NFS-related colliding use of page flags
> bit 8,

Nor will you be able to until the NFS caching patches are released.

> to verify that it really does not escape into VFS or MM from NFS, in fact I
> have misgivings about end_page_fs_misc which uses this flag

end_page_fs_misc() simply makes use of the same waitqueues as other page
flags. This is surely preferable to instituting a whole new table of
waitqueues just for this flag.

> but has no in-tree users to show how it is used

It did have one: fs/afs/. But the patch has been temporarily removed. Look
back into, say, 2.6.13-rc2-mm1.

> and, hmm, isn't even _GPL.  What is up?

EXPORT_SYMBOL_GPL() is a bad idea. It should die as it gives the wrong
impression.

David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-10 22:57             ` Daniel Phillips
  2005-08-10 23:23               ` Trond Myklebust
@ 2005-08-11  9:42               ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: David Howells @ 2005-08-11  9:42 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Daniel Phillips, Andrew Morton, linux-kernel, linux-mm,
	Hugh Dickins, David Howells

Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> > http://marc.theaimsgroup.com/?l=linux-kernel&m=112368417412580&w=2
> 
> Oh. You are talking about CacheFS? That hasn't been declared "ready to
> merge" yet.

I can probably put out FS-Cache now, and the patches for kAFS and NFS to use
it. CacheFS is taking a little longer than expected because I'm having to be
so careful about ENOMEM handling.

> That said, is it really safe to use any flags other than
> PG_lock/PG_writeback there, David?

If I use PG_locked, that hurts performance horribly as readpage() can't then
unlock the page until the page has been read from the network _and_ has been
written to the cache, two operations which _must_ of necessity be sequential.

I can't use PG_writeback to cover the write to the cache as that indicates
write completion to the network. Writes to the cache and the network may run
in parallel, and so you need two flags to keep track of the completion state
of both.

> I can't see that you want to allow other tasks to modify or free the page
> while you are writing it to the local cache.

I don't. Hence the use of a combination of the PG_fs_misc bit and the
page_mkwrite() VMA op.

The page release address space op also waits for the PG_fs_misc bit.

David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-10 22:23         ` Daniel Phillips
  2005-08-10 22:34           ` Trond Myklebust
  2005-08-10 23:42           ` Adrian Bunk
@ 2005-08-11  9:46           ` David Howells
  2005-08-12  2:34             ` Daniel Phillips
  2005-08-12 12:32             ` David Howells
  2 siblings, 2 replies; 91+ messages in thread
From: David Howells @ 2005-08-11  9:46 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Daniel Phillips, Andrew Morton, linux-kernel, linux-mm,
	Hugh Dickins, David Howells

Adrian Bunk <bunk@stusta.de> wrote:

> Since this was done only for CacheFS, and Andrew dropped CacheFS from 
> -mm he could drop this patch as well.

I asked him not to. Somewhat at his instigation, I requested that he drop the
filesystem caching patches for the moment. I'm updating them and they'll be
back soon. Taking out this and the other remaining patch means he'll just be
given them back again shortly.

I know you want to ruthlessly trim out anything that isn't used, but please be
patient:-)

David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-10 21:50           ` Pavel Machek
  2005-08-10 21:56             ` Martin J. Bligh
@ 2005-08-11 10:26             ` Rafael J. Wysocki
  1 sibling, 0 replies; 91+ messages in thread
From: Rafael J. Wysocki @ 2005-08-11 10:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Pavel Machek, Daniel Phillips, Nick Piggin,
	Benjamin Herrenschmidt, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli

Hi,

On Wednesday, 10 of August 2005 23:50, Pavel Machek wrote:
> Hi!
> 
> > > Swsusp is the main "is valid ram" user I have in mind here. It
> > > wants to know whether or not it should save and restore the
> > > memory of a given `struct page`.
> > 
> > Why can't it follow the rmap chain?
> 
> It is walking physical memory, not memory management chains. I need
> something like:
> 
> static int saveable(struct zone * zone, unsigned long * zone_pfn)
> {
>         unsigned long pfn = *zone_pfn + zone->zone_start_pfn;
>         struct page * page;
> 
>         if (!pfn_valid(pfn))
>                 return 0;
> 
>         page = pfn_to_page(pfn);
>         BUG_ON(PageReserved(page) && PageNosave(page));
>         if (PageNosave(page))
>                 return 0;
>         if (PageReserved(page) && pfn_is_nosave(pfn)) {

This is only a trick to avoid calling pfn_is_nosave(pfn) for every single page
that is neither PageNosave nor PageNosaveFree, isn't it?

>                 pr_debug("[nosave pfn 0x%lx]", pfn);
>                 return 0;
>         }
>         if (PageNosaveFree(page))
>                 return 0;
> 
>         return 1;
> }

IMO it is safe to drop PageReserved from this function completely, which is
done in the following (experimental) patch (tested on x86-64).

Greets,
Rafael


Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

Index: linux-2.6.13-rc5-mm1/kernel/power/swsusp.c
===================================================================
--- linux-2.6.13-rc5-mm1.orig/kernel/power/swsusp.c
+++ linux-2.6.13-rc5-mm1/kernel/power/swsusp.c
@@ -674,15 +674,14 @@ static int saveable(struct zone * zone, 
 		return 0;
 
 	page = pfn_to_page(pfn);
-	BUG_ON(PageReserved(page) && PageNosave(page));
 	if (PageNosave(page))
 		return 0;
-	if (PageReserved(page) && pfn_is_nosave(pfn)) {
-		pr_debug("[nosave pfn 0x%lx]", pfn);
-		return 0;
-	}
 	if (PageNosaveFree(page))
 		return 0;
+	if (pfn_is_nosave(pfn)) {
+		pr_debug("  [nosave pfn 0x%lx]\n", pfn);
+		return 0;
+	}
 
 	return 1;
 }
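
For reference, the order of tests after this change can be modelled in user
space as below. This is only a sketch: struct toy_page, its fields, and the
nosave region bounds are assumptions made for illustration, not the kernel's
struct page or the real nosave section.

```c
#include <stdbool.h>

/* Toy per-page state; the field names are illustrative only. */
struct toy_page {
	bool valid;		/* stands in for pfn_valid() */
	bool nosave;		/* stands in for PageNosave() */
	bool nosave_free;	/* stands in for PageNosaveFree() */
};

/* Assumed bounds of the nosave region, as page frame numbers. */
static const unsigned long toy_nosave_begin_pfn = 4;
static const unsigned long toy_nosave_end_pfn = 6;

static bool toy_pfn_is_nosave(unsigned long pfn)
{
	return pfn >= toy_nosave_begin_pfn && pfn < toy_nosave_end_pfn;
}

/* Same order of tests as the patched function; no PageReserved check. */
static bool toy_saveable(const struct toy_page *pages, unsigned long pfn)
{
	const struct toy_page *page = &pages[pfn];

	if (!page->valid)
		return false;
	if (page->nosave)
		return false;
	if (page->nosave_free)
		return false;
	if (toy_pfn_is_nosave(pfn))
		return false;
	return true;
}
```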
 

-- 
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
		-- Lewis Carroll "Alice's Adventures in Wonderland"

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-10 21:56             ` Martin J. Bligh
@ 2005-08-11 10:36               ` Rafael J. Wysocki
  2005-08-12 19:56                 ` Daniel Phillips
  0 siblings, 1 reply; 91+ messages in thread
From: Rafael J. Wysocki @ 2005-08-11 10:36 UTC (permalink / raw)
  To: linux-kernel, Martin J. Bligh
  Cc: Pavel Machek, Daniel Phillips, Nick Piggin,
	Benjamin Herrenschmidt, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli

Hi,

On Wednesday, 10 of August 2005 23:56, Martin J. Bligh wrote:
> --On Wednesday, August 10, 2005 23:50:22 +0200 Pavel Machek <pavel@suse.cz> wrote:
> 
> > Hi!
> > 
> >> > Swsusp is the main "is valid ram" user I have in mind here. It
> >> > wants to know whether or not it should save and restore the
> >> > memory of a given `struct page`.
> >> 
> >> Why can't it follow the rmap chain?
> > 
> > It is walking physical memory, not memory management chains. I need
> > something like:
> 
> Can you not use page_is_ram(pfn) ?

IMHO it would be inefficient.

There obviously are some non-RAM pages that should not be saved and there are
some that are not worthy of saving, although they are RAM (eg because they never
change), but this is very architecture-dependent.  The arch code should mark them
as PageNosave for swsusp, and that's enough.

Greets,
Rafael


-- 
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
		-- Lewis Carroll "Alice's Adventures in Wonderland"

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-10 14:27       ` David Howells
  2005-08-10 23:19         ` Daniel Phillips
@ 2005-08-11 10:49         ` David Howells
  2005-08-12 19:34           ` Daniel Phillips
  2005-08-15 13:15           ` David Howells
  1 sibling, 2 replies; 91+ messages in thread
From: David Howells @ 2005-08-11 10:49 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Howells, Andrew Morton, linux-kernel, linux-mm, hugh

Daniel Phillips <phillips@arcor.de> wrote:

> To be honest I'm having some trouble following this through logically.  I'll
> read through a few more times and see if that fixes the problem.  This seems
> cluster-related, so I have an interest.

Well, perhaps I can explain the function for which I'm using this page flag
more clearly. You'll have to excuse me if it's covering stuff you don't know,
but I want to take it from first principles; plus this stuff might well find
its way into the kernel docs.

We want to use a relatively fast medium (such as RAM or local disk) to speed
up repeated accesses to a relatively slow medium (such as NFS, NBD, CDROM) by
means of caching the results of previous accesses to the slow medium on the
fast medium.

Now we already do this at one level: RAM. The page cache _is_ such a cache,
but whilst it's much faster than a disk, it is severely restricted in size
compared to media such as disks, it's more expensive, and its contents
generally don't last over power failure or reboots. The major attribute of the
page cache is that the CPU can access it directly.

So we want to add another level: local disk. The FS-Cache/CacheFS patches
permit filesystems such as AFS and NFS to use local disk as a cache.

So, assume that NFS is using a local disk cache (it doesn't matter whether
it's CacheFS, CacheFiles, or something else), and assume a process has a file
open through NFS.

The process attempts to read from the file. This causes the NFS readpage() or
readpages() operation to be invoked to load the data into the page cache so
that the CPU can make use of it.

So the NFS page reading algorithm first consults the disk cache. Assume this
returns a negative response - NFS will then read from the server into the page
cache. Under cacheless operation, it would then unlock the page and the kernel
could then let userspace play with it, but we're dealing with a cache, and so
the newly fetched data must be stored in the disk cache for future retrieval.

NFS now has three choices:

 (1) It could instigate a write to the disk cache and wait for that to
     complete before unlocking the page and letting userspace see it, but we
     don't know how long that might take.

     CacheFS immediately dispatches a write BIO to get it DMA'd to the disk as
     soon as possible, but something like CacheFiles is dependent on an
     underlying filesystem - be it EXT3, ReiserFS, XFS, etc. - to perform the
     write, and we've no control over that.

	Time to unlock: CacheMiss + NetRead + CacheWrite
	Cache reliable: Yes

 (2) It could just unlock the page and let userspace scribble on it whilst
     simultaneously writing it to the cache. But that means the DMA to the
     disk may pick up some of userspace's scribblings, and that means you
     can't trust what's in the cache in the event of a power loss.

     This can be alleviated by marking untrustworthy files in the cache, but
     that then extends the management time in several ways.

	Time to unlock: CacheMiss + NetRead
	Cache reliable: No

 (3) It could tell the cache that the page needs writing to disk and then
     unlock it for userspace to read, but intercept the change of a PTE
     pointing to this page when it loses its write protection (PTEs start off
     read-only, generating a write protection fault on the first write).

     The interceptor would then force userspace to wait for the cache to
     finish DMA'ing the page before writing to it.

     Similarly, the write() or prepare_write() operations would wait for the
     cache to finish with that page.

	Time to unlock: CacheMiss + NetRead
	Cache reliable: Yes

I originally chose option (1), but then I saw just how much it affected
performance and worked on option (3).

I discarded option (2) because I want to be able to have some surety about the
state in the cache - I don't want to have to reinitialise it after a power
failure. Imagine if you cache /usr... Imagine if everyone in a very large
office caches /usr...

So, the way I implemented (3) is to use an extra page flag to indicate a write
underway to the cache, and thus allow cache write status to be determined when
someone wants to scribble on a page.

The fscache_write_page() function takes a pointer to a callback function. In
NFS this function clears the PG_fs_misc bit on the appropriate pages and wakes
up anyone who was waiting for this event (end_page_fs_misc()).

The NFS page_mkwrite() VMA op calls wait_on_page_fs_misc() to wait on that
page bit if it is set.
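
The flag protocol just described can be modelled in user space roughly as
below. This is only a sketch: the toy_* names are invented for illustration,
the real code sleeps on a waitqueue rather than returning a boolean, and none
of this is the actual FS-Cache or NFS code. The bit number is the one from
the PG_fs_misc definition quoted earlier in the thread.

```c
#include <stdbool.h>

#define TOY_PG_FS_MISC	(1UL << 8)	/* bit 8, as in the quoted patch */

/* Toy page: only the flags word matters for this model. */
struct toy_page {
	unsigned long flags;
};

/* The netfs hands the page to the cache: mark a cache write as underway. */
static void toy_start_cache_write(struct toy_page *page)
{
	page->flags |= TOY_PG_FS_MISC;
}

/* Completion callback: clear the bit (real code would also wake waiters). */
static void toy_end_cache_write(struct toy_page *page)
{
	page->flags &= ~TOY_PG_FS_MISC;
}

/*
 * Models the first-write fault path: a writer may only proceed once the
 * cache has finished with the page. Readers are never held up.
 */
static bool toy_page_mkwrite_may_proceed(const struct toy_page *page)
{
	return !(page->flags & TOY_PG_FS_MISC);
}
```

The point of the design is that the page can be unlocked for reading straight
after the network read, while only the first attempt to write to it is made
to wait for the cache.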

> Who is using this interface?

AFS and NFS will both use it. There may be others eventually who use it for
the same purpose. CacheFS has a different use for it internally.

David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-11  9:46           ` David Howells
@ 2005-08-12  2:34             ` Daniel Phillips
  2005-08-12 12:32             ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-12  2:34 UTC (permalink / raw)
  To: David Howells
  Cc: Adrian Bunk, Andrew Morton, linux-kernel, linux-mm, Hugh Dickins

On Thursday 11 August 2005 19:46, David Howells wrote:
> Adrian Bunk <bunk@stusta.de> wrote:
> > Since this was done only for CacheFS, and Andrew dropped CacheFS from
> > -mm he could drop this patch as well.
>
> I asked him not to. Somewhat at his instigation, I requested that he drop
> the filesystem caching patches for the moment. I'm updating them and
> they'll be back soon. Taking out this and the other remaining patch means
> he'll just be given them back again shortly.
>
> I know you want to ruthlessly trim out anything that isn't used, but please
> be patient:-)

Are you sure CacheFS is even the right way to do client-side caching?  What is 
wrong with handling the backing store directly in your network filesystem?  
You have to hack your filesystem to use CacheFS anyway, so why not write some 
library functions to handle the backing store mapping and turn the hack into 
a few library calls instead?

I just don't see how turning this functionality into a filesystem is the right 
abstraction.  What actual advantage is there?  I noticed somebody out there 
on the web waxing poetic about how the administrator can look into the cache, 
see what is cached, and even delete some of it.  That just makes me cringe.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-11  9:26       ` David Howells
@ 2005-08-12  3:29         ` Daniel Phillips
  2005-08-12 12:41         ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-12  3:29 UTC (permalink / raw)
  To: David Howells; +Cc: Andrew Morton, linux-kernel, linux-mm, Hugh Dickins

On Thursday 11 August 2005 19:26, David Howells wrote:
> Daniel Phillips <phillips@arcor.de> wrote:
> > +	SetPageMiscFS(page);
>
> Can you please retain the *PageFsMisc names I've been using in my stuff?
>
> In my opinion putting the "Fs" bit first gives a clearer indication that
> this is a bit exclusively for the use of filesystems in general.

You also achieved some sort of new low point in the abuse of StudlyCaps there. 
Please, let's not get started on mixed case acronyms.

Anyway, it sounds like you want to bless the use of private page flags in 
filesystems.  That is most probably a bad idea.  Take a browse through the 
existing users and feast your eyes on the spectacular lack of elegance.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-11  9:46           ` David Howells
  2005-08-12  2:34             ` Daniel Phillips
@ 2005-08-12 12:32             ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: David Howells @ 2005-08-12 12:32 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Howells, Adrian Bunk, Andrew Morton, linux-kernel,
	linux-mm, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 3284 bytes --]

Daniel Phillips <phillips@arcor.de> wrote:

> > I know you want to ruthlessly trim out anything that isn't used, but please
> > be patient:-)
> 
> Are you sure CacheFS is even the right way to do client-side caching?

It's just one way. See the attached document for how it works.

> What is wrong with handling the backing store directly in your network
> filesystem?

What do you mean by "handle the backing store"? Note that the system I'm
proposing involves directly moving data between netfs pages and the cache. I'm
trying very hard to avoid copying the data any more than I have to.

> You have to hack your filesystem to use CacheFS anyway, so why not write
> some library functions to handle the backing store mapping and turn the hack
> into a few library calls instead?

FS-Cache is just that. CacheFS is one of a number of proposed backends.

> I just don't see how turning this functionality into a filesystem is the
> right abstraction.  What actual advantage is there?  I noticed somebody out
> there on the web waxing poetic about how the administrator can look into the
> cache, see what is cached, and even delete some of it.  That just makes me
> cringe.

Well... With CacheFS you can't do that; not now, at least.

Using a block device has the very great advantage that it's a lot easier to
provide guarantees about service quality. Reading an NFS file through CacheFS
on a blockdev seems to be somewhat faster than reading the same file from
EXT2. I'm not sure why, but I'm sure Stephen and others will be very
interested if I find out.

The downside of using a block device is that you have to have one available,
and it can't easily be used for something else. Actually, this last isn't
entirely true: CacheFS is a filesystem after all...

Actually, given that CacheFS is a filesystem, that makes the userspace UI for
using it very simple...

Besides, who says CacheFS will be the only back end? CacheFiles is coming too,
but CacheFiles is, in many ways, a lot harder as I have to work through an
existing filesystem, using existing access functions. Not only that, but
CacheFiles can't provide a guarantee of minimum space and can't provide
reservations. CacheFiles has to be able to use O_DIRECT (which I have a patch
for), but has to be able to detect holes in the backing file.

Whatever you do, do not forget the following hard requirements:

 (1) It must be trivially possible to run without a cache.

 (2) It must be possible to access a file that's larger than the maximum size
     of the cache.

 (3) It must be possible to simultaneously access a set of files that are
     larger than the maximum size of the cache.

 (4) It mustn't take hours to open a huge file, just so you can access one
     block.

 (5) The cache must be able to survive power failure, and be recovered into a
     known state.

 (6) It must be possible to ignore I/O errors on the cache.

 (7) There mustn't be too much change to the netfs. FS-Cache doesn't really
     have that much of an impact on any filesystem that wishes to use it.

Note that if you're thinking of using i_host on the netfs inode to point at
the cache inode, and downloading the entire file on iget(), possibly in
userspace, then forget it: that violates (2), (3), (4) and (6) at the very
least.

David

[-- Attachment #2: fscache.txt --]
[-- Type: text/plain, Size: 6771 bytes --]

			  ==========================
			  General Filesystem Caching
			  ==========================

========
OVERVIEW
========

This facility is a general purpose cache for network filesystems, though it
could be used for caching other things such as ISO9660 filesystems too.

FS-Cache mediates between cache backends (such as CacheFS) and network
filesystems:

	+---------+
	|         |                        +-----------+
	|   NFS   |--+                     |           |
	|         |  |                 +-->|  CacheFS  |
	+---------+  |   +----------+  |   | /dev/hda5 |
	             |   |          |  |   +-----------+
	+---------+  +-->|          |  |
	|         |      |          |--+   +-------------+
	|   AFS   |----->| FS-Cache |      |             |
	|         |      |          |----->| Cache Files |
	+---------+  +-->|          |      | /var/cache  |
	             |   |          |--+   +-------------+
	+---------+  |   +----------+  |
	|         |  |                 |   +-------------+
	|  ISOFS  |--+                 |   |             |
	|         |                    +-->| ReiserCache |
	+---------+                        | /           |
	                                   +-------------+

FS-Cache does not follow the idea of completely loading every netfs file
opened in its entirety into a cache before permitting it to be accessed and
then serving the pages out of that cache rather than the netfs inode because:

 (1) It must be practical to operate without a cache.

 (2) The size of any accessible file must not be limited to the size of the
     cache.

 (3) The combined size of all opened files (this includes mapped libraries)
     must not be limited to the size of the cache.

 (4) The user should not be forced to download an entire file just to do a
     one-off access of a small portion of it (such as might be done with the
     "file" program).

It instead serves the cache out in PAGE_SIZE chunks as and when requested by
the netfs('s) using it.
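
As a rough illustration of why a one-off access stays cheap under this scheme:
the number of page-sized chunks touched depends only on the byte range read,
not on the size of the file. The helper below is hypothetical (it is not part
of FS-Cache) and assumes a 4096-byte page:

```c
#define TOY_PAGE_SIZE	4096UL	/* assumed page size for the example */

struct toy_page_span {
	unsigned long first;	/* index of the first page touched */
	unsigned long count;	/* number of pages touched */
};

/* Pages that must be present to satisfy a read of len bytes at offset. */
static struct toy_page_span toy_pages_for_read(unsigned long long offset,
					       unsigned long len)
{
	struct toy_page_span span = { 0, 0 };

	if (len == 0)
		return span;

	span.first = (unsigned long)(offset / TOY_PAGE_SIZE);
	span.count = (unsigned long)((offset + len - 1) / TOY_PAGE_SIZE)
		     - span.first + 1;
	return span;
}
```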


FS-Cache provides the following facilities:

 (1) More than one cache can be used at once. Caches can be selected explicitly
     by use of tags.

 (2) Caches can be added / removed at any time.

 (3) The netfs is provided with an interface that allows either party to
     withdraw caching facilities from a file (required for (2)).

 (4) The interface to the netfs returns as few errors as possible, preferring
     rather to let the netfs remain oblivious.

 (5) Cookies are used to represent indexes, files and other objects to the
     netfs. The simplest cookie is just a NULL pointer - indicating nothing
     cached there.

 (6) The netfs is allowed to propose - dynamically - any index hierarchy it
     desires, though it must be aware that the index search function is
     recursive, stack space is limited, and indexes can only be children of
     indexes.

 (7) Data I/O is done direct to and from the netfs's pages. The netfs indicates
     that page A is at index B of the data-file represented by cookie C, and
     that it should be read or written. The cache backend may or may not start
     I/O on that page, but if it does, a netfs callback will be invoked to
     indicate completion. The I/O may be either synchronous or asynchronous.

 (8) Cookies can be "retired" upon release. At this point FS-Cache will mark
     them as obsolete and the index hierarchy rooted at that point will get
     recycled.

 (9) The netfs provides a "match" function for index searches. In addition to
     saying whether a match was made or not, this can also specify that an
     entry should be updated or deleted.


FS-Cache maintains a virtual indexing tree in which all indexes, files, objects
and pages are kept. Bits of this tree may actually reside in one or more
caches.

                                           FSDEF
                                             |
                        +------------------------------------+
                        |                                    |
                       NFS                                  AFS
                        |                                    |
           +--------------------------+                +-----------+
           |                          |                |           |
        homedir                     mirror          afs.org   redhat.com
           |                          |                            |
     +------------+           +---------------+              +----------+
     |            |           |               |              |          |
   00001        00002       00007           00125        vol00001   vol00002
     |            |           |               |                         |
 +---+---+     +-----+      +---+      +------+------+            +-----+----+
 |   |   |     |     |      |   |      |      |      |            |     |    |
PG0 PG1 PG2   PG0  XATTR   PG0 PG1   DIRENT DIRENT DIRENT        R/W   R/O  Bak
                     |                                            |
                    PG0                                       +-------+
                                                              |       |
                                                            00001   00003
                                                              |
                                                          +---+---+
                                                          |   |   |
                                                         PG0 PG1 PG2

In the example above, you can see two netfs's being backed: NFS and AFS. These
have different index hierarchies:

 (*) The NFS primary index contains per-server indexes. Each server index is
     indexed by NFS file handles to get data file objects. Each data file
     object can have an array of pages, but may also have further child
     objects, such as extended attributes and directory entries. Extended
     attribute objects themselves have page-array contents.

 (*) The AFS primary index contains per-cell indexes. Each cell index contains
     per-logical-volume indexes. Each volume index contains up to three
     indexes for the read-write, read-only and backup mirrors of those
     volumes. Each of these contains vnode data file objects, each of which
     contains an array of pages.

The very top index is the FS-Cache master index in which individual netfs's
have entries.

Any index object may reside in more than one cache, provided it only has index
children. Any index with non-index object children will be assumed to only
reside in one cache.


The netfs API to FS-Cache can be found in:

	Documentation/filesystems/caching/netfs-api.txt

The cache backend API to FS-Cache can be found in:

	Documentation/filesystems/caching/backend-api.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-11  9:26       ` David Howells
  2005-08-12  3:29         ` Daniel Phillips
@ 2005-08-12 12:41         ` David Howells
  2005-08-12 13:28           ` Hugh Dickins
                             ` (2 more replies)
  1 sibling, 3 replies; 91+ messages in thread
From: David Howells @ 2005-08-12 12:41 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Howells, Andrew Morton, linux-kernel, linux-mm, Hugh Dickins

Daniel Phillips <phillips@istop.com> wrote:

> You also achieved some sort of new low point in the abuse of StudlyCaps
> there.  Please, let's not get started on mixed case acronyms.

My patch has been around for quite a while, and no-one else has complained,
not even you before this point. Plus, you don't seem to be complaining about
PageSwapCache... nor even PageLocked.

I'm just requesting that you base your stuff on my patch that's already in
-mm. The names in there are already in use, though not currently in the -mm
patch (the patches that use it have been temporarily dropped).

> Anyway, it sounds like you want to bless the use of private page flags in
> filesystems. That is most probably a bad idea.

Just because you don't like it doesn't make it a bad idea or wrong.

Please then suggest an alternative way of doing this. Do you understand the
problem I'm trying to solve?

> Take a browse through the existing users and feast your eyes on the
> spectacular lack of elegance.

There may be plenty of inelegance in the kernel, but this comment isn't very
helpful. I've looked at an awful lot of code and cogitated much and tried
different ways of doing things. Currently this is the best I've come up with.

David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-12 12:41         ` David Howells
@ 2005-08-12 13:28           ` Hugh Dickins
  2005-08-16 13:59           ` Pavel Machek
  2005-08-18 14:33           ` David Howells
  2 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2005-08-12 13:28 UTC (permalink / raw)
  To: David Howells; +Cc: Daniel Phillips, Andrew Morton, linux-kernel, linux-mm

On Fri, 12 Aug 2005, David Howells wrote:
> Daniel Phillips <phillips@istop.com> wrote:
> 
> I'm just requesting that you base your stuff on my patch that's already in
> -mm. The names in there are already in use, though not currently in the -mm
> patch (the patches that use it have been temporarily dropped).

Seconded: that would be fair, I see no reason to change your naming.

> > Anyway, it sounds like you want to bless the use of private page flags in
> > filesystems. That is most probably a bad idea.
> 
> Just because you don't like it doesn't make it a bad idea or wrong.

Seconded: I see no virtue in denying filesystems their one page flag.

Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-11 10:49         ` David Howells
@ 2005-08-12 19:34           ` Daniel Phillips
  2005-08-15 13:15           ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-12 19:34 UTC (permalink / raw)
  To: David Howells; +Cc: Andrew Morton, linux-kernel, linux-mm, hugh

On Thursday 11 August 2005 20:49, David Howells wrote:
> Daniel Phillips <phillips@arcor.de> wrote:
> > To be honest I'm having some trouble following this through logically. 
> > I'll read through a few more times and see if that fixes the problem. 
> > This seems cluster-related, so I have an interest.
>
> Well, perhaps I can explain the function for which I'm using this page flag
> more clearly. You'll have to excuse me if it's covering stuff you don't
> know, but I want to take it from first principles; plus this stuff might
> well find its way into the kernel docs.
>
>
> We want to use a relatively fast medium (such as RAM or local disk) to
> speed up repeated accesses to a relatively slow medium (such as NFS, NBD,
> CDROM) by means of caching the results of previous accesses to the slow
> medium on the fast medium.
>
> Now we already do this at one level: RAM. The page cache _is_ such a cache,
> but whilst it's much faster than a disk, it is severely restricted in size

Did you just suggest that 16 TB/address_space is too small to cache NFS pages?

> compared to media such as disks, it's more expensive

It is?

> and its contents generally don't last over power failure or reboots.

When used by RAMFS maybe.  But fortunately the page cache has a backing store 
API, in fact, that is its raison d'etre.

> The major attribute of the page cache is that the CPU can access it
> directly. 

You seem to have forgotten about non-resident pages.

> So we want to add another level: local disk. The FS-Cache/CacheFS patches
> permit filesystems such as AFS and NFS to use local disk as a cache.

The page cache already lets you do that.  I have not yet discerned a 
fundamental reason why you need to interface to another filesystem to 
implement backing store for an address_space.

> So, assume that NFS is using a local disk cache (it doesn't matter whether
> it's CacheFS, CacheFiles, or something else), and assume a process has a
> file open through NFS.
>
> The process attempts to read from the file. This causes the NFS readpage()
> or readpages() operation to be invoked to load the data into the page cache
> so that the CPU can make use of it.
>
> So the NFS page reading algorithm first consults the disk cache.  Assume 
> this returns a negative response - NFS will then read from the server into
> the page cache. Under cacheless operation, it would then unlock the page
> and the kernel could then let userspace play with it, but we're dealing
> with a cache, and so the newly fetched data must be stored in the disk
> cache for future retrieval.
>
> NFS now has three choices:
>
>  (1) It could instigate a write to the disk cache and wait for that to
>      complete before unlocking the page and letting userspace see it, but
>      we don't know how long that might take.

Pages are typically unlocked while being written to backing store, e.g.:

http://lxr.linux.no/source/fs/buffer.c#L1839

What makes NFS special in this regard?

>      CacheFS immediately dispatches a write BIO to get it DMA'd to the disk
>      as soon as possible, but something like CacheFiles is dependent on an
>      underlying filesystem - be it EXT3, ReiserFS, XFS, etc. - to perform the
>      write, and we've no control over that.

That is a problem you are in the process of inventing.

> 	Time to unlock: CacheMiss + NetRead + CacheWrite
> 	Cache reliable: Yes
>
>  (2) It could just unlock the page and let userspace scribble on it whilst
>      simultaneously writing it to the cache. But that means the DMA to the
>      disk may pick up some of userspace's scribblings, and that means you
>      can't trust what's in the cache in the event of a power loss.

I thought I saw a journal in there.  Anyway, if the user has asked for a racy 
write, that is what they should get.

>      This can be alleviated by marking untrustworthy files in the cache,
>      but that then extends the management time in several ways.
>
> 	Time to unlock: CacheMiss + NetRead
> 	Cache reliable: No

I think your definition of trustworthy goes beyond what is required by Posix 
or Linux local filesystem semantics.

>  (3) It could tell the cache that the page needs writing to disk and then
>      unlock it for userspace to read, but intercept the change of a PTE
>      pointing to this page when it loses its write protection (PTEs start
>      off read-only, generating a write protection fault on the first write).

We need to do something like this to implement cross-node caching of
shared-writeable mmaps.  This is another reason that your ideas need clear 
explanations: we need to go the rest of the way and get this sorted out for 
cluster filesystems in general, not just NFS (v4).  It does help a lot that 
you are attempting to explain what the needs of NFS actually are.  
Unfortunately, it seems you are proposing that this mechanism is essential 
even for single-node use, which is far from clear.

>      The interceptor would then force userspace to wait for the cache to
>      finish DMA'ing the page before writing to it.
>
>      Similarly, the write() or prepare_write() operations would wait for
>      the cache to finish with that page.

Here you return to the assumption that the VFS should enforce per-page write 
granularity.  There is no such rule as far as I know.

> 	Time to unlock: CacheMiss + NetRead
> 	Cache reliable: Yes
>
> I originally chose option (1), but then I saw just how much it affected
> performance and worked on option (3).
>
> I discarded option (2) because I want to be able to have some surety about
> the state in the cache - I don't want to have to reinitialise it after a
> power failure. Imagine if you cache /usr... Imagine if everyone in a very
> large office caches /usr...
>
>
> So, the way I implemented (3) is to use an extra page flag to indicate a
> write underway to the cache, and thus allow cache write status to be
> determined when someone wants to scribble on a page.
>
> The fscache_write_page() function takes a pointer to a callback function.
> In NFS this function clears the PG_fs_misc bit on the appropriate pages and
> wakes up anyone who was waiting for this event (end_page_fs_misc()).
>
> The NFS page_mkwrite() VMA op calls wait_on_page_fs_misc() to wait on that
> page bit if it is set.
>
> > Who is using this interface?
>
> AFS and NFS will both use it. There may be others eventually who use it for
> the same purpose. CacheFS has a different use for it internally.

Let's try to clear up the page write atomicity question, please.  It seems 
your argument depends on it.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-11 10:36               ` Rafael J. Wysocki
@ 2005-08-12 19:56                 ` Daniel Phillips
  2005-08-12 22:20                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 91+ messages in thread
From: Daniel Phillips @ 2005-08-12 19:56 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, Martin J. Bligh, Pavel Machek, Nick Piggin,
	Benjamin Herrenschmidt, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli

On Thursday 11 August 2005 20:36, Rafael J. Wysocki wrote:
> > >> > Swsusp is the main "is valid ram" user I have in mind here. It
> > >> > wants to know whether or not it should save and restore the
> > >> > memory of a given `struct page`.
> > >>
> > >> Why can't it follow the rmap chain?
> > >
> > > It is walking physical memory, not memory management chains. I need
> > > something like:
> >
> > Can you not use page_is_ram(pfn) ?
>
> IMHO it would be inefficient.
>
> There obviously are some non-RAM pages that should not be saved and there
> are some that are not worthy of saving, although they are RAM (eg because
> they never change), but this is very architecture-dependent.  The arch code
> should mark them as PageNosave for swsusp, and that's enough.

I still don't see why you can't lift your flags up into the VMA.  The rmap 
mechanism is there precisely to let you get from the physical page to the 
users and user data, including VMAs.

I am also not sure why you are talking about efficiency here.  Did you measure 
the impact on suspend performance?

Regards,

Daniel

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-12 19:56                 ` Daniel Phillips
@ 2005-08-12 22:20                   ` Rafael J. Wysocki
  2005-08-12 23:04                     ` Daniel Phillips
  0 siblings, 1 reply; 91+ messages in thread
From: Rafael J. Wysocki @ 2005-08-12 22:20 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: linux-kernel, Martin J. Bligh, Pavel Machek, Nick Piggin,
	Benjamin Herrenschmidt, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli

On Friday, 12 of August 2005 21:56, Daniel Phillips wrote:
> On Thursday 11 August 2005 20:36, Rafael J. Wysocki wrote:
> > > >> > Swsusp is the main "is valid ram" user I have in mind here. It
> > > >> > wants to know whether or not it should save and restore the
> > > >> > memory of a given `struct page`.
> > > >>
> > > >> Why can't it follow the rmap chain?
> > > >
> > > > It is walking physical memory, not memory management chains. I need
> > > > something like:
> > >
> > > Can you not use page_is_ram(pfn) ?
> >
> > IMHO it would be inefficient.
> >
> > There obviously are some non-RAM pages that should not be saved and there
> > are some that are not worthy of saving, although they are RAM (eg because
> > they never change), but this is very architecture-dependent.  The arch code
> > should mark them as PageNosave for swsusp, and that's enough.
> 
> I still don't see why you can't lift your flags up into the VMA.  The rmap 
> mechanism is there precisely to let you get from the physical page to the 
> users and user data, including VMAs.

I'm not sure if I understand the issue, but swsusp works on a different level.
It only needs to figure out which physical pages, as represented by struct page
objects, should be saved to swap before suspend.  We browse all zones (once)
and create a list of page frames that should be saved on the basis of the contents
of the struct page objects alone.  IMHO if we needed to use any additional
mechanisms here, it would be less efficient than just checking the page flags.

> I am also not sure why you are talking about efficiency here.  Did you measure 
> the impact on suspend performance?

I should have said "not enough".  The problem is that there may be some page
frames corresponding to RAM (e.g. such that page_is_ram(pfn) is non-zero) which
for some reason should not be saved on a given architecture, and we need a
mechanism allowing us to identify them.
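
To make the single pass concrete, here is a minimal userspace sketch of
deciding from the flags alone which page frames to save.  It is not swsusp
code: the struct, the flag bits and the function name are all made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits, for illustration only. */
#define MODEL_PG_RAM	(1UL << 0)	/* frame is backed by real RAM */
#define MODEL_PG_NOSAVE	(1UL << 1)	/* arch code asked us not to save it */

/* Stand-in for struct page; only the flags word matters here. */
struct frame_desc {
	unsigned long flags;
};

/*
 * Walk the frame array once and record the frame numbers worth saving.
 * Returns the number of entries written to save_list, which must have
 * room for nr_frames entries.
 */
static size_t collect_saveable(const struct frame_desc *frames,
			       size_t nr_frames, size_t *save_list)
{
	size_t pfn, n = 0;

	assert(frames != NULL && save_list != NULL);
	for (pfn = 0; pfn < nr_frames; pfn++) {
		unsigned long f = frames[pfn].flags;

		/* RAM that nobody has marked "nosave" gets saved. */
		if ((f & MODEL_PG_RAM) && !(f & MODEL_PG_NOSAVE))
			save_list[n++] = pfn;
	}
	return n;
}
```

The point of the sketch is that the decision needs nothing beyond the
per-frame flags word, so no other data structure has to be consulted
during the walk.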

Greets,
Rafael


-- 
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
		-- Lewis Carroll "Alice's Adventures in Wonderland"

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-12 22:20                   ` Rafael J. Wysocki
@ 2005-08-12 23:04                     ` Daniel Phillips
  2005-08-13  7:06                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 91+ messages in thread
From: Daniel Phillips @ 2005-08-12 23:04 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, Martin J. Bligh, Pavel Machek, Nick Piggin,
	Benjamin Herrenschmidt, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli

On Saturday 13 August 2005 08:20, Rafael J. Wysocki wrote:
> On Friday, 12 of August 2005 21:56, Daniel Phillips wrote:
> > I still don't see why you can't lift your flags up into the VMA.  The
> > rmap mechanism is there precisely to let you get from the physical page
> > to the users and user data, including VMAs.
>
> I'm not sure if I understand the issue, but swsusp works on a different
> level. It only needs to figure out which physical pages, as represented by
> struct page objects, should be saved to swap before suspend.  We browse all
> zones (once) and create a list of page frames that should be saved on the
> basis of the contents of the struct page objects alone.  IMHO if we needed
> to use any additional mechanisms here, it would be less efficient than just
> checking the page flags.

Isn't that what hash tables are for?  It seems to me obvious that you don't 
absolutely need to reserve page flag bits, but you think this is better, 
maybe enough faster to make a perceptible difference.  How about testing with 
a hash table?  If it dims the lights then you have all the argument you need.

Admittedly, page flags have not gotten really tight just yet, and this is 
something you can change later if they do become tight.  But it would be very 
nice to know just which of those page flags are really needed (like uptodate) 
versus which are just there for convenience.  I think yours fall in the 
latter category.
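
For the sake of argument, the hash table experiment could look something
like the following userspace sketch.  All names and sizes here are made up;
it only shows that a "nosave" set can be kept outside struct page, not how
swsusp would actually do it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NOSAVE_SLOTS	64		/* power of two, arbitrary size */
#define SLOT_EMPTY	(~0UL)		/* assumes no real pfn has this value */

/* Fixed-size open-addressing set of page frame numbers. */
struct nosave_set {
	unsigned long slot[NOSAVE_SLOTS];
	size_t used;
};

static void nosave_init(struct nosave_set *s)
{
	size_t i;

	for (i = 0; i < NOSAVE_SLOTS; i++)
		s->slot[i] = SLOT_EMPTY;
	s->used = 0;
}

static size_t nosave_hash(unsigned long pfn)
{
	/* Multiplicative hash; the constant is an arbitrary odd number. */
	return (size_t)((pfn * 2654435761UL) & (NOSAVE_SLOTS - 1));
}

/* Mark a frame as "do not save".  Returns false if the table is full. */
static bool nosave_add(struct nosave_set *s, unsigned long pfn)
{
	size_t i = nosave_hash(pfn);

	assert(pfn != SLOT_EMPTY);
	while (s->slot[i] != SLOT_EMPTY) {
		if (s->slot[i] == pfn)
			return true;		/* already present */
		i = (i + 1) & (NOSAVE_SLOTS - 1);	/* linear probing */
	}
	if (s->used == NOSAVE_SLOTS - 1)
		return false;	/* keep one slot empty so probes always stop */
	s->slot[i] = pfn;
	s->used++;
	return true;
}

/* The lookup that would replace a page flag test during the scan. */
static bool nosave_test(const struct nosave_set *s, unsigned long pfn)
{
	size_t i = nosave_hash(pfn);

	while (s->slot[i] != SLOT_EMPTY) {
		if (s->slot[i] == pfn)
			return true;
		i = (i + 1) & (NOSAVE_SLOTS - 1);
	}
	return false;
}
```

Each lookup costs a hash and at least one probe where the flag version
costs a single bit test, which is exactly the difference a measurement
would have to show to be significant.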

Regards,

Daniel

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-12 23:04                     ` Daniel Phillips
@ 2005-08-13  7:06                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 91+ messages in thread
From: Rafael J. Wysocki @ 2005-08-13  7:06 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: linux-kernel, Martin J. Bligh, Pavel Machek, Nick Piggin,
	Benjamin Herrenschmidt, Linux Memory Management, Hugh Dickins,
	Linus Torvalds, Andrew Morton, Andrea Arcangeli

On Saturday, 13 of August 2005 01:04, Daniel Phillips wrote:
> On Saturday 13 August 2005 08:20, Rafael J. Wysocki wrote:
> > On Friday, 12 of August 2005 21:56, Daniel Phillips wrote:
> > > I still don't see why you can't lift your flags up into the VMA.  The
> > > rmap mechanism is there precisely to let you get from the physical page
> > > to the users and user data, including VMAs.
> >
> > I'm not sure if I understand the issue, but swsusp works on a different
> > level. It only needs to figure out which physical pages, as represented by
> > struct page objects, should be saved to swap before suspend.  We browse all
> > zones (once) and create a list of page frames that should be saved on the
> > basis of the contents of the struct page objects alone.  IMHO if we needed
> > to use any additional mechanisms here, it would be less efficient than just
> > checking the page flags.
> 
> Isn't that what hash tables are for?  It seems to me obvious that you don't 
> absolutely need to reserve page flag bits, but you think this is better, 
> maybe enough faster to make a perceptible difference.  How about testing with 
> a hash table?  If it dims the lights then you have all the argument you need.
> 
> Admittedly, page flags have not gotten really tight just yet, and this is 
> something you can change later if they do become tight.  But it would be very 
> nice to know just which of those page flags are really needed (like uptodate) 
> versus which are just there for convenience.  I think yours fall in the 
> latter category.

Well, I think we can do without PG_nosave in swsusp, although it would require
a considerable effort to remove it.

Greets,
Rafael


-- 
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
		-- Lewis Carroll "Alice's Adventures in Wonderland"

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-11 10:49         ` David Howells
  2005-08-12 19:34           ` Daniel Phillips
@ 2005-08-15 13:15           ` David Howells
  2005-08-16  1:53             ` Daniel Phillips
  2005-08-16 10:28             ` David Howells
  1 sibling, 2 replies; 91+ messages in thread
From: David Howells @ 2005-08-15 13:15 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Howells, Andrew Morton, linux-kernel, linux-mm, hugh

Daniel Phillips <phillips@arcor.de> wrote:

> > Now we already do this at one level: RAM. The page cache _is_ such a cache,
> > but whilst it's much faster than a disk, it is severely restricted in size
> 
> Did you just suggest that 16 TB/address_space is too small to cache NFS pages?

No. I meant that you normally have a lot less RAM than you have, say, local
disk space or the NFS fileserver has available file data to serve.

> > compared to media such as disks, it's more expensive
> 
> It is?

By expensive, I mean moneywise. You can get large disks at under 50p/Gig these
days. If you can get 400GB of RAM for under GBP200, I'd really love to know from
where.

> > and its contents generally don't last over power failure or reboots.
> 
> When used by RAMFS maybe.  But fortunately the page cache has a backing store 
> API, in fact, that is its raison d'etre.

Are you referring to writepage(s)? If so, your statement is irrelevant. The NFS
filesystem does not currently save anything locally, except in the page cache -
in RAM.

Swap is also irrelevant.

> > The major attribute of the page cache is that the CPU can access it
> > directly. 
> 
> You seem to have forgotten about non-resident pages.

Non-resident pages? The page cache doesn't have non-resident pages. Dirty pages
wind up either being written back to the host mapping or get written to swap
when evicted from RAM, and in either case have left the page cache.

Can you be more explicit about what you mean?

> > So we want to add another level: local disk. The FS-Cache/CacheFS patches
> > permit filesystems such as AFS and NFS to use local disk as a cache.
> 
> The page cache already lets you do that.

How? And if you wish to say inode->i_host, you must first consider the problems
with that mechanism.

> I have not yet discerned a fundamental reason why you need to interface to
> another filesystem to implement backing store for an address_space.

Where are you going to place the cache backing store? Are you going to point
the i_host from each NFS inode, say, directly at the i_data of a block device?

From what you said, you aren't going to point i_host at the i_data from an
inode from another filesystem...

> > So, assume that NFS is using a local disk cache (it doesn't matter whether
> > it's CacheFS, CacheFiles, or something else), and assume a process has a
> > file open through NFS.
> >
> > The process attempts to read from the file. This causes the NFS readpage()
> > or readpages() operation to be invoked to load the data into the page cache
> > so that the CPU can make use of it.
> >
> > So the NFS page reading algorithm first consults the disk cache.  Assume 
> > this returns a negative response - NFS will then read from the server into
> > the page cache. Under cacheless operation, it would then unlock the page
> > and the kernel could then let userspace play with it, but we're dealing
> > with a cache, and so the newly fetched data must be stored in the disk
> > cache for future retrieval.
> >
> > NFS now has three choices:
> >
> >  (1) It could instigate a write to the disk cache and wait for that to
> >      complete before unlocking the page and letting userspace see it, but
> >      we don't know how long that might take.
> 
> Pages are typically unlocked while being written to backing store, e.g.:
> 
> http://lxr.linux.no/source/fs/buffer.c#L1839
> 
> What makes NFS special in this regard?

You're asking the wrong question. Nothing makes NFS, AFS, ISOFS special in this
regard. What _is_ special is the procedure they're having to go through to
write data into the cache.

They don't want to extend the readpage-to-unlock time any more than they have
to, which means the data must be written to the cache after the page is
unlocked.

> >      CacheFS immediately dispatches a write BIO to get it DMA'd to the disk
> >      as soon as possible, but something like CacheFiles is dependent on an
> >      underlying filesystem - be it EXT3, ReiserFS, XFS, etc. - to perform
> >      the write, and we've no control over that.
> 
> That is a problem you are in the process of inventing.

No. It's a problem, full stop - unless you're advocating reading the entire
file on open() or read_inode()...

I suppose in one way it's a problem of my inventing: I want to know what state
the cache is in, and so refuse to let userspace corrupt as-yet uncached data.

> >  (2) It could just unlock the page and let userspace scribble on it whilst
> >      simultaneously writing it to the cache. But that means the DMA to the
> >      disk may pick up some of userspace's scribblings, and that means you
> >      can't trust what's in the cache in the event of a power loss.
> 
> I thought I saw a journal in there.

That gives me full filesystem integrity and data integrity on the cache.

> Anyway, if the user has asked for a racy write, that is what they should get.

There are three things for you to consider:

 (1) What happens when the power goes off and comes back on again. Can you
     trust what's in the cache? It may have been modified by a userspace process
     through a MAP_SHARED/PROT_WRITE mapping whilst it was initially being
     DMA'd to the cache, and we would not have recorded this fact.

 (2) The user hasn't asked for a racy write. We can - and by default should -
     maintain data integrity where possible.

 (3) The cache-less behaviour should match the cached behaviour if we can.

> >      This can be alleviated by marking untrustworthy files in the cache,
> >      but that then extends the management time in several ways.
> >
> > 	Time to unlock: CacheMiss + NetRead
> > 	Cache reliable: No
> 
> I think your definition of trustworthy goes beyond what is required by Posix 
> or Linux local filesystem semantics.

Maybe there are other requirements than those of which you're aware. _You_ may
not care, but other people do. Some people have offices full of people with
/usr network mounted. They want to be sure when the power is restored that all
their machines come up with a minimum of fuss - and they certainly don't want
the network to melt down.

Besides, I can have my cake and eat it too, it would seem, from the little
performance testing I've been able to do thus far.

Furthermore, unless you're planning something more exotic than what I am,
you're still going to have to go through the loving hands of _some_ filesystem
or other. CacheFS is just one option; FS-Cache allows for others. There _will_
be an option to cache in files on an already mounted filesystem too, if I can
get it to work, and despite what you may think, it's not as easy as it seems -
not if I want to keep the impact down.

Of course, if you have a better way of doing it, please say! I'll implement it
if I can work out how it's done, and if I think it is better.

> >  (3) It could tell the cache that the page needs writing to disk and then
> >      unlock it for userspace to read, but intercept the change of a PTE
> >      pointing to this page when it loses its write protection (PTEs start
> >      off read-only, generating a write protection fault on the first
> >      write).
> 
> We need to do something like this to implement cross-node caching of
> shared-writeable mmaps.  This is another reason that your ideas need clear 
> explanations: we need to go the rest of the way and get this sorted out for 
> cluster filesystems in general, not just NFS (v4).  It does help a lot that 
> you are attempting to explain what the needs of NFS actually are.  
> Unfortunately, it seems you are proposing that this mechanism is essential 
> even for single-node use, which is far from clear.

It isn't essential. But it improves performance no end, because it lets you
avoid adding the write-to-disk time into the readpage-till-unlock time without
corrupting your cache.

Think of it this way, then:

   Doing, say, an NFS read through a cold disk cache involves two I/O
   operations: one to read the data from the network and the other to write it
   to the disk cache. Not only that, but the two operations _have_ to be
   sequential.

   In the simplest method, you do both operations before releasing the data to
   userspace. This is, however, really slow.

   Another method is to do the read operation, release the data to userspace
   and then write the data to the cache, not caring if userspace changes the
   data before they're written to the cache. This is, however, a real pain to
   recover from after a power failure or a crash.

   A third mechanism is to do the read, release and write in that order, but
   don't permit userspace to _modify_ the data until it has been written to the
   cache. This means you have some idea of the state your cache is in, even
   despite crashes and power failure, but at the expense of holding up writes
   to existing data, be it through a mapping or directly.

I've chosen the third mechanism. Most data read are never modified; most writes
truncate the intended file or create a new one.
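
As a rough sketch of the third mechanism, the page state could be modelled
like this in userspace.  The struct and function names are hypothetical; the
cache_write field merely stands in for the PG_fs_misc bit, and none of this
is the actual NFS or FS-Cache code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of one page; not the kernel's struct page. */
struct cached_page {
	bool locked;		/* page not yet readable by userspace */
	bool cache_write;	/* stands in for the PG_fs_misc bit */
};

/* Network read finished: note the pending cache write, then release. */
static void read_done(struct cached_page *pg)
{
	assert(pg->locked);
	pg->cache_write = true;	/* write to the disk cache is now pending */
	pg->locked = false;	/* readers may proceed immediately */
}

/* Completion callback from the cache: the data is safely on disk. */
static void cache_write_done(struct cached_page *pg)
{
	pg->cache_write = false;
}

/* Reads only need the page to be unlocked. */
static bool may_read(const struct cached_page *pg)
{
	return !pg->locked;
}

/* Writers (mmap write faults, write()) must wait for the cache write. */
static bool may_modify(const struct cached_page *pg)
{
	return !pg->locked && !pg->cache_write;
}
```

In the real thing a writer would sleep in wait_on_page_fs_misc() until the
callback clears the bit, rather than poll a predicate as this model does.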

> >      The interceptor would then force userspace to wait for the cache to
> >      finish DMA'ing the page before writing to it.
> >
> >      Similarly, the write() or prepare_write() operations would wait for
> >      the cache to finish with that page.
> 
> Here you return to the assumption that the VFS should enforce per-page write 
> granularity.  There is no such rule as far as I know.

You mean like readpage(s) and writepage(s) don't suggest page granularity? And
the MMU certainly doesn't enforce it?

Actually, what you said is true, to a certain extent, but the VFS doesn't
currently get a look in on FS-Cache. I'd sort of like it to, but it can live
entirely within the interested filesystem (by which I mean NFS, AFS, ISOFS,
etc.).

I've optimised FS-Cache around pages, yes; but generally that's good enough for
the cache - though it does mean you might end up doing a little more DMA'ing
than you'd really like. However, most writes are a contiguous stream,
starting with the first byte of the file. FS-Cache must be given the size of
the file before the file may be extended that far, and so the caching backend
need only write out what is required.

> > The NFS page_mkwrite() VMA op calls wait_on_page_fs_misc() to wait on that
> > page bit if it is set.
> >
> > > Who is using this interface?
> >
> > AFS and NFS will both use it. There may be others eventually who use it for
> > the same purpose. CacheFS has a different use for it internally.

There's another use for this too: filesystems like EXT3 and JFFS2 could (and
perhaps should?) use page_mkwrite() to deal with ENOSPC by delivering a SIGBUS
rather than letting an unwritable, unreleasable page lurk in memory forever.
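
Purely as an illustration of that idea, and not taken from any filesystem,
the decision could be modelled as below.  The result codes, the struct and
the one-block-per-page accounting are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome codes for this model; not kernel return values. */
enum mkwrite_result { MKWRITE_OK, MKWRITE_SIGBUS };

/* Toy filesystem state: assumes one page needs exactly one block. */
struct model_fs {
	size_t free_blocks;	/* blocks not yet promised to anyone */
};

/*
 * Called when a read-only mapped page is about to become writable.
 * Reserve backing space now, so the page can always be written back later;
 * if there is none, fail the fault instead of letting the page get dirty.
 */
static enum mkwrite_result model_page_mkwrite(struct model_fs *fs)
{
	assert(fs != NULL);
	if (fs->free_blocks == 0)
		return MKWRITE_SIGBUS;
	fs->free_blocks--;
	return MKWRITE_OK;
}
```

The design point is that the space check happens at fault time, when the
process can still be told about the failure, rather than at writeback time,
when it cannot.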

> Let's try to clear up the page write atomicity question, please.  It seems 
> your argument depends on it.

What about it? I want to know when a page is going to be modified so that I can
predict the state of the cache as much as possible. I don't want userspace
processes corrupting the cache in unrecorded ways. I'd very much rather not
have to blow my cache away on booting because it hadn't been shut down cleanly.

David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-15 13:15           ` David Howells
@ 2005-08-16  1:53             ` Daniel Phillips
  2005-08-16 10:28             ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-16  1:53 UTC (permalink / raw)
  To: David Howells; +Cc: Andrew Morton, linux-kernel, linux-mm, hugh

On Monday 15 August 2005 23:15, David Howells wrote:
> I want to know when a page is going to be modified so that I
> can predict the state of the cache as much as possible. I don't want
> userspace processes corrupting the cache in unrecorded ways.

There are two cases:

  1) Metadata.  If anybody is doing racy writes to metadata pages, it is
     your filesystem, and you have a bug.

  2) Data.  In Linux practice and POSIX, racy writes to files have
     undefined semantics, including the possibility that data may end up
     interleaved on a disk block.

You seem to be trying to define (2) as "corruption" and setting out to prevent 
it.  But it is not the responsibility of a filesystem to prevent this, it is 
the responsibility of the application.

Could you please explain why it is not ok to end up with a half-written page 
in your cache, if the client was in fact halfway through writing it when it 
crashed?

Regards,

Daniel

* Re: [RFC][patch 0/2] mm: remove PageReserved
  2005-08-15 13:15           ` David Howells
  2005-08-16  1:53             ` Daniel Phillips
@ 2005-08-16 10:28             ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: David Howells @ 2005-08-16 10:28 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Howells, Andrew Morton, linux-kernel, linux-mm, hugh

Daniel Phillips <phillips@arcor.de> wrote:

> > I want to know when a page is going to be modified so that I
> > can predict the state of the cache as much as possible. I don't want
> > userspace processes corrupting the cache in unrecorded ways.
> 
> There are two cases:
> 
>   1) Metadata.  If anybody is doing racy writes to metadata pages, it is
>      your filesystem, and you have a bug.

I didn't say that I had a problem with metadata. I don't (or shouldn't). I
have done my best to implement filesystem integrity on CacheFS. CacheFiles is
harder to do this for because I have to work through another filesystem to
maintain consistency.

>   2) Data.  In Linux practice and POSIX, racy writes to files have
>      undefined semantics, including the possibility that data may end up
>      interleaved on a disk block.

There are more cases than you are considering. This point can be split into
writing into the cache from a read from the netfs and writing into the cache
from a write to the netfs.

Don't forget that the cache isn't the data backing store (eg: NFS).

> You seem to be trying to define (2) as "corruption" and setting out to
> prevent it.  But it is not the responsibility of a filesystem to prevent
> this, it is the responsibility of the application.
> 
> Could you please explain why it is not ok to end up with a half-written page 
> in your cache, if the client was in fact halfway through writing it when it 
> crashed?

Because then the cache may hold something other than what the server displays,
and once the client computer is back on its feet after a crash it will then
read bad data from the cache.

Of course, the netfs can make the effort to write the half-written data back
to the server upon recovery, _if_ it can work out which that is, and _if_ the
server's idea of the current state hasn't advanced whilst the client was out
of commission (see AFS).

Basically, you've got four choices:

 (1) Prevention.

 (2) Cache invalidation.

 (3) Cache flush.

 (4) Pretend nothing happened.

I really hate the idea of (4) - we can end up with the cache and the server
having two totally different ideas on what the data ought to be because we
couldn't be bothered to fix it up. (3) is tricky as we have to work out what
is different. (2) is easiest - it _is_ a cache after all - but we don't want
to invalidate the _entire_ cache. (1) is relatively cheap in CacheFS.
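
As a rough illustration only, the four options can be modelled as a per-object
policy applied after an unclean shutdown. Everything below is a hypothetical
sketch (the enum, the struct, and recover_object() are invented for this
example); it is not FS-Cache or CacheFS code, and "flush" is reduced to a
version comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the four options listed above. */
enum recovery_policy {
	POLICY_PREVENT,		/* (1) unrecorded changes are never allowed */
	POLICY_INVALIDATE,	/* (2) drop the suspect object from the cache */
	POLICY_FLUSH,		/* (3) write the differing data back to the server */
	POLICY_IGNORE,		/* (4) pretend nothing happened */
};

struct cache_object {
	bool cached;			/* object is still held in the cache */
	bool suspect;			/* may disagree with the server after a crash */
	unsigned long cache_version;	/* server version the cached copy is based on */
};

/*
 * Apply a policy to one object after a crash. Returns true if whatever
 * remains in the cache is known to agree with the server.
 */
bool recover_object(struct cache_object *obj, enum recovery_policy policy,
		    unsigned long server_version)
{
	assert(obj != NULL);

	switch (policy) {
	case POLICY_FLUSH:
		if (obj->cache_version == server_version) {
			/* Server has not advanced: model a successful write-back. */
			obj->suspect = false;
			return true;
		}
		/* Server moved on while we were down: fall back to invalidation. */
		obj->cached = false;
		obj->suspect = false;
		return true;
	case POLICY_INVALIDATE:
		obj->cached = false;	/* only this object, not the whole cache */
		obj->suspect = false;
		return true;
	case POLICY_PREVENT:
		/* Acts before the crash; with it, suspect is never set. */
	case POLICY_IGNORE:
		/* Nothing is repaired; a suspect object stays suspect. */
		break;
	}
	return !obj->suspect;
}
```

Under this model only POLICY_IGNORE can leave a suspect object in the cache,
which is the objection to option (4).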

With FS-Cache as I have implemented it, this is a choice made entirely by the
netfs. All FS-Cache/CacheFS does is wait to be given a page to write and then
tell you when it's written it, and allow you to arbitrarily mark inodes in
their auxiliary data. But that's it. Full stop. The netfs interface is
extremely simple - about as simple as I can make it (it has been revised
recently with this in mind).

Also, it could copy the data before writing, but that has two problems:

 (1) You have to have a page to copy the data into.

 (2) You have to copy the page.

Maintaining cache coherency really isn't fun. You've got a server with a bunch
of totally separate clients that may or may not know about one another and
each of these has a page cache and a local disk cache and a dcache.

David

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-12 12:41         ` David Howells
  2005-08-12 13:28           ` Hugh Dickins
@ 2005-08-16 13:59           ` Pavel Machek
  2005-08-18 14:33           ` David Howells
  2 siblings, 0 replies; 91+ messages in thread
From: Pavel Machek @ 2005-08-16 13:59 UTC (permalink / raw)
  To: David Howells
  Cc: Daniel Phillips, Andrew Morton, linux-kernel, linux-mm, Hugh Dickins

Hi!

> > You also achieved some sort of new low point in the abuse of StudlyCaps
> > there.  Please, let's not get started on mixed case acronyms.
> 
> My patch has been around for quite a while, and no-one else has complained,
> not even you before this point. Plus, you don't seem to be complaining about
> PageSwapCache... nor even PageLocked.

PageFsMisc really *is* ugly and hard to read. PageLocked etc. look
bad too, but ThIs iS rEaLlY WrOnG.

PageMisc would look less ugly, make note in a comment that it is for
filesystems only.

									Pavel
-- 
if you have sharp zaurus hardware you don't need... you know my address

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-12 12:41         ` David Howells
  2005-08-12 13:28           ` Hugh Dickins
  2005-08-16 13:59           ` Pavel Machek
@ 2005-08-18 14:33           ` David Howells
  2005-08-18 22:27             ` Pavel Machek
  2005-08-19 10:04             ` David Howells
  2 siblings, 2 replies; 91+ messages in thread
From: David Howells @ 2005-08-18 14:33 UTC (permalink / raw)
  To: Pavel Machek
  Cc: David Howells, Daniel Phillips, Andrew Morton, linux-kernel,
	linux-mm, Hugh Dickins

Pavel Machek <pavel@ucw.cz> wrote:

> > My patch has been around for quite a while, and no-one else has
> > complained, not even you before this point. Plus, you don't seem to be
> > complaining about PageSwapCache... nor even PageLocked.
> 
> PageFsMisc really *is* ugly and hard to read. PageLocked etc. look
> bad, too but ThIs iS rEaLlY WrOnG.

And PageMappedToDisk()?

I disagree. For the most part weird capsage is wrong, but this is readable.
Whilst I could make it page_fs_misc() instead, that'd be against the style of
the rest of the file, though maybe you want to go through and change all of
that too.

Maybe you'd prefer bPageFsMisc()? :-)

Actually, all these functions should really be called something like
IsPageXxxx() to note they're asking a question rather than giving a command.

> PageMisc would look less ugly

I disagree again. I don't think PageFsMisc() is particularly ugly or
unreadable; and it makes it a touch more likely that someone reading code that
uses it will notice that it's a miscellaneous flag specifically for filesystem
use (you can't rely on them going and looking in the header file for a
comment).

> , make note in a comment that it is for filesystems only.

There should be a comment as well, I suppose. I'll amend the patch for Andrew.

All this should also be documented in Documentation/ somewhere too, I suppose.

David

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-18 14:33           ` David Howells
@ 2005-08-18 22:27             ` Pavel Machek
  2005-08-19 10:04             ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Pavel Machek @ 2005-08-18 22:27 UTC (permalink / raw)
  To: David Howells; +Cc: Daniel Phillips, linux-kernel, linux-mm, Hugh Dickins

Hi!

> > PageMisc would look less ugly
> 
> I disagree again. I don't think PageFsMisc() is particularly ugly or
> unreadable; and it makes it a touch more likely that someone reading code that
> uses it will notice that it's a miscellaneous flag specifically for filesystem
> use (you can't rely on them going and looking in the header file for a
> comment).

Well, is it PageFsMisc or PageFSMisc? Subject gets second variant, and
I like it better, too. (That does not mean I like it).

								Pavel
-- 
if you have sharp zaurus hardware you don't need... you know my address

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-18 14:33           ` David Howells
  2005-08-18 22:27             ` Pavel Machek
@ 2005-08-19 10:04             ` David Howells
  2005-08-19 16:31               ` Daniel Phillips
  2005-08-20 10:45               ` David Howells
  1 sibling, 2 replies; 91+ messages in thread
From: David Howells @ 2005-08-19 10:04 UTC (permalink / raw)
  To: Pavel Machek
  Cc: David Howells, Daniel Phillips, linux-kernel, linux-mm, Hugh Dickins

Pavel Machek <pavel@suse.cz> wrote:

> > I disagree again. I don't think PageFsMisc() is particularly ugly or
> > unreadable; and it makes it a touch more likely that someone reading code
> > that uses it will notice that it's a miscellaneous flag specifically for
> > filesystem use (you can't rely on them going and looking in the header
> > file for a comment).
> 
> Well, is it PageFsMisc or PageFSMisc? Subject gets second variant, and
> I like it better, too. (That does not mean I like it).

The Subject wasn't set by me. Somehow the PageFsMisc variant looks better to
me, but I could just be biased.

David

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-19 10:04             ` David Howells
@ 2005-08-19 16:31               ` Daniel Phillips
  2005-08-20 10:45               ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-19 16:31 UTC (permalink / raw)
  To: David Howells; +Cc: Pavel Machek, linux-kernel, linux-mm, Hugh Dickins

On Friday 19 August 2005 20:04, David Howells wrote:
> Pavel Machek <pavel@suse.cz> wrote:
> > > I disagree again. I don't think PageFsMisc() is particularly ugly or
> > > unreadable; and it makes it a touch more likely that someone reading
> > > code that uses it will notice that it's a miscellaneous flag
> > > specifically for filesystem use (you can't rely on them going and
> > > looking in the header file for a comment).
> >
> > Well, is it PageFsMisc or PageFSMisc? Subject gets second variant, and
> > I like it better, too. (That does not mean I like it).
>
> The Subject wasn't set by me. Somehow the PageFsMisc variant looks better
> to me, but I could just be biased.

Biased.  Fs is a mixed case acronym, nuff said.

Regards,

Daniel

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-19 10:04             ` David Howells
  2005-08-19 16:31               ` Daniel Phillips
@ 2005-08-20 10:45               ` David Howells
  2005-08-20 20:21                 ` Daniel Phillips
  1 sibling, 1 reply; 91+ messages in thread
From: David Howells @ 2005-08-20 10:45 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Howells, Pavel Machek, linux-kernel, linux-mm, Hugh Dickins

Daniel Phillips <phillips@istop.com> wrote:

> Biased.  Fs is a mixed case acronym, nuff said.

But I'm still right:-)

David

* Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
  2005-08-20 10:45               ` David Howells
@ 2005-08-20 20:21                 ` Daniel Phillips
  0 siblings, 0 replies; 91+ messages in thread
From: Daniel Phillips @ 2005-08-20 20:21 UTC (permalink / raw)
  To: David Howells; +Cc: Pavel Machek, linux-kernel, linux-mm, Hugh Dickins

On Saturday 20 August 2005 20:45, David Howells wrote:
> Daniel Phillips <phillips@istop.com> wrote:
> > Biased.  Fs is a mixed case acronym, nuff said.
>
> But I'm still right:-)

Of course you are!  We're only impugning your taste, not your logic ;-)

OK, the questions re your global consistency model are a bazillion times more 
significant.  I have not forgotten about that, please stay tuned.

Regards,

Daniel

end of thread, other threads:[~2005-08-20 20:21 UTC | newest]

Thread overview: 91+ messages
2005-08-07  3:28 [RFC][patch 0/2] mm: remove PageReserved Nick Piggin
2005-08-07  3:29 ` [patch 1/2] mm: remap ZERO_PAGE mappings Nick Piggin
2005-08-07  3:30   ` [patch 2/2] mm: core remove PageReserved Nick Piggin
2005-08-08 21:09 ` [RFC][patch 0/2] mm: " Daniel Phillips
2005-08-08 21:24   ` Daniel Phillips
2005-08-08 21:54     ` Andrew Morton
2005-08-09 23:23       ` [RFC][PATCH] Rename PageChecked as PageMiscFS Daniel Phillips
2005-08-10  7:48         ` Hugh Dickins
2005-08-10  8:06           ` Daniel Phillips
2005-08-10 22:12       ` Daniel Phillips
2005-08-10 22:23         ` Daniel Phillips
2005-08-10 22:34           ` Trond Myklebust
2005-08-10 22:57             ` Daniel Phillips
2005-08-10 23:23               ` Trond Myklebust
2005-08-11  9:42               ` David Howells
2005-08-10 23:42           ` Adrian Bunk
2005-08-11  9:46           ` David Howells
2005-08-12  2:34             ` Daniel Phillips
2005-08-12 12:32             ` David Howells
2005-08-11  9:31         ` David Howells
2005-08-11  9:26       ` David Howells
2005-08-12  3:29         ` Daniel Phillips
2005-08-12 12:41         ` David Howells
2005-08-12 13:28           ` Hugh Dickins
2005-08-16 13:59           ` Pavel Machek
2005-08-18 14:33           ` David Howells
2005-08-18 22:27             ` Pavel Machek
2005-08-19 10:04             ` David Howells
2005-08-19 16:31               ` Daniel Phillips
2005-08-20 10:45               ` David Howells
2005-08-20 20:21                 ` Daniel Phillips
2005-08-10 13:13     ` [RFC][patch 0/2] mm: remove PageReserved David Howells
2005-08-10 13:34       ` Daniel Phillips
2005-08-10 14:27       ` David Howells
2005-08-10 23:19         ` Daniel Phillips
2005-08-11 10:49         ` David Howells
2005-08-12 19:34           ` Daniel Phillips
2005-08-15 13:15           ` David Howells
2005-08-16  1:53             ` Daniel Phillips
2005-08-16 10:28             ` David Howells
2005-08-09  0:15   ` Nick Piggin
2005-08-09  8:51     ` Benjamin Herrenschmidt
2005-08-09  9:49       ` Nick Piggin
2005-08-09 19:19         ` Daniel Phillips
2005-08-09 19:22         ` Daniel Phillips
2005-08-10 21:50           ` Pavel Machek
2005-08-10 21:56             ` Martin J. Bligh
2005-08-11 10:36               ` Rafael J. Wysocki
2005-08-12 19:56                 ` Daniel Phillips
2005-08-12 22:20                   ` Rafael J. Wysocki
2005-08-12 23:04                     ` Daniel Phillips
2005-08-13  7:06                       ` Rafael J. Wysocki
2005-08-11 10:26             ` Rafael J. Wysocki
2005-08-09 11:25       ` Hugh Dickins
2005-08-09 14:31         ` Benjamin Herrenschmidt
2005-08-09 14:50           ` Hugh Dickins
2005-08-09 14:49             ` Benjamin Herrenschmidt
2005-08-09 15:36               ` Hugh Dickins
2005-08-09 21:27                 ` Daniel Phillips
2005-08-09 19:14     ` Daniel Phillips
2005-08-09 20:17       ` Hugh Dickins
2005-08-09 20:52         ` Daniel Phillips
2005-08-09  4:39   ` Nigel Cunningham
2005-08-09  4:59     ` Nick Piggin
2005-08-09  5:11       ` Nigel Cunningham
2005-08-09  5:20         ` Nick Piggin
2005-08-09  5:30           ` Nigel Cunningham
2005-08-09  7:08       ` Russell King
2005-08-09  8:38         ` Arjan van de Ven
2005-08-09  9:31           ` Nick Piggin
2005-08-09  9:49             ` Arjan van de Ven
2005-08-09  9:57               ` Nick Piggin
2005-08-09 10:24             ` Rafael J. Wysocki
2005-08-09  8:53         ` Benjamin Herrenschmidt
2005-08-09  9:15         ` Hugh Dickins
2005-08-09 10:27           ` Nick Piggin
2005-08-09 11:15             ` Hugh Dickins
2005-08-09 13:15               ` Nick Piggin
2005-08-09 13:26                 ` Arjan van de Ven
2005-08-09 14:28               ` Benjamin Herrenschmidt
2005-08-09 14:47                 ` Hugh Dickins
2005-08-09 19:49           ` Roman Zippel
2005-08-09  9:29         ` Nick Piggin
2005-08-09 19:40           ` Russell King
2005-08-09 14:38         ` Martin J. Bligh
2005-08-09 19:41           ` Russell King
2005-08-09 20:51             ` Linus Torvalds
2005-08-09 21:16             ` Martin J. Bligh
2005-08-09 21:51               ` Martin J. Bligh
2005-08-10  9:27             ` Benjamin Herrenschmidt
2005-08-11  9:09               ` Nick Piggin
