* [patch][rfc] 0/5: remove PageReserved @ 2005-06-23 7:05 Nick Piggin 2005-06-23 7:06 ` [patch][rfc] 1/5: comment for mm/rmap.c Nick Piggin 0 siblings, 1 reply; 19+ messages in thread From: Nick Piggin @ 2005-06-23 7:05 UTC (permalink / raw) To: linux-kernel, Linux Memory Management; +Cc: Hugh Dickins, Badari Pulavarty Hi, The following set of patches removes PageReserved from core kernel code, and clears the way for the page flag to be completely removed once it is removed from all arch/ code. Drivers are mostly fairly trivial, but will need auditing. Arch maintainers and driver writers will need to help get that done. Actually, only patches 4 and 5 are really required - the first 3 are very minor things I noticed along the way (but I'm putting them in this series because they have clashes). Not quite ready for merging yet, although it probably will be after the next round of comments. It boots and runs on i386, ppc64 and ia64; it has not been tested elsewhere. -- SUSE Labs, Novell Inc.
* [patch][rfc] 1/5: comment for mm/rmap.c 2005-06-23 7:05 [patch][rfc] 0/5: remove PageReserved Nick Piggin @ 2005-06-23 7:06 ` Nick Piggin 2005-06-23 7:06 ` [patch][rfc] 2/5: micro optimisation " Nick Piggin 0 siblings, 1 reply; 19+ messages in thread From: Nick Piggin @ 2005-06-23 7:06 UTC (permalink / raw) To: linux-kernel, Linux Memory Management; +Cc: Hugh Dickins, Badari Pulavarty [-- Attachment #1: Type: text/plain, Size: 4 bytes --] 1/5 [-- Attachment #2: mm-comment-rmap.patch --] [-- Type: text/plain, Size: 709 bytes --] Just to be clear that VM_RESERVED pages here are a bug, and the test is not there because they are expected. Signed-off-by: Nick Piggin <npiggin@suse.de> Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -532,6 +532,8 @@ static int try_to_unmap_one(struct page * If the page is mlock()d, we cannot swap it out. * If it's recently referenced (perhaps page_referenced * skipped over this mm) then we should reactivate it. + * + * Pages belonging to VM_RESERVED regions should not happen here. */ if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) || ptep_clear_flush_young(vma, address, pte)) {
* [patch][rfc] 2/5: micro optimisation for mm/rmap.c 2005-06-23 7:06 ` [patch][rfc] 1/5: comment for mm/rmap.c Nick Piggin @ 2005-06-23 7:06 ` Nick Piggin 2005-06-23 7:07 ` [patch][rfc] 3/5: remove atomic bitop when freeing page Nick Piggin 2005-06-23 7:26 ` [patch][rfc] 2/5: micro optimisation for mm/rmap.c William Lee Irwin III 0 siblings, 2 replies; 19+ messages in thread From: Nick Piggin @ 2005-06-23 7:06 UTC (permalink / raw) To: linux-kernel, Linux Memory Management; +Cc: Hugh Dickins, Badari Pulavarty [-- Attachment #1: Type: text/plain, Size: 4 bytes --] 2/5 [-- Attachment #2: mm-microopt-rmap.patch --] [-- Type: text/plain, Size: 1394 bytes --] Microoptimise page_add_anon_rmap. Although these expressions are used only in the taken branch of the if() statement, the compiler can't reorder them inside because atomic_inc_and_test is a barrier. Signed-off-by: Nick Piggin <npiggin@suse.de> Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -442,22 +442,23 @@ int page_referenced(struct page *page, i void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address) { - struct anon_vma *anon_vma = vma->anon_vma; - pgoff_t index; - BUG_ON(PageReserved(page)); - BUG_ON(!anon_vma); inc_mm_counter(vma->vm_mm, anon_rss); - anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; - index = (address - vma->vm_start) >> PAGE_SHIFT; - index += vma->vm_pgoff; - index >>= PAGE_CACHE_SHIFT - PAGE_SHIFT; - if (atomic_inc_and_test(&page->_mapcount)) { - page->index = index; + struct anon_vma *anon_vma = vma->anon_vma; + pgoff_t index; + + BUG_ON(!anon_vma); + anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; page->mapping = (struct address_space *) anon_vma; + + index = (address - vma->vm_start) >> PAGE_SHIFT; + index += vma->vm_pgoff; + index >>= PAGE_CACHE_SHIFT - PAGE_SHIFT; + page->index = index; + inc_page_state(nr_mapped); } /* else checking page index 
and mapping is racy */
* [patch][rfc] 3/5: remove atomic bitop when freeing page 2005-06-23 7:06 ` [patch][rfc] 2/5: micro optimisation " Nick Piggin @ 2005-06-23 7:07 ` Nick Piggin 2005-06-23 7:07 ` [patch][rfc] 4/5: remap ZERO_PAGE mappings Nick Piggin 2005-06-23 7:26 ` [patch][rfc] 2/5: micro optimisation for mm/rmap.c William Lee Irwin III 1 sibling, 1 reply; 19+ messages in thread From: Nick Piggin @ 2005-06-23 7:07 UTC (permalink / raw) To: linux-kernel, Linux Memory Management; +Cc: Hugh Dickins, Badari Pulavarty [-- Attachment #1: Type: text/plain, Size: 4 bytes --] 3/5 [-- Attachment #2: mm-remove-atomic-bitop.patch --] [-- Type: text/plain, Size: 1223 bytes --] This bitop does not need to be atomic because it is performed when there will be no references to the page (ie. the page is being freed). Signed-off-by: Nick Piggin <npiggin@suse.de> Index: linux-2.6/mm/page_alloc.c =================================================================== --- linux-2.6.orig/mm/page_alloc.c +++ linux-2.6/mm/page_alloc.c @@ -329,7 +329,7 @@ static inline void free_pages_check(cons 1 << PG_writeback ))) bad_page(function, page); if (PageDirty(page)) - ClearPageDirty(page); + __ClearPageDirty(page); } /* Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h +++ linux-2.6/include/linux/page-flags.h @@ -194,6 +194,7 @@ extern void __mod_page_state(unsigned lo #define SetPageDirty(page) set_bit(PG_dirty, &(page)->flags) #define TestSetPageDirty(page) test_and_set_bit(PG_dirty, &(page)->flags) #define ClearPageDirty(page) clear_bit(PG_dirty, &(page)->flags) +#define __ClearPageDirty(page) __clear_bit(PG_dirty, &(page)->flags) #define TestClearPageDirty(page) test_and_clear_bit(PG_dirty, &(page)->flags) #define SetPageLRU(page) set_bit(PG_lru, &(page)->flags)
* Re: [patch][rfc] 4/5: remap ZERO_PAGE mappings 2005-06-23 7:07 ` [patch][rfc] 3/5: remove atomic bitop when freeing page Nick Piggin @ 2005-06-23 7:07 ` Nick Piggin 2005-06-23 7:08 ` [patch][rfc] 5/5: core remove PageReserved Nick Piggin 0 siblings, 1 reply; 19+ messages in thread From: Nick Piggin @ 2005-06-23 7:07 UTC (permalink / raw) To: linux-kernel, Linux Memory Management; +Cc: Hugh Dickins, Badari Pulavarty [-- Attachment #1: Type: text/plain, Size: 4 bytes --] 4/5 [-- Attachment #2: mm-remap-ZERO_PAGE-mappings.patch --] [-- Type: text/plain, Size: 1081 bytes --] Remap ZERO_PAGE ptes when remapping memory. This is currently just an optimisation for MIPS, which is the only architecture with multiple zero pages - it now retains the mapping it needs for good cache performance, and do_wp_page is now always able to correctly detect and optimise zero page COW faults. In future, this will be required in order to always be able to detect whether a pte points to a ZERO_PAGE using only the (pte, vaddr) pair. Signed-off-by: Nick Piggin <npiggin@suse.de> Index: linux-2.6/mm/mremap.c =================================================================== --- linux-2.6.orig/mm/mremap.c +++ linux-2.6/mm/mremap.c @@ -141,6 +141,10 @@ move_one_page(struct vm_area_struct *vma if (dst) { pte_t pte; pte = ptep_clear_flush(vma, old_addr, src); + /* ZERO_PAGE can be dependent on virtual addr */ + if (pfn_valid(pte_pfn(pte)) && + pte_page(pte) == ZERO_PAGE(old_addr)) + pte = pte_wrprotect(mk_pte(ZERO_PAGE(new_addr), new_vma->vm_page_prot)); set_pte_at(mm, new_addr, dst, pte); } else error = -ENOMEM;
* [patch][rfc] 5/5: core remove PageReserved 2005-06-23 7:07 ` [patch][rfc] 4/5: remap ZERO_PAGE mappings Nick Piggin @ 2005-06-23 7:08 ` Nick Piggin 2005-06-23 9:51 ` William Lee Irwin III 0 siblings, 1 reply; 19+ messages in thread From: Nick Piggin @ 2005-06-23 7:08 UTC (permalink / raw) To: linux-kernel, Linux Memory Management; +Cc: Hugh Dickins, Badari Pulavarty [-- Attachment #1: Type: text/plain, Size: 4 bytes --] 5/5 [-- Attachment #2: mm-core-remove-PageReserved.patch --] [-- Type: text/plain, Size: 24220 bytes --] Remove PageReserved() calls from core code by tightening VM_RESERVED handling in mm/ to cover PageReserved functionality. PageReserved special casing is removed from get_page and put_page. All setting and clearing of PageReserved is retained, and it is now flagged in the page_alloc checks to help ensure we don't introduce any refcount based freeing of Reserved pages. MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being deprecated. We never completely handled it correctly anyway, and it is difficult to handle nicely - difficult but not impossible; it could be reintroduced in future if required (Hugh has a proof of concept). Once PageReserved() calls are removed from all arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can be trivially removed. Many thanks to Hugh Dickins for input. 
Signed-off-by: Nick Piggin <npiggin@suse.de> Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -156,7 +156,8 @@ extern unsigned int kobjsize(const void #define VM_DONTCOPY 0x00020000 /* Do not copy this vma on fork */ #define VM_DONTEXPAND 0x00040000 /* Cannot expand with mremap() */ -#define VM_RESERVED 0x00080000 /* Don't unmap it from swap_out */ +#define VM_RESERVED 0x00080000 /* Pages in region aren't managed with regular pagecache routines (replaces PageReserved), don't unmap it */ + #define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */ #define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */ #define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */ @@ -337,7 +338,7 @@ static inline void get_page(struct page static inline void put_page(struct page *page) { - if (!PageReserved(page) && put_page_testzero(page)) + if (put_page_testzero(page)) __page_cache_release(page); } @@ -618,6 +619,7 @@ void install_arg_page(struct vm_area_str int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int len, int write, int force, struct page **pages, struct vm_area_struct **vmas); +void print_invalid_pfn(const char *, pte_t, unsigned long, unsigned long); int __set_page_dirty_buffers(struct page *page); int __set_page_dirty_nobuffers(struct page *page); Index: linux-2.6/mm/madvise.c =================================================================== --- linux-2.6.orig/mm/madvise.c +++ linux-2.6/mm/madvise.c @@ -122,7 +122,7 @@ static long madvise_dontneed(struct vm_a unsigned long start, unsigned long end) { *prev = vma; - if ((vma->vm_flags & VM_LOCKED) || is_vm_hugetlb_page(vma)) + if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) || is_vm_hugetlb_page(vma)) return -EINVAL; if (unlikely(vma->vm_flags & VM_NONLINEAR)) { Index: linux-2.6/mm/memory.c 
=================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -333,6 +333,21 @@ out: } /* + * This function is called to print an error when a pte in a + * !VM_RESERVED region is found pointing to an invalid pfn (which + * is an error. + * + * The calling function must still handle the error. + */ +void print_invalid_pfn(const char *errfunc, pte_t pte, + unsigned long vm_flags, unsigned long vaddr) +{ + printk(KERN_ERR "%s: pte does not point to valid memory. " + "process = %s, pte = %08lx, vm_flags = %lx, vaddr = %lx\n", + errfunc, current->comm, (long)pte_val(pte), vm_flags, vaddr); +} + +/* * copy one vm_area from one task to the other. Assumes the page tables * already present in the new task to be cleared in the whole range * covered by this vma. @@ -361,25 +376,29 @@ copy_one_pte(struct mm_struct *dst_mm, s spin_unlock(&mmlist_lock); } } - set_pte_at(dst_mm, addr, dst_pte, pte); - return; + goto out_set_pte; } + /* If the region is VM_RESERVED, the mapping is not + * mapped via rmap - duplicate the pte as is. + */ + if (vm_flags & VM_RESERVED) + goto out_set_pte; + + /* If the pte points outside of valid memory but + * the region is not VM_RESERVED, we have a problem. + */ pfn = pte_pfn(pte); - /* the pte points outside of valid memory, the - * mapping is assumed to be good, meaningful - * and not mapped via rmap - duplicate the - * mapping as is. - */ - page = NULL; - if (pfn_valid(pfn)) - page = pfn_to_page(pfn); - - if (!page || PageReserved(page)) { - set_pte_at(dst_mm, addr, dst_pte, pte); - return; + if (unlikely(!pfn_valid(pfn))) { + print_invalid_pfn("copy_one_pte", pte, vm_flags, addr); + goto out_set_pte; /* try to do something sane */ } + page = pfn_to_page(pfn); + /* Mappings to zero pages aren't covered by rmap either. 
*/ + if (page == ZERO_PAGE(addr)) + goto out_set_pte; + /* * If it's a COW mapping, write protect it both * in the parent and the child @@ -400,8 +419,9 @@ copy_one_pte(struct mm_struct *dst_mm, s inc_mm_counter(dst_mm, rss); if (PageAnon(page)) inc_mm_counter(dst_mm, anon_rss); - set_pte_at(dst_mm, addr, dst_pte, pte); page_dup_rmap(page); +out_set_pte: + set_pte_at(dst_mm, addr, dst_pte, pte); } static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, @@ -514,7 +534,8 @@ int copy_page_range(struct mm_struct *ds return 0; } -static void zap_pte_range(struct mmu_gather *tlb, pmd_t *pmd, +static void zap_pte_range(struct mmu_gather *tlb, + struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, struct zap_details *details) { @@ -528,10 +549,15 @@ static void zap_pte_range(struct mmu_gat if (pte_present(ptent)) { struct page *page = NULL; unsigned long pfn = pte_pfn(ptent); - if (pfn_valid(pfn)) { - page = pfn_to_page(pfn); - if (PageReserved(page)) - page = NULL; + if (!(vma->vm_flags & VM_RESERVED)) { + if (unlikely(!pfn_valid(pfn))) { + print_invalid_pfn("zap_pte_range", + ptent, vma->vm_flags, addr); + } else { + page = pfn_to_page(pfn); + if (page == ZERO_PAGE(addr)) + page = NULL; + } } if (unlikely(details) && page) { /* @@ -584,7 +610,8 @@ static void zap_pte_range(struct mmu_gat pte_unmap(pte - 1); } -static inline void zap_pmd_range(struct mmu_gather *tlb, pud_t *pud, +static inline void zap_pmd_range(struct mmu_gather *tlb, + struct vm_area_struct *vma, pud_t *pud, unsigned long addr, unsigned long end, struct zap_details *details) { @@ -596,11 +623,12 @@ static inline void zap_pmd_range(struct next = pmd_addr_end(addr, end); if (pmd_none_or_clear_bad(pmd)) continue; - zap_pte_range(tlb, pmd, addr, next, details); + zap_pte_range(tlb, vma, pmd, addr, next, details); } while (pmd++, addr = next, addr != end); } -static inline void zap_pud_range(struct mmu_gather *tlb, pgd_t *pgd, +static inline void 
zap_pud_range(struct mmu_gather *tlb, + struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr, unsigned long end, struct zap_details *details) { @@ -612,7 +640,7 @@ static inline void zap_pud_range(struct next = pud_addr_end(addr, end); if (pud_none_or_clear_bad(pud)) continue; - zap_pmd_range(tlb, pud, addr, next, details); + zap_pmd_range(tlb, vma, pud, addr, next, details); } while (pud++, addr = next, addr != end); } @@ -633,7 +661,7 @@ static void unmap_page_range(struct mmu_ next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) continue; - zap_pud_range(tlb, pgd, addr, next, details); + zap_pud_range(tlb, vma, pgd, addr, next, details); } while (pgd++, addr = next, addr != end); tlb_end_vma(tlb, vma); } @@ -980,8 +1008,7 @@ int get_user_pages(struct task_struct *t if (pages) { pages[i] = page; flush_dcache_page(page); - if (!PageReserved(page)) - page_cache_get(page); + page_cache_get(page); } if (vmas) vmas[i] = vma; @@ -1085,8 +1112,7 @@ static int remap_pte_range(struct mm_str return -ENOMEM; do { BUG_ON(!pte_none(*pte)); - if (!pfn_valid(pfn) || PageReserved(pfn_to_page(pfn))) - set_pte_at(mm, addr, pte, pfn_pte(pfn, prot)); + set_pte_at(mm, addr, pte, pfn_pte(pfn, prot)); pfn++; } while (pte++, addr += PAGE_SIZE, addr != end); pte_unmap(pte - 1); @@ -1225,6 +1251,8 @@ static int do_wp_page(struct mm_struct * unsigned long pfn = pte_pfn(pte); pte_t entry; + BUG_ON(vma->vm_flags & VM_RESERVED); + if (unlikely(!pfn_valid(pfn))) { /* * This should really halt the system so it can be debugged or @@ -1232,9 +1260,8 @@ static int do_wp_page(struct mm_struct * * data, but for the moment just pretend this is OOM. 
*/ pte_unmap(page_table); - printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n", - address); spin_unlock(&mm->page_table_lock); + print_invalid_pfn("do_wp_page", pte, vma->vm_flags, address); return VM_FAULT_OOM; } old_page = pfn_to_page(pfn); @@ -1259,13 +1286,16 @@ static int do_wp_page(struct mm_struct * /* * Ok, we need to copy. Oh, well.. */ - if (!PageReserved(old_page)) + if (old_page == ZERO_PAGE(address)) + old_page = NULL; + else page_cache_get(old_page); + spin_unlock(&mm->page_table_lock); if (unlikely(anon_vma_prepare(vma))) goto no_new_page; - if (old_page == ZERO_PAGE(address)) { + if (old_page == NULL) { new_page = alloc_zeroed_user_highpage(vma, address); if (!new_page) goto no_new_page; @@ -1281,12 +1311,13 @@ static int do_wp_page(struct mm_struct * spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); if (likely(pte_same(*page_table, pte))) { - if (PageAnon(old_page)) - dec_mm_counter(mm, anon_rss); - if (PageReserved(old_page)) + if (old_page == NULL) inc_mm_counter(mm, rss); - else + else { page_remove_rmap(old_page); + if (PageAnon(old_page)) + dec_mm_counter(mm, anon_rss); + } flush_cache_page(vma, address, pfn); break_cow(vma, new_page, address, page_table); lru_cache_add_active(new_page); @@ -1296,13 +1327,16 @@ static int do_wp_page(struct mm_struct * new_page = old_page; } pte_unmap(page_table); - page_cache_release(new_page); - page_cache_release(old_page); spin_unlock(&mm->page_table_lock); + if (old_page) { + page_cache_release(new_page); + page_cache_release(old_page); + } return VM_FAULT_MINOR; no_new_page: - page_cache_release(old_page); + if (old_page) + page_cache_release(old_page); return VM_FAULT_OOM; } @@ -1739,7 +1773,7 @@ do_anonymous_page(struct mm_struct *mm, struct page * page = ZERO_PAGE(addr); /* Read-only mapping of ZERO_PAGE. 
*/ - entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); + entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot)); /* ..except if it's a write access */ if (write_access) { @@ -1878,9 +1912,6 @@ retry: */ /* Only go through if we didn't race with anybody else... */ if (pte_none(*page_table)) { - if (!PageReserved(new_page)) - inc_mm_counter(mm, rss); - flush_icache_page(vma, new_page); entry = mk_pte(new_page, vma->vm_page_prot); if (write_access) @@ -1889,8 +1920,10 @@ retry: if (anon) { lru_cache_add_active(new_page); page_add_anon_rmap(new_page, vma, address); - } else + } else if (!(vma->vm_flags & VM_RESERVED)) { page_add_file_rmap(new_page); + inc_mm_counter(mm, rss); + } pte_unmap(page_table); } else { /* One of our sibling threads was faster, back out. */ Index: linux-2.6/mm/page_alloc.c =================================================================== --- linux-2.6.orig/mm/page_alloc.c +++ linux-2.6/mm/page_alloc.c @@ -113,7 +113,8 @@ static void bad_page(const char *functio 1 << PG_reclaim | 1 << PG_slab | 1 << PG_swapcache | - 1 << PG_writeback); + 1 << PG_writeback | + 1 << PG_reserved ); set_page_count(page, 0); reset_page_mapcount(page); page->mapping = NULL; @@ -243,7 +244,6 @@ static inline int page_is_buddy(struct p { if (PagePrivate(page) && (page_order(page) == order) && - !PageReserved(page) && page_count(page) == 0) return 1; return 0; @@ -326,7 +326,8 @@ static inline void free_pages_check(cons 1 << PG_reclaim | 1 << PG_slab | 1 << PG_swapcache | - 1 << PG_writeback ))) + 1 << PG_writeback | + 1 << PG_reserved ))) bad_page(function, page); if (PageDirty(page)) __ClearPageDirty(page); @@ -454,7 +455,8 @@ static void prep_new_page(struct page *p 1 << PG_reclaim | 1 << PG_slab | 1 << PG_swapcache | - 1 << PG_writeback ))) + 1 << PG_writeback | + 1 << PG_reserved ))) bad_page(__FUNCTION__, page); page->flags &= ~(1 << PG_uptodate | 1 << PG_error | @@ -1017,7 +1019,7 @@ void __pagevec_free(struct pagevec *pvec fastcall void 
__free_pages(struct page *page, unsigned int order) { - if (!PageReserved(page) && put_page_testzero(page)) { + if (put_page_testzero(page)) { if (order == 0) free_hot_page(page); else @@ -1654,7 +1656,7 @@ void __init memmap_init_zone(unsigned lo for (page = start; page < (start + size); page++) { set_page_zone(page, NODEZONE(nid, zone)); - set_page_count(page, 0); + set_page_count(page, 1); reset_page_mapcount(page); SetPageReserved(page); INIT_LIST_HEAD(&page->lru); Index: linux-2.6/mm/swap.c =================================================================== --- linux-2.6.orig/mm/swap.c +++ linux-2.6/mm/swap.c @@ -48,7 +48,7 @@ void put_page(struct page *page) } return; } - if (!PageReserved(page) && put_page_testzero(page)) + if (put_page_testzero(page)) __page_cache_release(page); } EXPORT_SYMBOL(put_page); @@ -215,7 +215,7 @@ void release_pages(struct page **pages, struct page *page = pages[i]; struct zone *pagezone; - if (PageReserved(page) || !put_page_testzero(page)) + if (!put_page_testzero(page)) continue; pagezone = page_zone(page); Index: linux-2.6/mm/fremap.c =================================================================== --- linux-2.6.orig/mm/fremap.c +++ linux-2.6/mm/fremap.c @@ -29,18 +29,21 @@ static inline void zap_pte(struct mm_str return; if (pte_present(pte)) { unsigned long pfn = pte_pfn(pte); + struct page *page; flush_cache_page(vma, addr, pfn); pte = ptep_clear_flush(vma, addr, ptep); - if (pfn_valid(pfn)) { - struct page *page = pfn_to_page(pfn); - if (!PageReserved(page)) { - if (pte_dirty(pte)) - set_page_dirty(page); - page_remove_rmap(page); - page_cache_release(page); - dec_mm_counter(mm, rss); - } + if (unlikely(!pfn_valid(pfn))) { + print_invalid_pfn("zap_pte", pte, vma->vm_flags, addr); + return; + } + page = pfn_to_page(pfn); + if (page != ZERO_PAGE(addr)) { + if (pte_dirty(pte)) + set_page_dirty(page); + page_remove_rmap(page); + dec_mm_counter(mm, rss); + page_cache_release(page); } } else { if (!pte_file(pte)) @@ -65,6 
+68,8 @@ int install_page(struct mm_struct *mm, s pgd_t *pgd; pte_t pte_val; + BUG_ON(vma->vm_flags & VM_RESERVED); + pgd = pgd_offset(mm, addr); spin_lock(&mm->page_table_lock); @@ -122,6 +127,8 @@ int install_file_pte(struct mm_struct *m pgd_t *pgd; pte_t pte_val; + BUG_ON(vma->vm_flags & VM_RESERVED); + pgd = pgd_offset(mm, addr); spin_lock(&mm->page_table_lock); Index: linux-2.6/mm/msync.c =================================================================== --- linux-2.6.orig/mm/msync.c +++ linux-2.6/mm/msync.c @@ -37,11 +37,12 @@ static void sync_pte_range(struct vm_are if (!pte_maybe_dirty(*pte)) continue; pfn = pte_pfn(*pte); - if (!pfn_valid(pfn)) + if (unlikely(!pfn_valid(pfn))) { + print_invalid_pfn("sync_pte_range", *pte, + vma->vm_flags, addr); continue; + } page = pfn_to_page(pfn); - if (PageReserved(page)) - continue; if (ptep_clear_flush_dirty(vma, addr, pte) || page_test_and_clear_dirty(page)) @@ -149,6 +150,9 @@ static int msync_interval(struct vm_area if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED)) return -EBUSY; + if (vma->vm_flags & VM_RESERVED) + return -EINVAL; + if (file && (vma->vm_flags & VM_SHARED)) { filemap_sync(vma, addr, end); Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -442,8 +442,6 @@ int page_referenced(struct page *page, i void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address) { - BUG_ON(PageReserved(page)); - inc_mm_counter(vma->vm_mm, anon_rss); if (atomic_inc_and_test(&page->_mapcount)) { @@ -473,8 +471,7 @@ void page_add_anon_rmap(struct page *pag void page_add_file_rmap(struct page *page) { BUG_ON(PageAnon(page)); - if (!pfn_valid(page_to_pfn(page)) || PageReserved(page)) - return; + BUG_ON(!pfn_valid(page_to_pfn(page))); if (atomic_inc_and_test(&page->_mapcount)) inc_page_state(nr_mapped); @@ -488,8 +485,6 @@ void page_add_file_rmap(struct page *pag */ void 
page_remove_rmap(struct page *page) { - BUG_ON(PageReserved(page)); - if (atomic_add_negative(-1, &page->_mapcount)) { BUG_ON(page_mapcount(page) < 0); /* @@ -647,13 +642,14 @@ static void try_to_unmap_cluster(unsigne continue; pfn = pte_pfn(*pte); - if (!pfn_valid(pfn)) + if (unlikely(!pfn_valid(pfn))) { + print_invalid_pfn("try_to_unmap_cluster", *pte, + vma->vm_flags, address); continue; + } page = pfn_to_page(pfn); BUG_ON(PageAnon(page)); - if (PageReserved(page)) - continue; if (ptep_clear_flush_young(vma, address, pte)) continue; @@ -816,7 +812,6 @@ int try_to_unmap(struct page *page) { int ret; - BUG_ON(PageReserved(page)); BUG_ON(!PageLocked(page)); if (PageAnon(page)) Index: linux-2.6/drivers/scsi/sg.c =================================================================== --- linux-2.6.orig/drivers/scsi/sg.c +++ linux-2.6/drivers/scsi/sg.c @@ -1887,9 +1887,10 @@ st_unmap_user_pages(struct scatterlist * int i; for (i=0; i < nr_pages; i++) { - if (dirtied && !PageReserved(sgl[i].page)) + if (dirtied) SetPageDirty(sgl[i].page); /* unlock_page(sgl[i].page); */ + /* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */ /* FIXME: cache flush missing for rw==READ * FIXME: call the correct reference counting function */ Index: linux-2.6/drivers/scsi/st.c =================================================================== --- linux-2.6.orig/drivers/scsi/st.c +++ linux-2.6/drivers/scsi/st.c @@ -4435,8 +4435,9 @@ static int sgl_unmap_user_pages(struct s int i; for (i=0; i < nr_pages; i++) { - if (dirtied && !PageReserved(sgl[i].page)) + if (dirtied) SetPageDirty(sgl[i].page); + /* FIXME: XXX don't dirty/unmap VM_RESERVED regions? 
*/ /* FIXME: cache flush missing for rw==READ * FIXME: call the correct reference counting function */ Index: linux-2.6/sound/core/pcm_native.c =================================================================== --- linux-2.6.orig/sound/core/pcm_native.c +++ linux-2.6/sound/core/pcm_native.c @@ -2942,8 +2942,7 @@ static struct page * snd_pcm_mmap_status return NOPAGE_OOM; runtime = substream->runtime; page = virt_to_page(runtime->status); - if (!PageReserved(page)) - get_page(page); + get_page(page); if (type) *type = VM_FAULT_MINOR; return page; @@ -2985,8 +2984,7 @@ static struct page * snd_pcm_mmap_contro return NOPAGE_OOM; runtime = substream->runtime; page = virt_to_page(runtime->control); - if (!PageReserved(page)) - get_page(page); + get_page(page); if (type) *type = VM_FAULT_MINOR; return page; @@ -3059,8 +3057,7 @@ static struct page *snd_pcm_mmap_data_no vaddr = runtime->dma_area + offset; page = virt_to_page(vaddr); } - if (!PageReserved(page)) - get_page(page); + get_page(page); if (type) *type = VM_FAULT_MINOR; return page; Index: linux-2.6/mm/mmap.c =================================================================== --- linux-2.6.orig/mm/mmap.c +++ linux-2.6/mm/mmap.c @@ -1073,6 +1073,17 @@ munmap_back: error = file->f_op->mmap(file, vma); if (error) goto unmap_and_free_vma; + if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED)) + == (VM_WRITE | VM_RESERVED)) { + printk(KERN_WARNING "program %s is using MAP_PRIVATE, " + "PROT_WRITE mmap of VM_RESERVED memory, which " + "is deprecated. 
Please report this to " + "linux-kernel@vger.kernel.org\n",current->comm); + if (vma->vm_ops && vma->vm_ops->close) + vma->vm_ops->close(vma); + error = -EACCES; + goto unmap_and_free_vma; + } } else if (vm_flags & VM_SHARED) { error = shmem_zero_setup(vma); if (error) Index: linux-2.6/mm/mprotect.c =================================================================== --- linux-2.6.orig/mm/mprotect.c +++ linux-2.6/mm/mprotect.c @@ -131,6 +131,14 @@ mprotect_fixup(struct vm_area_struct *vm return -ENOMEM; newflags |= VM_ACCOUNT; } + if (oldflags & VM_RESERVED) { + BUG_ON(oldflags & VM_WRITE); + printk(KERN_WARNING "program %s is using MAP_PRIVATE, " + "PROT_WRITE mprotect of VM_RESERVED memory, " + "which is deprecated. Please report this to " + "linux-kernel@vger.kernel.org\n",current->comm); + return -EACCES; + } } newprot = protection_map[newflags & 0xf]; Index: linux-2.6/mm/bootmem.c =================================================================== --- linux-2.6.orig/mm/bootmem.c +++ linux-2.6/mm/bootmem.c @@ -286,6 +286,7 @@ static unsigned long __init free_all_boo if (j + 16 < BITS_PER_LONG) prefetchw(page + j + 16); __ClearPageReserved(page + j); + set_page_count(page + j, 0); } __free_pages(page, order); i += BITS_PER_LONG; Index: linux-2.6/mm/mempolicy.c =================================================================== --- linux-2.6.orig/mm/mempolicy.c +++ linux-2.6/mm/mempolicy.c @@ -253,8 +253,10 @@ static int check_pte_range(struct mm_str if (!pte_present(*pte)) continue; pfn = pte_pfn(*pte); - if (!pfn_valid(pfn)) + if (unlikely(!pfn_valid(pfn))) { + print_invalid_pfn("check_pte_range", *pte, -1UL, addr); continue; + } nid = pfn_to_nid(pfn); if (!test_bit(nid, nodes)) break; @@ -326,6 +328,8 @@ check_range(struct mm_struct *mm, unsign first = find_vma(mm, start); if (!first) return ERR_PTR(-EFAULT); + if (first->vm_flags & VM_RESERVED) + return ERR_PTR(-EACCESS); prev = NULL; for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) { if 
(!vma->vm_next && vma->vm_end < end) Index: linux-2.6/kernel/power/swsusp.c =================================================================== --- linux-2.6.orig/kernel/power/swsusp.c +++ linux-2.6/kernel/power/swsusp.c @@ -433,15 +433,12 @@ static int save_highmem_zone(struct zone continue; page = pfn_to_page(pfn); /* + * PageReserved(page) - * This condition results from rvmalloc() sans vmalloc_32() * and architectural memory reservations. This should be * corrected eventually when the cases giving rise to this * are better understood. */ - if (PageReserved(page)) { - printk("highmem reserved page?!\n"); - continue; - } BUG_ON(PageNosave(page)); if (PageNosaveFree(page)) continue; @@ -527,13 +524,8 @@ static int saveable(struct zone * zone, return 0; page = pfn_to_page(pfn); - BUG_ON(PageReserved(page) && PageNosave(page)); if (PageNosave(page)) return 0; - if (PageReserved(page) && pfn_is_nosave(pfn)) { - pr_debug("[nosave pfn 0x%lx]", pfn); - return 0; - } if (PageNosaveFree(page)) return 0; Index: linux-2.6/arch/ppc64/kernel/vdso.c =================================================================== --- linux-2.6.orig/arch/ppc64/kernel/vdso.c +++ linux-2.6/arch/ppc64/kernel/vdso.c @@ -175,13 +175,6 @@ static struct page * vdso_vma_nopage(str if (address < vma->vm_start || address > vma->vm_end) return NOPAGE_SIGBUS; - /* - * Last page is systemcfg, special handling here, no get_page() a - * this is a reserved page - */ - if ((vma->vm_end - address) <= PAGE_SIZE) - return virt_to_page(systemcfg); - pg = virt_to_page(vbase + offset); get_page(pg); DBG(" ->page count: %d\n", page_count(pg)); ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [patch][rfc] 5/5: core remove PageReserved 2005-06-23 7:08 ` [patch][rfc] 5/5: core remove PageReserved Nick Piggin @ 2005-06-23 9:51 ` William Lee Irwin III 2005-06-23 10:32 ` Nick Piggin 2005-06-24 4:50 ` Andrew Morton 0 siblings, 2 replies; 19+ messages in thread From: William Lee Irwin III @ 2005-06-23 9:51 UTC (permalink / raw) To: Nick Piggin Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > @@ -337,7 +338,7 @@ static inline void get_page(struct page > static inline void put_page(struct page *page) > { > - if (!PageReserved(page) && put_page_testzero(page)) > + if (put_page_testzero(page)) > __page_cache_release(page); > } No sweep before this? I'm not so sure. On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > @@ -514,7 +534,8 @@ int copy_page_range(struct mm_struct *ds > return 0; > } > > -static void zap_pte_range(struct mmu_gather *tlb, pmd_t *pmd, > +static void zap_pte_range(struct mmu_gather *tlb, > + struct vm_area_struct *vma, pmd_t *pmd, > unsigned long addr, unsigned long end, > struct zap_details *details) > { As exciting as this is, !!(vma->vm_flags & VM_RESERVED) could trivially go into struct zap_details without excess args or diff. On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > @@ -1225,6 +1251,8 @@ static int do_wp_page(struct mm_struct * > unsigned long pfn = pte_pfn(pte); > pte_t entry; > > + BUG_ON(vma->vm_flags & VM_RESERVED); > + > if (unlikely(!pfn_valid(pfn))) { > /* > * This should really halt the system so it can be debugged or !pfn_valid(pfn) is banned when !(vma->vm_flags & VM_RESERVED); here we BUG_ON the precondition then handle !pfn_valid(pfn) in the old manner where some new infrastructure has been erected for reporting such errors. On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > @@ -1259,13 +1286,16 @@ static int do_wp_page(struct mm_struct * > /* > * Ok, we need to copy. Oh, well.. 
> */ > - if (!PageReserved(old_page)) > + if (old_page == ZERO_PAGE(address)) > + old_page = NULL; > + else > page_cache_get(old_page); > + > spin_unlock(&mm->page_table_lock); > > if (unlikely(anon_vma_prepare(vma))) > goto no_new_page; > - if (old_page == ZERO_PAGE(address)) { > + if (old_page == NULL) { > new_page = alloc_zeroed_user_highpage(vma, address); > if (!new_page) > goto no_new_page; There are some micro-optimizations mixed in with this and some subsequent do_wp_page() alterations. On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > @@ -1654,7 +1656,7 @@ void __init memmap_init_zone(unsigned lo > > for (page = start; page < (start + size); page++) { > set_page_zone(page, NODEZONE(nid, zone)); > - set_page_count(page, 0); > + set_page_count(page, 1); > reset_page_mapcount(page); > SetPageReserved(page); > INIT_LIST_HEAD(&page->lru); The refcounting and PG_reserved activity in memmap_init_zone() is superfluous. bootmem.c does all the necessary accounting internally. 
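[Editor's sketch: the two hunks quoted above fit together — memmap_init_zone() now hands every page to the core with count 1, and free_all_bootmem_core() drops the count back to 0 just before the buddy free, because put_page() no longer special-cases PageReserved. A toy userspace model of that invariant; all names here are invented for the model and none of this is kernel code:]

```c
/* Model of the refcount invariant: with the PageReserved test gone
 * from put_page(), every page the core can see must carry a real
 * initial reference, and the final put releases the page. */
struct page_model {
	int count;
	int freed;
};

void init_reserved_page(struct page_model *p)
{
	p->count = 1;		/* was 0 before the patch */
	p->freed = 0;
}

void get_page_model(struct page_model *p)
{
	p->count++;
}

void put_page_model(struct page_model *p)
{
	/* put_page_testzero() analogue: last put frees the page */
	if (--p->count == 0)
		p->freed = 1;
}

/* Returns 1 iff n gets followed by n puts leaves the page alive,
 * and dropping the initial reference then frees it. */
int balanced_refs_free_page(int gets)
{
	struct page_model p;
	int i;

	init_reserved_page(&p);
	for (i = 0; i < gets; i++)
		get_page_model(&p);
	for (i = 0; i < gets; i++)
		put_page_model(&p);
	if (p.freed)
		return 0;	/* must not free while init ref is held */
	put_page_model(&p);	/* bootmem-style drop of the init ref */
	return p.freed;
}
```

[The sketch only illustrates why the initial count must be 1, not 0, once the reserved-page escape hatch is gone from put_page().]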
On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > Index: linux-2.6/mm/fremap.c > =================================================================== > --- linux-2.6.orig/mm/fremap.c > +++ linux-2.6/mm/fremap.c > @@ -29,18 +29,21 @@ static inline void zap_pte(struct mm_str > return; > if (pte_present(pte)) { > unsigned long pfn = pte_pfn(pte); > + struct page *page; > > flush_cache_page(vma, addr, pfn); > pte = ptep_clear_flush(vma, addr, ptep); > - if (pfn_valid(pfn)) { > - struct page *page = pfn_to_page(pfn); > - if (!PageReserved(page)) { > - if (pte_dirty(pte)) > - set_page_dirty(page); > - page_remove_rmap(page); > - page_cache_release(page); > - dec_mm_counter(mm, rss); > - } > + if (unlikely(!pfn_valid(pfn))) { > + print_invalid_pfn("zap_pte", pte, vma->vm_flags, addr); > + return; > + } > + page = pfn_to_page(pfn); > + if (page != ZERO_PAGE(addr)) { > + if (pte_dirty(pte)) > + set_page_dirty(page); > + page_remove_rmap(page); > + dec_mm_counter(mm, rss); > + page_cache_release(page); > } > } else { > if (!pte_file(pte)) There is no error returned here to be handled by the caller. On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > @@ -65,6 +68,8 @@ int install_page(struct mm_struct *mm, s > pgd_t *pgd; > pte_t pte_val; > > + BUG_ON(vma->vm_flags & VM_RESERVED); > + > pgd = pgd_offset(mm, addr); > spin_lock(&mm->page_table_lock); > > @@ -122,6 +127,8 @@ int install_file_pte(struct mm_struct *m > pgd_t *pgd; > pte_t pte_val; > > + BUG_ON(vma->vm_flags & VM_RESERVED); > + > pgd = pgd_offset(mm, addr); > spin_lock(&mm->page_table_lock); This has no effect but to artificially constrain the interface. There is no a priori reason to avoid the use of install_page() to deposit mappings to normal pages in VM_RESERVED vmas. It's only the reverse being "banned" here. Similar comments also apply to zap_pte(). 
On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > Index: linux-2.6/drivers/scsi/sg.c > =================================================================== > --- linux-2.6.orig/drivers/scsi/sg.c > +++ linux-2.6/drivers/scsi/sg.c > @@ -1887,9 +1887,10 @@ st_unmap_user_pages(struct scatterlist * > int i; > > for (i=0; i < nr_pages; i++) { > - if (dirtied && !PageReserved(sgl[i].page)) > + if (dirtied) > SetPageDirty(sgl[i].page); > /* unlock_page(sgl[i].page); */ > + /* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */ > /* FIXME: cache flush missing for rw==READ > * FIXME: call the correct reference counting function > */ An answer should be devised for this. My numerous SCSI CD-ROM devices (I have 5 across several different machines of several different arches) are rather unlikely to be happy with /* FIXME: XXX ... as an answer. On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > Index: linux-2.6/drivers/scsi/st.c > =================================================================== > --- linux-2.6.orig/drivers/scsi/st.c > +++ linux-2.6/drivers/scsi/st.c > @@ -4435,8 +4435,9 @@ static int sgl_unmap_user_pages(struct s > int i; > > for (i=0; i < nr_pages; i++) { > - if (dirtied && !PageReserved(sgl[i].page)) > + if (dirtied) > SetPageDirty(sgl[i].page); > + /* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */ > /* FIXME: cache flush missing for rw==READ > * FIXME: call the correct reference counting function > */ Mutatis mutandis for my SCSI tape drive. 
On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > Index: linux-2.6/sound/core/pcm_native.c > =================================================================== > --- linux-2.6.orig/sound/core/pcm_native.c > +++ linux-2.6/sound/core/pcm_native.c > @@ -2942,8 +2942,7 @@ static struct page * snd_pcm_mmap_status > return NOPAGE_OOM; > runtime = substream->runtime; > page = virt_to_page(runtime->status); > - if (!PageReserved(page)) > - get_page(page); > + get_page(page); > if (type) > *type = VM_FAULT_MINOR; > return page; snd_malloc_pages() marks all pages it allocates PG_reserved. This merits some commentary, and likely the removal of the superfluous PG_reserved usage. On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > Index: linux-2.6/mm/mmap.c > =================================================================== > --- linux-2.6.orig/mm/mmap.c > +++ linux-2.6/mm/mmap.c > @@ -1073,6 +1073,17 @@ munmap_back: > error = file->f_op->mmap(file, vma); > if (error) > goto unmap_and_free_vma; > + if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED)) > + == (VM_WRITE | VM_RESERVED)) { > + printk(KERN_WARNING "program %s is using MAP_PRIVATE, " > + "PROT_WRITE mmap of VM_RESERVED memory, which " > + "is deprecated. Please report this to " > + "linux-kernel@vger.kernel.org\n",current->comm); > + if (vma->vm_ops && vma->vm_ops->close) > + vma->vm_ops->close(vma); > + error = -EACCES; > + goto unmap_and_free_vma; > + } > } else if (vm_flags & VM_SHARED) { > error = shmem_zero_setup(vma); > if (error) This is user-triggerable where the driver honors mmap() protection flags blindly. 
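[Editor's sketch: the mask test in the quoted mm/mmap.c hunk selects exactly the MAP_PRIVATE, PROT_WRITE mapping of VM_RESERVED memory — shared or read-only mappings pass through. A userspace model; the flag values are copied from 2.6-era headers, the function name is invented:]

```c
/* vm_flags bits, as in the 2.6-era kernel headers */
#define VM_WRITE	0x00000002
#define VM_SHARED	0x00000008
#define VM_RESERVED	0x00080000

/* Returns nonzero only for a private (non-VM_SHARED), writable
 * mapping of VM_RESERVED memory -- the deprecated case the patch
 * rejects with -EACCES. */
int private_write_of_reserved(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED))
			== (VM_WRITE | VM_RESERVED);
}
```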
On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > Index: linux-2.6/mm/bootmem.c > =================================================================== > --- linux-2.6.orig/mm/bootmem.c > +++ linux-2.6/mm/bootmem.c > @@ -286,6 +286,7 @@ static unsigned long __init free_all_boo > if (j + 16 < BITS_PER_LONG) > prefetchw(page + j + 16); > __ClearPageReserved(page + j); > + set_page_count(page + j, 0); > } > __free_pages(page, order); > i += BITS_PER_LONG; ibid re: bootmem On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > Index: linux-2.6/kernel/power/swsusp.c > =================================================================== > --- linux-2.6.orig/kernel/power/swsusp.c > +++ linux-2.6/kernel/power/swsusp.c > @@ -433,15 +433,12 @@ static int save_highmem_zone(struct zone > continue; > page = pfn_to_page(pfn); > /* > + * PageReserved(page) - > * This condition results from rvmalloc() sans vmalloc_32() > * and architectural memory reservations. This should be > * corrected eventually when the cases giving rise to this > * are better understood. > */ > - if (PageReserved(page)) { > - printk("highmem reserved page?!\n"); > - continue; > - } > BUG_ON(PageNosave(page)); > if (PageNosaveFree(page)) > continue; This behavioral change needs to be commented on. There are some additional difficulties when memory holes are unintentionally covered by mem_map[]; It is beneficial otherwise. It's likely to triplefault on such holes. 
On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > @@ -527,13 +524,8 @@ static int saveable(struct zone * zone, > return 0; > > page = pfn_to_page(pfn); > - BUG_ON(PageReserved(page) && PageNosave(page)); > if (PageNosave(page)) > return 0; > - if (PageReserved(page) && pfn_is_nosave(pfn)) { > - pr_debug("[nosave pfn 0x%lx]", pfn); > - return 0; > - } > if (PageNosaveFree(page)) > return 0; The pfn_is_nosave() check must stand barring justification of why unintentionally saving (and hence restoring) the page is tolerable. -- wli
* Re: [patch][rfc] 5/5: core remove PageReserved 2005-06-23 9:51 ` William Lee Irwin III @ 2005-06-23 10:32 ` Nick Piggin 2005-06-23 22:08 ` William Lee Irwin III 2005-06-24 4:50 ` Andrew Morton 1 sibling, 1 reply; 19+ messages in thread From: Nick Piggin @ 2005-06-23 10:32 UTC (permalink / raw) To: William Lee Irwin III Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty William Lee Irwin III wrote: > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > >>@@ -337,7 +338,7 @@ static inline void get_page(struct page >> static inline void put_page(struct page *page) >> { >>- if (!PageReserved(page) && put_page_testzero(page)) >>+ if (put_page_testzero(page)) >> __page_cache_release(page); >> } > > > No sweep before this? I'm not so sure. > Sweep what? There is a warning that will get triggered in the case we free a PageReserved page through this path. > > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > >>@@ -514,7 +534,8 @@ int copy_page_range(struct mm_struct *ds >> return 0; >> } >> >>-static void zap_pte_range(struct mmu_gather *tlb, pmd_t *pmd, >>+static void zap_pte_range(struct mmu_gather *tlb, >>+ struct vm_area_struct *vma, pmd_t *pmd, >> unsigned long addr, unsigned long end, >> struct zap_details *details) >> { > > > As exciting as this is, !!(vma->vm_flags & VM_RESERVED) could trivially > go into struct zap_details without excess args or diff. > Actually, it isn't trivial. I thought of trying that. 
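[Editor's sketch: the alternative being debated — carrying the one predicate zap_pte_range() needs inside zap_details rather than threading the vma down — looks roughly like this. The struct and names are invented for illustration; the real zap_details carries several other fields:]

```c
#define VM_RESERVED 0x00080000	/* 2.6-era vm_flags bit */

struct zap_details_model {
	int reserved_vma;	/* !!(vma->vm_flags & VM_RESERVED) */
};

int details_reserved_vma(unsigned long vm_flags)
{
	struct zap_details_model d;

	/* "!!" collapses the masked bit to exactly 0 or 1 */
	d.reserved_vma = !!(vm_flags & VM_RESERVED);
	return d.reserved_vma;
}
```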
> > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > >>@@ -1225,6 +1251,8 @@ static int do_wp_page(struct mm_struct * >> unsigned long pfn = pte_pfn(pte); >> pte_t entry; >> >>+ BUG_ON(vma->vm_flags & VM_RESERVED); >>+ >> if (unlikely(!pfn_valid(pfn))) { >> /* >> * This should really halt the system so it can be debugged or > > > !pfn_valid(pfn) is banned when !(vma->vm_flags & VM_RESERVED); here we > BUG_ON the precondition then handle !pfn_valid(pfn) in the old manner > where some new infrastructure has been erected for reporting such errors. > No, it uses print_invalid_pfn too. > > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > >>@@ -1259,13 +1286,16 @@ static int do_wp_page(struct mm_struct * >> /* >> * Ok, we need to copy. Oh, well.. >> */ >>- if (!PageReserved(old_page)) >>+ if (old_page == ZERO_PAGE(address)) >>+ old_page = NULL; >>+ else >> page_cache_get(old_page); >>+ >> spin_unlock(&mm->page_table_lock); >> >> if (unlikely(anon_vma_prepare(vma))) >> goto no_new_page; >>- if (old_page == ZERO_PAGE(address)) { >>+ if (old_page == NULL) { >> new_page = alloc_zeroed_user_highpage(vma, address); >> if (!new_page) >> goto no_new_page; > > > There are some micro-optimizations mixed in with this and some > subsequent do_wp_page() alterations. > We no longer page_cache_get ZERO_PAGE, ever. That I changed it so in a more optimal way than it was is surely acceptable? > > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > >>@@ -1654,7 +1656,7 @@ void __init memmap_init_zone(unsigned lo >> >> for (page = start; page < (start + size); page++) { >> set_page_zone(page, NODEZONE(nid, zone)); >>- set_page_count(page, 0); >>+ set_page_count(page, 1); >> reset_page_mapcount(page); >> SetPageReserved(page); >> INIT_LIST_HEAD(&page->lru); > > > The refcounting and PG_reserved activity in memmap_init_zone() is > superfluous. bootmem.c does all the necessary accounting internally. > Well not superfluous yet. 
It is kept around to support all the arch code that still uses it (not much, mainly reserved memory reporting). > > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > >>Index: linux-2.6/mm/fremap.c >>=================================================================== >>--- linux-2.6.orig/mm/fremap.c >>+++ linux-2.6/mm/fremap.c >>@@ -29,18 +29,21 @@ static inline void zap_pte(struct mm_str >> return; >> if (pte_present(pte)) { >> unsigned long pfn = pte_pfn(pte); >>+ struct page *page; >> >> flush_cache_page(vma, addr, pfn); >> pte = ptep_clear_flush(vma, addr, ptep); >>- if (pfn_valid(pfn)) { >>- struct page *page = pfn_to_page(pfn); >>- if (!PageReserved(page)) { >>- if (pte_dirty(pte)) >>- set_page_dirty(page); >>- page_remove_rmap(page); >>- page_cache_release(page); >>- dec_mm_counter(mm, rss); >>- } >>+ if (unlikely(!pfn_valid(pfn))) { >>+ print_invalid_pfn("zap_pte", pte, vma->vm_flags, addr); >>+ return; >>+ } >>+ page = pfn_to_page(pfn); >>+ if (page != ZERO_PAGE(addr)) { >>+ if (pte_dirty(pte)) >>+ set_page_dirty(page); >>+ page_remove_rmap(page); >>+ dec_mm_counter(mm, rss); >>+ page_cache_release(page); >> } >> } else { >> if (!pte_file(pte)) > > > There is no error returned here to be handled by the caller. > That's OK, the pte has been cleared. Nothing else we can do. > > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > >>@@ -65,6 +68,8 @@ int install_page(struct mm_struct *mm, s >> pgd_t *pgd; >> pte_t pte_val; >> >>+ BUG_ON(vma->vm_flags & VM_RESERVED); >>+ >> pgd = pgd_offset(mm, addr); >> spin_lock(&mm->page_table_lock); >> >>@@ -122,6 +127,8 @@ int install_file_pte(struct mm_struct *m >> pgd_t *pgd; >> pte_t pte_val; >> >>+ BUG_ON(vma->vm_flags & VM_RESERVED); >>+ >> pgd = pgd_offset(mm, addr); >> spin_lock(&mm->page_table_lock); > > > This has no effect but to artificially constrain the interface. There > is no a priori reason to avoid the use of install_page() to deposit > mappings to normal pages in VM_RESERVED vmas. 
It's only the reverse > being "banned" here. Similar comments also apply to zap_pte(). > No, install_page is playing with the page (eg. page_add_file_rmap) which is explicitly banned even before my PageReserved removal. It is unclear that this ever safely worked for normal pages, and will hit NULL page dereferences if trying to do it with iomem. > > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > >>Index: linux-2.6/drivers/scsi/sg.c >>=================================================================== >>--- linux-2.6.orig/drivers/scsi/sg.c >>+++ linux-2.6/drivers/scsi/sg.c >>@@ -1887,9 +1887,10 @@ st_unmap_user_pages(struct scatterlist * >> int i; >> >> for (i=0; i < nr_pages; i++) { >>- if (dirtied && !PageReserved(sgl[i].page)) >>+ if (dirtied) >> SetPageDirty(sgl[i].page); >> /* unlock_page(sgl[i].page); */ >>+ /* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */ >> /* FIXME: cache flush missing for rw==READ >> * FIXME: call the correct reference counting function >> */ > > > An answer should be devised for this. My numerous SCSI CD-ROM devices > (I have 5 across several different machines of several different arches) > are rather unlikely to be happy with /* FIXME: XXX ... as an answer. > The worst it will do is dirty a VM_RESERVED page. So it is going to work unless you're thinking about doing something crazy like mmap /dev/mem and send your CDROM some pages from there. But yeah, I have to find someone who knows what they're doing to look at this.
> > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > >>Index: linux-2.6/drivers/scsi/st.c >>=================================================================== >>--- linux-2.6.orig/drivers/scsi/st.c >>+++ linux-2.6/drivers/scsi/st.c >>@@ -4435,8 +4435,9 @@ static int sgl_unmap_user_pages(struct s >> int i; >> >> for (i=0; i < nr_pages; i++) { >>- if (dirtied && !PageReserved(sgl[i].page)) >>+ if (dirtied) >> SetPageDirty(sgl[i].page); >>+ /* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */ >> /* FIXME: cache flush missing for rw==READ >> * FIXME: call the correct reference counting function >> */ > > > Mutatis mutandis for my SCSI tape drive. > > > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > >>Index: linux-2.6/sound/core/pcm_native.c >>=================================================================== >>--- linux-2.6.orig/sound/core/pcm_native.c >>+++ linux-2.6/sound/core/pcm_native.c >>@@ -2942,8 +2942,7 @@ static struct page * snd_pcm_mmap_status >> return NOPAGE_OOM; >> runtime = substream->runtime; >> page = virt_to_page(runtime->status); >>- if (!PageReserved(page)) >>- get_page(page); >>+ get_page(page); >> if (type) >> *type = VM_FAULT_MINOR; >> return page; > > > snd_malloc_pages() marks all pages it allocates PG_reserved. This > merits some commentary, and likely the removal of the superfluous > PG_reserved usage. > Sure, but not in this patch. The aim here is just to eliminate special casing of refcounting. Other PG_reserved usage can stay around for the moment (and is actually good for catching errors). 
> > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > >>Index: linux-2.6/mm/mmap.c >>=================================================================== >>--- linux-2.6.orig/mm/mmap.c >>+++ linux-2.6/mm/mmap.c >>@@ -1073,6 +1073,17 @@ munmap_back: >> error = file->f_op->mmap(file, vma); >> if (error) >> goto unmap_and_free_vma; >>+ if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED)) >>+ == (VM_WRITE | VM_RESERVED)) { >>+ printk(KERN_WARNING "program %s is using MAP_PRIVATE, " >>+ "PROT_WRITE mmap of VM_RESERVED memory, which " >>+ "is deprecated. Please report this to " >>+ "linux-kernel@vger.kernel.org\n",current->comm); >>+ if (vma->vm_ops && vma->vm_ops->close) >>+ vma->vm_ops->close(vma); >>+ error = -EACCES; >>+ goto unmap_and_free_vma; >>+ } >> } else if (vm_flags & VM_SHARED) { >> error = shmem_zero_setup(vma); >> if (error) > > > This is user-triggerable where the driver honors mmap() protection > flags blindly. > If the user is allowed write access to VM_RESERVED memory, then I suspect there is a lot worse they can do than flood the log. But the check isn't going to stay around forever. > > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > >>Index: linux-2.6/mm/bootmem.c >>=================================================================== >>--- linux-2.6.orig/mm/bootmem.c >>+++ linux-2.6/mm/bootmem.c >>@@ -286,6 +286,7 @@ static unsigned long __init free_all_boo >> if (j + 16 < BITS_PER_LONG) >> prefetchw(page + j + 16); >> __ClearPageReserved(page + j); >>+ set_page_count(page + j, 0); >> } >> __free_pages(page, order); >> i += BITS_PER_LONG; > > > ibid re: bootmem > Not yet. 
> > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > >>Index: linux-2.6/kernel/power/swsusp.c >>=================================================================== >>--- linux-2.6.orig/kernel/power/swsusp.c >>+++ linux-2.6/kernel/power/swsusp.c >>@@ -433,15 +433,12 @@ static int save_highmem_zone(struct zone >> continue; >> page = pfn_to_page(pfn); >> /* >>+ * PageReserved(page) - >> * This condition results from rvmalloc() sans vmalloc_32() >> * and architectural memory reservations. This should be >> * corrected eventually when the cases giving rise to this >> * are better understood. >> */ >>- if (PageReserved(page)) { >>- printk("highmem reserved page?!\n"); >>- continue; >>- } >> BUG_ON(PageNosave(page)); >> if (PageNosaveFree(page)) >> continue; > > > This behavioral change needs to be commented on. There are some additional > difficulties when memory holes are unintentionally covered by mem_map[]; > It is beneficial otherwise. It's likely to triplefault on such holes. > It seems the author of this code themselves didn't really understand what was going on here, so I'm buggered if I can be bothered :) Remember though, PageReserved can stay around for as long as we need, so this hunk can be trivially reverted if it is an immediate problem. > > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > >>@@ -527,13 +524,8 @@ static int saveable(struct zone * zone, >> return 0; >> >> page = pfn_to_page(pfn); >>- BUG_ON(PageReserved(page) && PageNosave(page)); >> if (PageNosave(page)) >> return 0; >>- if (PageReserved(page) && pfn_is_nosave(pfn)) { >>- pr_debug("[nosave pfn 0x%lx]", pfn); >>- return 0; >>- } >> if (PageNosaveFree(page)) >> return 0; > > > The pfn_is_nosave() check must stand barring justification of why > unintentionally saving (and hence restoring) the page is tolerable. > I thought the PageNosave should catch that, and the next line is just a debug check. 
But I'm looking for someone who *actually* knows how swsusp works, if anyone would like to volunteer :) Thanks very much for the review! -- SUSE Labs, Novell Inc.
* Re: [patch][rfc] 5/5: core remove PageReserved 2005-06-23 10:32 ` Nick Piggin @ 2005-06-23 22:08 ` William Lee Irwin III 2005-06-23 23:21 ` Nick Piggin 0 siblings, 1 reply; 19+ messages in thread From: William Lee Irwin III @ 2005-06-23 22:08 UTC (permalink / raw) To: Nick Piggin Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty William Lee Irwin III wrote: >> As exciting as this is, !!(vma->vm_flags & VM_RESERVED) could trivially >> go into struct zap_details without excess args or diff. On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > Actually, it isn't trivial. I thought of trying that. You're probably thinking of the zap_page_range() vs. unmap_page_range() topmost-level entrypoints as being the only places to set the flag in details. Unconditionally clobbering the field in details in zap_pud_range() should resolve any issues associated with that. This is obviously a style and/or diff-minimization issue. William Lee Irwin III wrote: >> The refcounting and PG_reserved activity in memmap_init_zone() is >> superfluous. bootmem.c does all the necessary accounting internally. On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > Well not superfluous yet. It is kept around to support all the arch > code that still uses it (not much, mainly reserved memory reporting). That activity is outside memmap_init_zone(). Specifically, the page structures are not used for anything whatsoever before bootmem puts them into the buddy allocators. It is not particularly interesting or difficult to defer even the initialization to the precise point in time bootmem cares to perform buddy bitmap insertion. This goes for all fields of struct page. What's notable about PG_reserved and refcounts here is that memmap_init_zone() goes about flipping bits one way where free_all_bootmem_core() undoes all its work. William Lee Irwin III wrote: >> There is no error returned here to be handled by the caller. 
On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > That's OK, the pte has been cleared. Nothing else we can do. This is called in the interior of a loop, which may be beneficial to terminate if this intended semantic is to be enforced. Furthermore, no error is propagated to the caller, which is not the desired effect in the stated error reporting scheme. So the code is inconsistent with explicitly stated intention. William Lee Irwin III wrote: >> This has no effect but to artificially constrain the interface. There >> is no a priori reason to avoid the use of install_page() to deposit >> mappings to normal pages in VM_RESERVED vmas. It's only the reverse >> being "banned" here. Similar comments also apply to zap_pte(). On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > No, install_page is playing with the page (eg. page_add_file_rmap) > which is explicity banned even before my PageReserved removal. It > is unclear that this ever safely worked for normal pages, and will > hit NULL page dereferences if trying to do it with iomem. You are going on about the fact that install_page() can't be used on memory outside mem_map[] as it requires a page structure, and can't be used on reserved pages because page_add_file_rmap() will BUG. This case is not being discussed. The issue at stake is inserting normal pages into a VM_RESERVED vma. These will arise as e.g. kernel-allocated buffers managed by normal reference counting. remap_pfn_range() can't do it; it refuses to operate on "valid memory". install_page() now won't do it; it refuses to touch a VM_RESERVED vma. So this creates a giant semantic hole, and potentially breaks working code (i.e. if you were going to do this you would need not only a replacement but also a sweep to adjust all the drivers doing it or prove their nonexistence). William Lee Irwin III wrote: >> An answer should be devised for this. 
My numerous SCSI CD-ROM devices >> (I have 5 across several different machines of several different arches) >> are rather unlikely to be happy with /* FIXME: XXX ... as an answer. On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > The worst it will do is dirty a VM_RESERVED page. So it is going > to work unless you're thinking about doing something crazy like > mmap /dev/mem and send your CDROM some pages from there. But yeah, > I have to find someone who knows what they're doing to look at this. Then you should replace FIXME: XXX with this explanation. By and large the presence of "FIXME: XXX" is a sign there is a gigantic hole in the code. It should definitely not be done with new code, but rather, exclusively confined to documenting discoveries of preexisting brokenness. After all, if a patch is introducing broken code, why would we merge it? Best to adjust the commentary and avoid that question altogether. There are actually larger questions about this than the reserved page handling. If e.g. pagecache pages need to be dirtied the raw bitflag toggling is probably not how it should be done. William Lee Irwin III wrote: >> snd_malloc_pages() marks all pages it allocates PG_reserved. This >> merits some commentary, and likely the removal of the superfluous >> PG_reserved usage. On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > Sure, but not in this patch. The aim here is just to eliminate special > casing of refcounting. Other PG_reserved usage can stay around for the > moment (and is actually good for catching errors). Unfortunately for this scheme, it's very much a case of putting the cart before the horse. PG_reserved is toggled at random in this driver after this change, to no useful effect (debugging or otherwise). And this really goes for the whole affair. Diddling the core first is just going to create bugs. Converting the users first is the way these things need to be done. 
When complete, nothing needs the core flags twiddling anymore and you just nuke the flag twiddling from the core. William Lee Irwin III wrote: >> This is user-triggerable where the driver honors mmap() protection >> flags blindly. On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > If the user is allowed write access to VM_RESERVED memory, then I > suspect there is a lot worse they can do than flood the log. > But the check isn't going to stay around forever. This doesn't really do a whole lot of good for the unwitting user who invokes a privileged application relying on such kernel behavior. User-visible changes need to be taken on with somewhat more care (okay, vastly more, along with long-term backward compatibility). William Lee Irwin III wrote: >> This behavioral change needs to be commented on. There are some additional >> difficulties when memory holes are unintentionally covered by mem_map[]; >> It is beneficial otherwise. It's likely to triplefault on such holes. On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > It seems the author of this code themselves didn't really understand > what was going on here, so I'm buggered if I can be bothered :) > Remember though, PageReserved can stay around for as long as we need, > so this hunk can be trivially reverted if it is an immediate problem. This doesn't really fly. A fair number of drivers are poorly-understood and numerous gyrations have to be gone through to avoid breaking their possible assumptions until at long last clarifications are made (that is, if they're ever made). swsusp is not demotable to below their status on a whim. The failure mode must also be considered. Users are highly unlikely to generate comprehensible bugreports from triplefaults. It's also rather easy to see this case will arise on every non-NUMA ia32 machine with > 4GB RAM, and that we dare not attempt to save or restore such pages during suspend/resume cycles. 
William Lee Irwin III wrote: >> The pfn_is_nosave() check must stand barring justification of why >> unintentionally saving (and hence restoring) the page is tolerable. On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > I thought the PageNosave should catch that, and the next line is > just a debug check. But I'm looking for someone who *actually* knows > how swsusp works, if anyone would like to volunteer :) This one is readily observable. It's doing a bounds check on the pfn. This corresponds to the __nosave region being assigned struct pages to cover it. We can't get by with accidentally saving and restoring it. -- wli
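[Editor's sketch: pfn_is_nosave() as wli describes it is just a bounds check against the pfn range spanned by the __nosave section. The boundary values below are invented for the model; in the kernel they are derived from the __nosave section's linker symbols:]

```c
/* Invented stand-ins for the pfns of the __nosave section bounds */
static unsigned long nosave_begin_pfn = 0x100;
static unsigned long nosave_end_pfn   = 0x108;

/* A pfn is "nosave" iff it falls inside the [begin, end) range */
int pfn_is_nosave_model(unsigned long pfn)
{
	return pfn >= nosave_begin_pfn && pfn < nosave_end_pfn;
}
```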
* Re: [patch][rfc] 5/5: core remove PageReserved 2005-06-23 22:08 ` William Lee Irwin III @ 2005-06-23 23:21 ` Nick Piggin 2005-06-24 0:59 ` William Lee Irwin III 0 siblings, 1 reply; 19+ messages in thread From: Nick Piggin @ 2005-06-23 23:21 UTC (permalink / raw) To: William Lee Irwin III Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty William Lee Irwin III wrote: > William Lee Irwin III wrote: > >>>As exciting as this is, !!(vma->vm_flags & VM_RESERVED) could trivially >>>go into struct zap_details without excess args or diff. > > > On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > >>Actually, it isn't trivial. I thought of trying that. > > > You're probably thinking of the zap_page_range() vs. unmap_page_range() > topmost-level entrypoints as being the only places to set the flag in > details. Unconditionally clobbering the field in details in > zap_pud_range() should resolve any issues associated with that. > > This is obviously a style and/or diff-minimization issue. > I'm thinking of exit_mmap and unmap_vmas, that don't pass in a details field at all, and all the tests for details being marked unlikely. I ended up thinking it was less ugly this way. > > William Lee Irwin III wrote: > >>>The refcounting and PG_reserved activity in memmap_init_zone() is >>>superfluous. bootmem.c does all the necessary accounting internally. > > > On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > >>Well not superfluous yet. It is kept around to support all the arch >>code that still uses it (not much, mainly reserved memory reporting). > > > That activity is outside memmap_init_zone(). Specifically, the page > structures are not used for anything whatsoever before bootmem puts > them into the buddy allocators. It is not particularly interesting > or difficult to defer even the initialization to the precise point > in time bootmem cares to perform buddy bitmap insertion. This goes > for all fields of struct page. 
What's notable about PG_reserved and > refcounts here is that memmap_init_zone() goes about flipping bits one > way where free_all_bootmem_core() undoes all its work. > This patch doesn't care how it works, that would be for a later patch. > > William Lee Irwin III wrote: > >>>There is no error returned here to be handled by the caller. > > > On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > >>That's OK, the pte has been cleared. Nothing else we can do. > > > This is called in the interior of a loop, which may be beneficial to > terminate if this intended semantic is to be enforced. Furthermore, no > error is propagated to the caller, which is not the desired effect in > the stated error reporting scheme. So the code is inconsistent with > explicitly stated intention. > No, the error reporting scheme says it doesn't handle any error, that is all. What we have here in terms of behaviour is exactly what used to happen, that is - do something saneish on error. Changing behaviour would be outside the scope of this patch, but be my guest. > > William Lee Irwin III wrote: > >>>This has no effect but to artificially constrain the interface. There >>>is no a priori reason to avoid the use of install_page() to deposit >>>mappings to normal pages in VM_RESERVED vmas. It's only the reverse >>>being "banned" here. Similar comments also apply to zap_pte(). > > > On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > >>No, install_page is playing with the page (eg. page_add_file_rmap) >>which is explicity banned even before my PageReserved removal. It >>is unclear that this ever safely worked for normal pages, and will >>hit NULL page dereferences if trying to do it with iomem. > > > You are going on about the fact that install_page() can't be used on > memory outside mem_map[] as it requires a page structure, and can't be > used on reserved pages because page_add_file_rmap() will BUG. This case > is not being discussed. 
> And that it isn't allowed to touch struct page of physical pages in a VM_RESERVED region. > The issue at stake is inserting normal pages into a VM_RESERVED vma. > These will arise as e.g. kernel-allocated buffers managed by normal > reference counting. remap_pfn_range() can't do it; it refuses to > operate on "valid memory". install_page() now won't do it; it refuses > to touch a VM_RESERVED vma. So this creates a giant semantic hole, > and potentially breaks working code (i.e. if you were going to do > this you would need not only a replacement but also a sweep to adjust > all the drivers doing it or prove their nonexistence). > I think you'll find that remap_pfn_range will be happy to operate on valid memory, and that any driver trying to use install_page on VM_RESERVED probably needs fixing anyway. > > William Lee Irwin III wrote: > >>>An answer should be devised for this. My numerous SCSI CD-ROM devices >>>(I have 5 across several different machines of several different arches) >>>are rather unlikely to be happy with /* FIXME: XXX ... as an answer. > > > On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > >>The worst it will do is dirty a VM_RESERVED page. So it is going >>to work unless you're thinking about doing something crazy like >>mmap /dev/mem and send your CDROM some pages from there. But yeah, >>I have to find someone who knows what they're doing to look at this. > > > Then you should replace FIXME: XXX with this explanation. By and large > the presence of "FIXME: XXX" is a sign there is a gigantic hole in the > code. It should definitely not be done with new code, but rather, > exclusively confined to documenting discoveries of preexisting brokenness. > After all, if a patch is introducing broken code, why would we merge it? > Best to adjust the commentary and avoid that question altogether. > We wouldn't merge it. Hence this is an rfc and I explicitly said it is not for merging. 
> There are actually larger questions about this than the reserved page > handling. If e.g. pagecache pages need to be dirtied the raw bitflag > toggling is probably not how it should be done. > Yep. > > William Lee Irwin III wrote: > >>>snd_malloc_pages() marks all pages it allocates PG_reserved. This >>>merits some commentary, and likely the removal of the superfluous >>>PG_reserved usage. > > > On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > >>Sure, but not in this patch. The aim here is just to eliminate special >>casing of refcounting. Other PG_reserved usage can stay around for the >>moment (and is actually good for catching errors). > > > Unfortunately for this scheme, it's very much a case of putting the > cart before the horse. PG_reserved is toggled at random in this driver > after this change, to no useful effect (debugging or otherwise). And > this really goes for the whole affair. Diddling the core first is just > going to create bugs. Converting the users first is the way these > things need to be done. When complete, nothing needs the core flags > twiddling anymore and you just nuke the flag twiddling from the core. > I'm sorry, I don't see how 'diddling' the core will create bugs. This is a fine way to do it, and "converting" users first (whatever that means) is not possible because VM_RESERVED handling in core code is not up to the task of replacing PageReserved without this patch. > > William Lee Irwin III wrote: > >>>This is user-triggerable where the driver honors mmap() protection >>>flags blindly. > > > On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > >>If the user is allowed write access to VM_RESERVED memory, then I >>suspect there is a lot worse they can do than flood the log. >>But the check isn't going to stay around forever. > > > This doesn't really do a whole lot of good for the unwitting user > who invokes a privileged application relying on such kernel behavior. 
> User-visible changes need to be taken on with somewhat more care > (okay, vastly more, along with long-term backward compatibility). > It does a great lot of good, because they can tell us about it we'll fix it. Search the kernel sources and you'll find other examples that look almost exactly like this one. > > William Lee Irwin III wrote: > >>>This behavioral change needs to be commented on. There are some additional >>>difficulties when memory holes are unintentionally covered by mem_map[]; >>>It is beneficial otherwise. It's likely to triplefault on such holes. > > > On Thu, Jun 23, 2005 at 08:32:05PM +1000, Nick Piggin wrote: > >>It seems the author of this code themselves didn't really understand >>what was going on here, so I'm buggered if I can be bothered :) >>Remember though, PageReserved can stay around for as long as we need, >>so this hunk can be trivially reverted if it is an immediate problem. > > > This doesn't really fly. A fair number of drivers are poorly-understood > and numerous gyrations have to be gone through to avoid breaking their > possible assumptions until at long last clarifications are made (that > is, if they're ever made). swsusp is not demotable to below their > status on a whim. > Yep, that's why I'm going to ask some swsusp developers to have a look at it. I wouldn't pretend to be able to fix every bug everywhere in the kernel myself. Thanks, Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [patch][rfc] 5/5: core remove PageReserved 2005-06-23 23:21 ` Nick Piggin @ 2005-06-24 0:59 ` William Lee Irwin III 2005-06-24 1:17 ` Nick Piggin 2005-06-24 1:25 ` Nick Piggin 0 siblings, 2 replies; 19+ messages in thread From: William Lee Irwin III @ 2005-06-24 0:59 UTC (permalink / raw) To: Nick Piggin Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty William Lee Irwin III wrote: >> You're probably thinking of the zap_page_range() vs. unmap_page_range() >> topmost-level entrypoints as being the only places to set the flag in >> details. Unconditionally clobbering the field in details in >> zap_pud_range() should resolve any issues associated with that. >> This is obviously a style and/or diff-minimization issue. On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote: > I'm thinking of exit_mmap and unmap_vmas, that don't pass in a > details field at all, and all the tests for details being marked > unlikely. I ended up thinking it was less ugly this way. This doesn't pose any particular difficulty, but so noted since that's what you had in mind. William Lee Irwin III wrote: >> That activity is outside memmap_init_zone(). Specifically, the page >> structures are not used for anything whatsoever before bootmem puts >> them into the buddy allocators. It is not particularly interesting >> or difficult to defer even the initialization to the precise point >> in time bootmem cares to perform buddy bitmap insertion. This goes >> for all fields of struct page. What's notable about PG_reserved and >> refcounts here is that memmap_init_zone() goes about flipping bits one >> way where free_all_bootmem_core() undoes all its work. On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote: > This patch doesn't care how it works, that would be for a later patch. 
The general gist of all this is that the patch doesn't cover anywhere near enough ground, and so the above illustrates that the usage in/around bootmem is among the easiest of usages to remove. I have implemented what I'm talking about on several independent occasions. William Lee Irwin III wrote: >> This is called in the interior of a loop, which may be beneficial to >> terminate if this intended semantic is to be enforced. Furthermore, no >> error is propagated to the caller, which is not the desired effect in >> the stated error reporting scheme. So the code is inconsistent with >> explicitly stated intention. On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote: > No, the error reporting scheme says it doesn't handle any error, > that is all. What we have here in terms of behaviour is exactly > what used to happen, that is - do something saneish on error. > Changing behaviour would be outside the scope of this patch, but > be my guest. Some places BUG(), some back out with an error return, others blithely proceed. This kind of inconsistency will broadly confuse callers of the API's. William Lee Irwin III wrote: >> You are going on about the fact that install_page() can't be used on >> memory outside mem_map[] as it requires a page structure, and can't be >> used on reserved pages because page_add_file_rmap() will BUG. This case >> is not being discussed. On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote: > And that it isn't allowed to touch struct page of physical pages > in a VM_RESERVED region. non sequitur William Lee Irwin III wrote: >> The issue at stake is inserting normal pages into a VM_RESERVED vma. >> These will arise as e.g. kernel-allocated buffers managed by normal >> reference counting. remap_pfn_range() can't do it; it refuses to >> operate on "valid memory". install_page() now won't do it; it refuses >> to touch a VM_RESERVED vma. So this creates a giant semantic hole, >> and potentially breaks working code (i.e. 
if you were going to do >> this you would need not only a replacement but also a sweep to adjust >> all the drivers doing it or prove their nonexistence). On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote: > I think you'll find that remap_pfn_range will be happy to operate > on valid memory, and that any driver trying to use install_page > on VM_RESERVED probably needs fixing anyway. install_page() of a !PageReserved() page in a VM_RESERVED vma is neither broken now nor does it merit going BUG(). /dev/mem was disallowed from mapping ordinary kernel memory for a reason (though I disagree with it), so the removal of the pfn_valid()/PageReserved() checks can't be blithely done like in your patch. Other arrangements must be made. William Lee Irwin III wrote: >> Unfortunately for this scheme, it's very much a case of putting the >> cart before the horse. PG_reserved is toggled at random in this driver >> after this change, to no useful effect (debugging or otherwise). And >> this really goes for the whole affair. Diddling the core first is just >> going to create bugs. Converting the users first is the way these >> things need to be done. When complete, nothing needs the core flags >> twiddling anymore and you just nuke the flag twiddling from the core. On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote: > I'm sorry, I don't see how 'diddling' the core will create bugs. > This is a fine way to do it, and "converting" users first (whatever > that means) [cut text included in full in the following quote block] This is going way too far. Someone please deal with this. On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote: [continued] > This is a fine way to do it, and "converting" users first (whatever > that means) is not possible because VM_RESERVED handling in core > code is not up to the task of replacing PageReserved without this > patch. You aren't replacing any of the PG_reserved usage with VM_RESERVED in any of these patches. 
The primary motive of all this is AFAICT little more than getting the PG_reserved check out of put_page(), drivers and arches be damned. William Lee Irwin III wrote: >> This doesn't really fly. A fair number of drivers are poorly-understood >> and numerous gyrations have to be gone through to avoid breaking their >> possible assumptions until at long last clarifications are made (that >> is, if they're ever made). swsusp is not demotable to below their >> status on a whim. On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote: > Yep, that's why I'm going to ask some swsusp developers to have > a look at it. Do you have any idea who wrote the comment you referred to? On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote: > I wouldn't pretend to be able to fix every bug everywhere in the > kernel myself. I've got bad news for you. You will need to be willing and able to read and alter the source of things as unglamorous as crappy ancient drivers and obscure dogslow architectures and to do so carefully enough things continue to work when you're changing broadly-used code. You will never have the luxury of the hardware to test anywhere near 10% of it on. Worse yet, when you make the changes, you bear the burden of sweeps. The fact they break when you make your change does not mean they were buggy. It means you broke them. Sure, best effort is all you can ever do, but this is far from anything like that. If you can't handle the sweep, you have no business writing the patch. Someone please take this over. None of this will be heeded so long as I'm the one saying it. -- wli -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [patch][rfc] 5/5: core remove PageReserved 2005-06-24 0:59 ` William Lee Irwin III @ 2005-06-24 1:17 ` Nick Piggin 2005-06-24 1:47 ` Nick Piggin 2005-06-24 1:25 ` Nick Piggin 1 sibling, 1 reply; 19+ messages in thread From: Nick Piggin @ 2005-06-24 1:17 UTC (permalink / raw) To: William Lee Irwin III Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty William Lee Irwin III wrote: > On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote: > >>This patch doesn't care how it works, that would be for a later patch. > > > The general gist of all this is that the patch doesn't cover anywhere > near enough ground, and so the above illustrates that the usage > in/around bootmem is among the easiest of usages to remove. I have > implemented what I'm talking about on several independent occasions. > I don't think you understand what ground the patch covers. It removes PageReserved() queries from core code, leaving all PG_reserved flags and other uses of it intact so it doesn't cause mass breakage. Things can then be looked at and fixed up properly at a slower pace while the patch is in -mm, for example. > > William Lee Irwin III wrote: > >>>This is called in the interior of a loop, which may be beneficial to >>>terminate if this intended semantic is to be enforced. Furthermore, no >>>error is propagated to the caller, which is not the desired effect in >>>the stated error reporting scheme. So the code is inconsistent with >>>explicitly stated intention. > > > On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote: > >>No, the error reporting scheme says it doesn't handle any error, >>that is all. What we have here in terms of behaviour is exactly >>what used to happen, that is - do something saneish on error. >>Changing behaviour would be outside the scope of this patch, but >>be my guest. > > > Some places BUG(), some back out with an error return, others blithely > proceed. 
This kind of inconsistency will broadly confuse callers of > the API's. > I don't think any places BUG(), but regardless, this patch doesn't deal with changing behaviour, just reporting of the problem. > > William Lee Irwin III wrote: > >>>You are going on about the fact that install_page() can't be used on >>>memory outside mem_map[] as it requires a page structure, and can't be >>>used on reserved pages because page_add_file_rmap() will BUG. This case >>>is not being discussed. > > > On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote: > >>And that it isn't allowed to touch struct page of physical pages >>in a VM_RESERVED region. > > > non sequitur > Huh? That is why it is broken. Previously, PageReserved pages *were not* accounted with the normal rmap and friends in core code, so you can't just do it in here and hope it works. > > William Lee Irwin III wrote: > >>>The issue at stake is inserting normal pages into a VM_RESERVED vma. >>>These will arise as e.g. kernel-allocated buffers managed by normal >>>reference counting. remap_pfn_range() can't do it; it refuses to >>>operate on "valid memory". install_page() now won't do it; it refuses >>>to touch a VM_RESERVED vma. So this creates a giant semantic hole, >>>and potentially breaks working code (i.e. if you were going to do >>>this you would need not only a replacement but also a sweep to adjust >>>all the drivers doing it or prove their nonexistence). > > > On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote: > >>I think you'll find that remap_pfn_range will be happy to operate >>on valid memory, and that any driver trying to use install_page >>on VM_RESERVED probably needs fixing anyway. > > > install_page() of a !PageReserved() page in a VM_RESERVED vma is > neither broken now nor does it merit going BUG(). > Hugh says it is broken, so you can ask him the details. 
> /dev/mem was disallowed from mapping ordinary kernel memory for a > reason (though I disagree with it), so the removal of the > pfn_valid()/PageReserved() checks can't be blithely done like in > your patch. Other arrangements must be made. > > > William Lee Irwin III wrote: > >>>Unfortunately for this scheme, it's very much a case of putting the >>>cart before the horse. PG_reserved is toggled at random in this driver >>>after this change, to no useful effect (debugging or otherwise). And >>>this really goes for the whole affair. Diddling the core first is just >>>going to create bugs. Converting the users first is the way these >>>things need to be done. When complete, nothing needs the core flags >>>twiddling anymore and you just nuke the flag twiddling from the core. > > > On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote: > >>I'm sorry, I don't see how 'diddling' the core will create bugs. >>This is a fine way to do it, and "converting" users first (whatever >>that means) > > [cut text included in full in the following quote block] > > This is going way too far. Someone please deal with this. > No, just tell me how it might magically create bugs in drivers that didn't exist in the first place? > > On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote: > [continued] > >>This is a fine way to do it, and "converting" users first (whatever >>that means) is not possible because VM_RESERVED handling in core >>code is not up to the task of replacing PageReserved without this >>patch. > > > You aren't replacing any of the PG_reserved usage with VM_RESERVED > in any of these patches. The primary motive of all this is AFAICT > little more than getting the PG_reserved check out of put_page(), > drivers and arches be damned. > Excuse me? drivers work, arches work. And how might you "fix" them first, without the necessary core code in place? 
* Re: [patch][rfc] 5/5: core remove PageReserved 2005-06-24 1:17 ` Nick Piggin @ 2005-06-24 1:47 ` Nick Piggin 0 siblings, 0 replies; 19+ messages in thread From: Nick Piggin @ 2005-06-24 1:47 UTC (permalink / raw) To: William Lee Irwin III Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty

Nick Piggin wrote:
> William Lee Irwin III wrote:
>> On Fri, Jun 24, 2005 at 09:21:31AM +1000, Nick Piggin wrote:
>>> I'm sorry, I don't see how 'diddling' the core will create bugs.
>>> This is a fine way to do it, and "converting" users first (whatever
>>> that means)
>>
>> [cut text included in full in the following quote block]
>>
>> This is going way too far. Someone please deal with this.
>>
>
> No, just tell me how it might magically create bugs in drivers
> that didn't exist in the first place?

Sorry, I got a bit carried away. That remark wasn't helpful. What I mean is: if you see an aspect of this change that will cause breakage in previously correct drivers, please raise your specific concerns with me. I have tried to be careful about this, and put in bugs/warnings where problems in already broken code are magnified.
* Re: [patch][rfc] 5/5: core remove PageReserved 2005-06-24 0:59 ` William Lee Irwin III 2005-06-24 1:17 ` Nick Piggin @ 2005-06-24 1:25 ` Nick Piggin 1 sibling, 0 replies; 19+ messages in thread From: Nick Piggin @ 2005-06-24 1:25 UTC (permalink / raw) To: William Lee Irwin III Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty

William Lee Irwin III wrote:
>> I wouldn't pretend to be able to fix every bug everywhere in the
>> kernel myself.
>
> If you can't handle the sweep, you have no business writing the patch.
>

Let me just say that where I have introduced incompatibilities I will try to look through drivers and arch code. I can't look through and find everything that is already buggy though.
* Re: [patch][rfc] 5/5: core remove PageReserved 2005-06-23 9:51 ` William Lee Irwin III 2005-06-23 10:32 ` Nick Piggin @ 2005-06-24 4:50 ` Andrew Morton 2005-06-24 8:24 ` William Lee Irwin III 2005-06-26 8:41 ` Nick Piggin 1 sibling, 2 replies; 19+ messages in thread From: Andrew Morton @ 2005-06-24 4:50 UTC (permalink / raw) To: William Lee Irwin III Cc: nickpiggin, linux-kernel, linux-mm, hugh, pbadari, linux-scsi William Lee Irwin III <wli@holomorphy.com> wrote: > > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > > Index: linux-2.6/drivers/scsi/sg.c > > =================================================================== > > --- linux-2.6.orig/drivers/scsi/sg.c > > +++ linux-2.6/drivers/scsi/sg.c > > @@ -1887,9 +1887,10 @@ st_unmap_user_pages(struct scatterlist * > > int i; > > > > for (i=0; i < nr_pages; i++) { > > - if (dirtied && !PageReserved(sgl[i].page)) > > + if (dirtied) > > SetPageDirty(sgl[i].page); > > /* unlock_page(sgl[i].page); */ > > + /* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */ > > /* FIXME: cache flush missing for rw==READ > > * FIXME: call the correct reference counting function > > */ > > An answer should be devised for this. My numerous SCSI CD-ROM devices > (I have 5 across several different machines of several different arches) > are rather unlikely to be happy with /* FIXME: XXX ... as an answer. > > > On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote: > > Index: linux-2.6/drivers/scsi/st.c > > =================================================================== > > --- linux-2.6.orig/drivers/scsi/st.c > > +++ linux-2.6/drivers/scsi/st.c > > @@ -4435,8 +4435,9 @@ static int sgl_unmap_user_pages(struct s > > int i; > > > > for (i=0; i < nr_pages; i++) { > > - if (dirtied && !PageReserved(sgl[i].page)) > > + if (dirtied) > > SetPageDirty(sgl[i].page); > > + /* FIXME: XXX don't dirty/unmap VM_RESERVED regions? 
*/ > > /* FIXME: cache flush missing for rw==READ > > * FIXME: call the correct reference counting function > > */ > > Mutatis mutandis for my SCSI tape drive. This scsi code is already rather wrong. There isn't much point in just setting PG_dirty and leaving the page marked as clean in the radix tree. As it is we'll lose data if the user reads it into a MAP_SHARED memory buffer. set_page_dirty_lock() should be used here. That can sleep. <looks> The above two functions are called under write_lock_irqsave() (at least) and might be called from irq context (dunno). So we cannot use set_page_dirty_lock() and we don't have a ref on the page's inode. We could use set_page_dirty() and be racy against page reclaim. But to get all this correct (and it's very incorrect now) we'd need to punt the page dirtying up to process context, along the lines of bio_check_pages_dirty(). Or, if st_unmap_user_pages() and sgl_unmap_user_pages() are not called from irq context then we should arrange for them to be called without locks held and use set_page_dirty_lock(). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [patch][rfc] 5/5: core remove PageReserved 2005-06-24 4:50 ` Andrew Morton @ 2005-06-24 8:24 ` William Lee Irwin III 2005-06-26 8:41 ` Nick Piggin 1 sibling, 0 replies; 19+ messages in thread From: William Lee Irwin III @ 2005-06-24 8:24 UTC (permalink / raw) To: Andrew Morton Cc: nickpiggin, linux-kernel, linux-mm, hugh, pbadari, linux-scsi William Lee Irwin III <wli@holomorphy.com> wrote: >> An answer should be devised for this. My numerous SCSI CD-ROM devices >> (I have 5 across several different machines of several different arches) >> are rather unlikely to be happy with /* FIXME: XXX ... as an answer. [...] >> Mutatis mutandis for my SCSI tape drive. On Thu, Jun 23, 2005 at 09:50:11PM -0700, Andrew Morton wrote: > This scsi code is already rather wrong. There isn't much point in just > setting PG_dirty and leaving the page marked as clean in the radix tree. > As it is we'll lose data if the user reads it into a MAP_SHARED memory > buffer. > set_page_dirty_lock() should be used here. That can sleep. > The above two functions are called under write_lock_irqsave() (at least) > and might be called from irq context (dunno). So we cannot use > set_page_dirty_lock() and we don't have a ref on the page's inode. We > could use set_page_dirty() and be racy against page reclaim. > But to get all this correct (and it's very incorrect now) we'd need to punt > the page dirtying up to process context, along the lines of > bio_check_pages_dirty(). > Or, if st_unmap_user_pages() and sgl_unmap_user_pages() are not called from > irq context then we should arrange for them to be called without locks held > and use set_page_dirty_lock(). This all sounds very reasonable. I was originally more concerned about the new FIXME getting introduced but this sounds like a good way to resolve the preexisting FIXME's surrounding all this. -- wli -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
* Re: [patch][rfc] 5/5: core remove PageReserved 2005-06-24 4:50 ` Andrew Morton 2005-06-24 8:24 ` William Lee Irwin III @ 2005-06-26 8:41 ` Nick Piggin 1 sibling, 0 replies; 19+ messages in thread From: Nick Piggin @ 2005-06-26 8:41 UTC (permalink / raw) To: Andrew Morton Cc: William Lee Irwin III, linux-kernel, linux-mm, hugh, pbadari, linux-scsi

Andrew Morton wrote:
> William Lee Irwin III <wli@holomorphy.com> wrote:
>> Mutatis mutandis for my SCSI tape drive.
>

OK, for the VM_RESERVED case it looks like it won't be much of a problem, because get_user_pages faults on VM_IO regions (VM_IO is already set by remap_pfn_range, which is used by mem.c and most drivers). So this code will simply not encounter VM_RESERVED regions. Obviously get_user_pages should be made to check explicitly for VM_RESERVED too, but the point is that introducing such a check will not overly restrict drivers.

[snip SetPageDirty is wrong]

Not that this helps the existing bug...

-- SUSE Labs, Novell Inc.
* Re: [patch][rfc] 2/5: micro optimisation for mm/rmap.c 2005-06-23 7:06 ` [patch][rfc] 2/5: micro optimisation " Nick Piggin 2005-06-23 7:07 ` [patch][rfc] 3/5: remove atomic bitop when freeing page Nick Piggin @ 2005-06-23 7:26 ` William Lee Irwin III 2005-06-23 7:33 ` Nick Piggin 1 sibling, 1 reply; 19+ messages in thread From: William Lee Irwin III @ 2005-06-23 7:26 UTC (permalink / raw) To: Nick Piggin Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty

On Thu, Jun 23, 2005 at 05:06:35PM +1000, Nick Piggin wrote:
> + index = (address - vma->vm_start) >> PAGE_SHIFT;
> + index += vma->vm_pgoff;
> + index >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
> + page->index = index;

linear_page_index()

-- wli
* Re: [patch][rfc] 2/5: micro optimisation for mm/rmap.c 2005-06-23 7:26 ` [patch][rfc] 2/5: micro optimisation for mm/rmap.c William Lee Irwin III @ 2005-06-23 7:33 ` Nick Piggin 0 siblings, 0 replies; 19+ messages in thread From: Nick Piggin @ 2005-06-23 7:33 UTC (permalink / raw) To: William Lee Irwin III Cc: linux-kernel, Linux Memory Management, Hugh Dickins, Badari Pulavarty [-- Attachment #1: Type: text/plain, Size: 347 bytes --] William Lee Irwin III wrote: > On Thu, Jun 23, 2005 at 05:06:35PM +1000, Nick Piggin wrote: > >>+ index = (address - vma->vm_start) >> PAGE_SHIFT; >>+ index += vma->vm_pgoff; >>+ index >>= PAGE_CACHE_SHIFT - PAGE_SHIFT; >>+ page->index = index; > > > linear_page_index() > Ah indeed it is, thanks. I'll queue this up as patch 2.5, then? [-- Attachment #2: mm-cleanup-rmap.patch --] [-- Type: text/plain, Size: 751 bytes --] Use linear_page_index in mm/rmap.c, as noted by Bill Irwin. Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -448,16 +448,11 @@ void page_add_anon_rmap(struct page *pag if (atomic_inc_and_test(&page->_mapcount)) { struct anon_vma *anon_vma = vma->anon_vma; - pgoff_t index; - BUG_ON(!anon_vma); anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; page->mapping = (struct address_space *) anon_vma; - index = (address - vma->vm_start) >> PAGE_SHIFT; - index += vma->vm_pgoff; - index >>= PAGE_CACHE_SHIFT - PAGE_SHIFT; - page->index = index; + page->index = linear_page_index(vma, address); inc_page_state(nr_mapped); } ^ permalink raw reply [flat|nested] 19+ messages in thread
Thread overview: 19+ messages

2005-06-23 7:05 [patch][rfc] 0/5: remove PageReserved Nick Piggin
2005-06-23 7:06 ` [patch][rfc] 1/5: comment for mm/rmap.c Nick Piggin
2005-06-23 7:06 ` [patch][rfc] 2/5: micro optimisation " Nick Piggin
2005-06-23 7:07 ` [patch][rfc] 3/5: remove atomic bitop when freeing page Nick Piggin
2005-06-23 7:07 ` [patch][rfc] 4/5: remap ZERO_PAGE mappings Nick Piggin
2005-06-23 7:08 ` [patch][rfc] 5/5: core remove PageReserved Nick Piggin
2005-06-23 9:51 ` William Lee Irwin III
2005-06-23 10:32 ` Nick Piggin
2005-06-23 22:08 ` William Lee Irwin III
2005-06-23 23:21 ` Nick Piggin
2005-06-24 0:59 ` William Lee Irwin III
2005-06-24 1:17 ` Nick Piggin
2005-06-24 1:47 ` Nick Piggin
2005-06-24 1:25 ` Nick Piggin
2005-06-24 4:50 ` Andrew Morton
2005-06-24 8:24 ` William Lee Irwin III
2005-06-26 8:41 ` Nick Piggin
2005-06-23 7:26 ` [patch][rfc] 2/5: micro optimisation for mm/rmap.c William Lee Irwin III
2005-06-23 7:33 ` Nick Piggin