From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 23 Jun 2008 21:58:42 +0100 (BST)
From: Hugh Dickins
Subject: Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
In-Reply-To: <20080623175203.GI10123@sgi.com>
Message-ID:
References: <20080618164158.GC10062@sgi.com> <200806190329.30622.nickpiggin@yahoo.com.au> <200806191307.04499.nickpiggin@yahoo.com.au> <20080619133809.GC10123@sgi.com> <20080623155400.GH10123@sgi.com> <20080623175203.GI10123@sgi.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Robin Holt
Cc: Nick Piggin, Ingo Molnar, Christoph Lameter, Jack Steiner, Rik van Riel, Lee Schermerhorn, linux-mm@kvack.org
List-ID:

On Mon, 23 Jun 2008, Robin Holt wrote:
> On Mon, Jun 23, 2008 at 05:48:17PM +0100, Hugh Dickins wrote:
>
> > reuse test in do_wp_page that I'm still working on - of which Nick sent
> > a lock_page approximation for you to try?  Would you still be able to
> > try mine when I'm ready, or does it now appear irrelevant to you?
>
> Before your response, I had convinced myself my problem was specific to
> XPMEM, but I see your point and may agree that it is a problem for all
> get_user_pages() users.
>
> I can certainly test when you have it ready.

Thanks a lot, Robin.  Here it is below.

>
> I had confused myself about Nick's first patch.  I will give that
> another look over and see if it fixes the problem.

Nick's _first_ patch?  The one I was thinking of was the one with
lock_page in do_wp_page, but it shouldn't be necessary now if what's
below works - though his is a much smaller and less worrying patch,
so anyone looking for a quick fix to the issue might well prefer his.

>
> > http://lkml.org/lkml/2006/9/14/384
> >
> > but it's a broken thread, with misunderstanding on all sides,
> > so rather hard to get a grasp of it.
>
> That is extremely similar to the issue I am seeing.  I think that if
> Infiniband were using the mmu_notifier stuff, they would be closer, but
> IIRC, there are significant hardware restrictions which prevent demand
> paging from working on some IB devices.

Ah, I'm glad you've managed to glean something from it, good.

Here's the rollup of the patches I'm proposing for two issues: it doesn't
get my signoff yet because I'll want to split it into little stages
properly commented, and I'll want to do more strenuous testing; but this
shouldn't blow up in your face.  Against 2.6.26-rc7, should apply quite
easily to earlier, but a little more work against 2.6.26-rc5-mm3 - though
it simplifies some of that too (Rik+Lee Cced).

I say two issues, two competing issues.  One is the issue which may be
your issue, that we want to decide COW in do_wp_page without needing
PageLocked, so a concurrent shrink_page_list() doesn't occasionally force
an unwanted COW, messing up some get_user_pages() uses.  The other, of no
interest to you, is that we do want PageLocked when deciding COW in
do_wp_page, because that's a good moment to free up the swap space -
leaving the modified page with swap just makes a big seek likely when
it's next written to swap.
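For illustration only - a rough 2.6.26-era sketch of how that first issue
can bite a get_user_pages() caller; the driver path and the names uaddr
and page here are hypothetical, not taken from the patch below:

	struct page *page;
	unsigned long uaddr;	/* some user address the driver will DMA to */
	int ret;

	down_read(&current->mm->mmap_sem);
	ret = get_user_pages(current, current->mm, uaddr, 1,
			     1 /* write */, 1 /* force */, &page, NULL);
	up_read(&current->mm->mmap_sem);

	/*
	 * If a later write fault on uaddr reaches do_wp_page() while the
	 * (still pinned) page sits in swap cache with its lock held by
	 * someone like shrink_page_list(), the old test
	 *
	 *	if (!TestSetPageLocked(old_page)) ...
	 *
	 * cannot take the lock, leaves reuse == 0, and so the fault copies
	 * the page: the task's pte now points at a fresh copy while "page"
	 * pinned above still refers to the old one, and the two no longer
	 * see each other's writes.
	 */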
Hugh

 include/linux/swap.h |   21 ++++--
 mm/memory.c          |   15 +++-
 mm/migrate.c         |    5 -
 mm/page_io.c         |    2
 mm/swap_state.c      |   12 ++-
 mm/swapfile.c        |  128 ++++++++++++++++++++++------------------------
 6 files changed, 105 insertions(+), 78 deletions(-)

--- 2.6.26-rc7/include/linux/swap.h	2008-05-03 21:55:11.000000000 +0100
+++ linux/include/linux/swap.h	2008-06-23 18:12:47.000000000 +0100
@@ -250,8 +250,9 @@ extern unsigned int count_swap_pages(int
 extern sector_t map_swap_page(struct swap_info_struct *, pgoff_t);
 extern sector_t swapdev_block(int, pgoff_t);
 extern struct swap_info_struct *get_swap_info_struct(unsigned);
-extern int can_share_swap_page(struct page *);
-extern int remove_exclusive_swap_page(struct page *);
+extern int can_reuse_swap_page_unlocked(struct page *);
+extern int can_reuse_swap_page_locked(struct page *);
+extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 
 extern spinlock_t swap_lock;
@@ -319,8 +320,6 @@ static inline struct page *lookup_swap_c
 	return NULL;
 }
 
-#define can_share_swap_page(p)			(page_mapcount(p) == 1)
-
 static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
 							gfp_t gfp_mask)
 {
@@ -337,9 +336,19 @@ static inline void delete_from_swap_cach
 
 #define swap_token_default_timeout	0
 
-static inline int remove_exclusive_swap_page(struct page *p)
+static inline int can_reuse_swap_page_unlocked(struct page *page)
 {
-	return 0;
+	return 0;	/* irrelevant: never called */
+}
+
+static inline int can_reuse_swap_page_locked(struct page *page)
+{
+	return 0;	/* irrelevant: never called */
+}
+
+static inline int try_to_free_swap(struct page *page)
+{
+	return 0;	/* irrelevant: never called */
 }
 
 static inline swp_entry_t get_swap_page(void)
--- 2.6.26-rc7/mm/memory.c	2008-06-21 08:41:19.000000000 +0100
+++ linux/mm/memory.c	2008-06-23 18:12:47.000000000 +0100
@@ -1686,9 +1686,14 @@ static int do_wp_page(struct mm_struct *
 	 * not dirty accountable.
 	 */
 	if (PageAnon(old_page)) {
-		if (!TestSetPageLocked(old_page)) {
-			reuse = can_share_swap_page(old_page);
-			unlock_page(old_page);
+		if (page_mapcount(old_page) == 1) {
+			if (!PageSwapCache(old_page))
+				reuse = 1;
+			else if (!TestSetPageLocked(old_page)) {
+				reuse = can_reuse_swap_page_locked(old_page);
+				unlock_page(old_page);
+			} else
+				reuse = can_reuse_swap_page_unlocked(old_page);
 		}
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 					(VM_WRITE|VM_SHARED))) {
@@ -2185,7 +2190,7 @@ static int do_swap_page(struct mm_struct
 
 	inc_mm_counter(mm, anon_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if (write_access && can_share_swap_page(page)) {
+	if (write_access && can_reuse_swap_page_locked(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		write_access = 0;
 	}
@@ -2196,7 +2201,7 @@ static int do_swap_page(struct mm_struct
 
 	swap_free(entry);
 	if (vm_swap_full())
-		remove_exclusive_swap_page(page);
+		try_to_free_swap(page);
 	unlock_page(page);
 
 	if (write_access) {
--- 2.6.26-rc7/mm/migrate.c	2008-06-21 08:41:19.000000000 +0100
+++ linux/mm/migrate.c	2008-06-23 18:12:47.000000000 +0100
@@ -330,8 +330,10 @@ static int migrate_page_move_mapping(str
 	get_page(newpage);	/* add cache reference */
 #ifdef CONFIG_SWAP
 	if (PageSwapCache(page)) {
-		SetPageSwapCache(newpage);
 		set_page_private(newpage, page_private(page));
+		/* page_swapcount() relies on private whenever PageSwapCache */
+		smp_wmb();
+		SetPageSwapCache(newpage);
 	}
 #endif
 
@@ -398,7 +400,6 @@ static void migrate_page_copy(struct pag
 #endif
 	ClearPageActive(page);
 	ClearPagePrivate(page);
-	set_page_private(page, 0);
 	page->mapping = NULL;
 
 	/*
--- 2.6.26-rc7/mm/page_io.c	2008-04-17 03:49:44.000000000 +0100
+++ linux/mm/page_io.c	2008-06-23 18:12:47.000000000 +0100
@@ -98,7 +98,7 @@ int swap_writepage(struct page *page, st
 	struct bio *bio;
 	int ret = 0, rw = WRITE;
 
-	if (remove_exclusive_swap_page(page)) {
+	if (try_to_free_swap(page)) {
 		unlock_page(page);
 		goto out;
 	}
--- 2.6.26-rc7/mm/swap_state.c	2008-05-03 21:55:12.000000000 +0100
+++ linux/mm/swap_state.c	2008-06-23 18:12:47.000000000 +0100
@@ -76,13 +76,15 @@ int add_to_swap_cache(struct page *page,
 	BUG_ON(PagePrivate(page));
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
+		set_page_private(page, entry.val);
+		/* page_swapcount() relies on private whenever PageSwapCache */
+		smp_wmb();
 		write_lock_irq(&swapper_space.tree_lock);
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
 			page_cache_get(page);
 			SetPageSwapCache(page);
-			set_page_private(page, entry.val);
 			total_swapcache_pages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 			INC_CACHE_INFO(add_total);
@@ -105,7 +107,6 @@ void __delete_from_swap_cache(struct pag
 	BUG_ON(PagePrivate(page));
 
 	radix_tree_delete(&swapper_space.page_tree, page_private(page));
-	set_page_private(page, 0);
 	ClearPageSwapCache(page);
 	total_swapcache_pages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
@@ -188,13 +189,14 @@ void delete_from_swap_cache(struct page
  *
  * Its ok to check for PageSwapCache without the page lock
  * here because we are going to recheck again inside
- * exclusive_swap_page() _with_ the lock.
+ * try_to_free_swap() _with_ the lock.
  *					- Marcelo
  */
 static inline void free_swap_cache(struct page *page)
 {
-	if (PageSwapCache(page) && !TestSetPageLocked(page)) {
-		remove_exclusive_swap_page(page);
+	if (PageSwapCache(page) && !page_mapped(page) &&
+	    !TestSetPageLocked(page)) {
+		try_to_free_swap(page);
 		unlock_page(page);
 	}
 }
--- 2.6.26-rc7/mm/swapfile.c	2008-05-03 21:55:12.000000000 +0100
+++ linux/mm/swapfile.c	2008-06-23 18:12:47.000000000 +0100
@@ -251,7 +251,6 @@ static struct swap_info_struct * swap_in
 		goto bad_offset;
 	if (!p->swap_map[offset])
 		goto bad_free;
-	spin_lock(&swap_lock);
 	return p;
 
 bad_free:
@@ -300,90 +299,104 @@ void swap_free(swp_entry_t entry)
 
 	p = swap_info_get(entry);
 	if (p) {
+		spin_lock(&swap_lock);
 		swap_entry_free(p, swp_offset(entry));
 		spin_unlock(&swap_lock);
 	}
 }
 
 /*
- * How many references to page are currently swapped out?
+ * How many page table references are there to this page's swap entry?
+ * including references to the page itself if add_mapcount.
+ *
+ * page_swapcount is only called on a page which was recently PageSwapCache
+ * (but might have been deleted from swap cache just before getting here).
+ *
+ * When called with PageLocked, the swapcount is stable, and the mapcount
+ * cannot rise (but may fall if the page is concurrently unmapped from a
+ * page table whose lock we don't hold).
+ *
+ * When called without PageLocked, the swapcount is stable while we hold
+ * swap_lock, and the mapcount cannot rise from 1 while we hold that page
+ * table lock (but may fall if the page is concurrently unmapped from a
+ * page table whose lock we don't hold).
+ *
+ * do_swap_page and unuse_pte call page_add_anon_rmap before swap_free,
+ * try_to_unmap_one calls swap_duplicate before page_remove_rmap:
+ * so in general, swapcount+mapcount should never be seen too low -
+ * but you need to consider more memory barriers if extending its use.
  */
-static inline int page_swapcount(struct page *page)
+static int page_swapcount(struct page *page, int add_mapcount)
 {
 	int count = 0;
 	struct swap_info_struct *p;
 	swp_entry_t entry;
 
-	entry.val = page_private(page);
-	p = swap_info_get(entry);
-	if (p) {
-		/* Subtract the 1 for the swap cache itself */
-		count = p->swap_map[swp_offset(entry)] - 1;
-		spin_unlock(&swap_lock);
+	spin_lock(&swap_lock);
+	if (add_mapcount)
+		count = page_mapcount(page);
+	if (PageSwapCache(page)) {
+		/* We can rely on page_private once PageSwapCache is visible */
+		smp_rmb();
+		entry.val = page_private(page);
+		p = swap_info_get(entry);
+		if (p) {
+			/* Subtract the 1 for the swap cache itself */
+			count += p->swap_map[swp_offset(entry)] - 1;
+		}
 	}
+	spin_unlock(&swap_lock);
 	return count;
 }
 
 /*
- * We can use this swap cache entry directly
- * if there are no other references to it.
+ * Can do_wp_page() make the faulting swapcache page writable without COW?
+ * But something else, probably shrink_page_list(), already has PageLocked.
+ * Don't be misled to COW the page unnecessarily: check swapcount+mapcount.
  */
-int can_share_swap_page(struct page *page)
+int can_reuse_swap_page_unlocked(struct page *page)
 {
-	int count;
-
-	BUG_ON(!PageLocked(page));
-	count = page_mapcount(page);
-	if (count <= 1 && PageSwapCache(page))
-		count += page_swapcount(page);
-	return count == 1;
+	return page_swapcount(page, 1) == 1;
 }
 
 /*
- * Work out if there are any other processes sharing this
- * swap cache page. Free it if you can. Return success.
+ * Can do_wp_page() make the faulting swapcache page writable without COW?
+ * having acquired PageLocked.  In this case, since we have that lock and
+ * are about to modify the page, we'd better free its swap space - it won't
+ * be read again, and writing there later would probably require an extra seek.
  */
-int remove_exclusive_swap_page(struct page *page)
+int can_reuse_swap_page_locked(struct page *page)
 {
-	int retval;
-	struct swap_info_struct * p;
-	swp_entry_t entry;
+	VM_BUG_ON(!PageLocked(page));
+	if (page_swapcount(page, 1) != 1)
+		return 0;
+	if (!PageSwapCache(page))
+		return 1;
+	if (PageWriteback(page))
+		return 1;
 
-	BUG_ON(PagePrivate(page));
-	BUG_ON(!PageLocked(page));
+	delete_from_swap_cache(page);
+	SetPageDirty(page);
+	return 1;
+}
 
+/*
+ * If swap is getting full, or if there are no more mappings of this page,
+ * then try_to_free_swap is called to free its swap space.
+ */
+int try_to_free_swap(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page));
 	if (!PageSwapCache(page))
 		return 0;
 	if (PageWriteback(page))
 		return 0;
-	if (page_count(page) != 2) /* 2: us + cache */
+	if (page_swapcount(page, 0))	/* Here we don't care about mapcount */
 		return 0;
 
-	entry.val = page_private(page);
-	p = swap_info_get(entry);
-	if (!p)
-		return 0;
-
-	/* Is the only swap cache user the cache itself? */
-	retval = 0;
-	if (p->swap_map[swp_offset(entry)] == 1) {
-		/* Recheck the page count with the swapcache lock held.. */
-		write_lock_irq(&swapper_space.tree_lock);
-		if ((page_count(page) == 2) && !PageWriteback(page)) {
-			__delete_from_swap_cache(page);
-			SetPageDirty(page);
-			retval = 1;
-		}
-		write_unlock_irq(&swapper_space.tree_lock);
-	}
-	spin_unlock(&swap_lock);
-
-	if (retval) {
-		swap_free(entry);
-		page_cache_release(page);
-	}
-
-	return retval;
+	delete_from_swap_cache(page);
+	SetPageDirty(page);
+	return 1;
 }
 
 /*
@@ -400,6 +413,7 @@ void free_swap_and_cache(swp_entry_t ent
 
 	p = swap_info_get(entry);
 	if (p) {
+		spin_lock(&swap_lock);
 		if (swap_entry_free(p, swp_offset(entry)) == 1) {
 			page = find_get_page(&swapper_space, entry.val);
 			if (page && unlikely(TestSetPageLocked(page))) {
@@ -410,14 +424,10 @@ void free_swap_and_cache(swp_entry_t ent
 		spin_unlock(&swap_lock);
 	}
 	if (page) {
-		int one_user;
-
-		BUG_ON(PagePrivate(page));
-		one_user = (page_count(page) == 2);
-		/* Only cache user (+us), or swap space full? Free it! */
+		/* Not mapped elsewhere, or swap space full? Free it! */
 		/* Also recheck PageSwapCache after page is locked (above) */
 		if (PageSwapCache(page) && !PageWriteback(page) &&
-					(one_user || vm_swap_full())) {
+					(!page_mapped(page) || vm_swap_full())) {
 			delete_from_swap_cache(page);
 			SetPageDirty(page);
 		}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org