From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Sat, 9 Nov 2002 18:52:42 +0100 From: Ingo Oeser Subject: [RFC PATCH] Re: get_user_pages rewrite (completed, updated for 2.4.46) Message-ID: <20021109185242.T659@nightmaster.csn.tu-chemnitz.de> References: <20021107110840.P659@nightmaster.csn.tu-chemnitz.de> <3DCC3E38.29B0ABEF@digeo.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3DCC3E38.29B0ABEF@digeo.com>; from akpm@digeo.com on Fri, Nov 08, 2002 at 02:44:08PM -0800 Sender: owner-linux-mm@kvack.org Return-Path: To: Andrew Morton Cc: linux-mm@kvack.org List-ID: Hi Andrew, thanks for your review. On Fri, Nov 08, 2002 at 02:44:08PM -0800, Andrew Morton wrote: [custom_page_walker_t locking rules for vma->mm->page_table_lock] > This locking is rather awkward. Why is it necessary, and can it > be simplified?? No unlocking is needed in the fast and common cases. That shall reduce bus traffic. That locking is also needed for follow_page and will not be dropped, if the page is already faulted into the process space (should be the common case for get_user_pages). Under normal operation walk_user_pages is a loop of follow_page(), which need needs that lock. e.g the while statement in single_page_walk() will not go to the loop. The original implementation did NO proper cleanup, if the call spanned multiple VMAs. That's why I introduced the case IS_ERR(vma), where the vma->mm->page_table_lock cannot be unlocked, but cleanup can happen in case of wrong VMA and the walker having collected some pages already. We have two possibilities to simplify locking: 1) Explicit argument, whether the page_table_lock is taken. - Would simplify usage, but I know that this kind of functions where eliminated during the past, because Linus and some other people don't like that kind of magic. - Would remove the need to do that for huge tlb pages. - We must check for that flag and restore state at exit and the error path. Handling the error path is already complicated, but very visible (the IS_ERR is a good indicator even to the inexperienced reader). 2) Always be unlock before we enter the custom page walker. - Would cause lock/follow_page/unlock/lock/page_cache_get/unlock for EVERY page in the normal get_user_pages() case. 3) Always unlock if IS_ERR(page) would trigger. (Actually the IS_ERR(page) is triggered also, if IS_ERR(vma) is true). - Removes the unlocking completely from the custom page walker, if it doesn't need to do that anyway. - Is no real simplification, since the walker can be entered with locking or without, as it is now. - We still require locking for huge tlb pages, but Mr. Irwin already acked the changes for that. 4) Introduce an explicit "cleaning" function passed additionally to the walk_user_pages() function. - This would seperate the error handling completly from the normal case. - It would be possible to omit the error handling, if not needed. (Or to forget it, if needed later ;-/ ) - The page walker will ALWAYS be entered with the page_table_lock taken. - The cleanup handler will ALWAYS be entered without it and only the custom_data passed along. - Function enter/exit overhead is compiled twice, because we have two functions. - And we still require locking for huge tlb pages. Which one do you like most? I would favor 3. I've appended a patch for that against page-walk-api-2.5.46-mm1-all.patch.bz2 for you to test it. I agree that the locking rules are awkward, but they are the best solution I could come up with while preserving speed and functionality. Any better rules will be implemented at your request. > wrt the removal of the vmas arg to get_user_pages(): I assume this > was because none of the multipage callers were using it? Yes, thats true. If some caller needs this, it can use a custom walker. Single patch against 2.5.46-mm1 is at http://www.tu-chemnitz.de/~ioe/patches-page_walk/page-walk-api-2.5.46-mm1-all.patch.bz2 All patches with description and diffstat of the whole thing at: http://www.tu-chemnitz.de/~ioe/patches-page_walk/index.html Thanks again for your review, I really appriciate your input here. Regards Ingo Oeser diff -u linux-2.5.46-mm1-ioe/include/linux/mm.h linux-2.5.46-mm1-ioe/include/linux/mm.h --- linux-2.5.46-mm1-ioe/include/linux/mm.h Fri Nov 8 12:55:49 2002 +++ linux-2.5.46-mm1-ioe/include/linux/mm.h Sat Nov 9 18:02:56 2002 @@ -396,14 +396,15 @@ * If this functions gets a page, for which %IS_ERR(@page) is true, than it * should do it's cleanup of customdata and return -PTR_ERR(@page). * - * This function is called with @vma->vm_mm->page_table_lock held, - * if IS_ERR(@vma) is not true. + * If IS_ERR(@page) is NOT TRUE, this function is called with + * @vma->vm_mm->page_table_lock held. * - * But if IS_ERR(@vma) is true, IS_ERR(@page) is also true, since if we have no - * vma, then we also have no user space page. + * The value of @vma is undefined if IS_ERR(@page) is TRUE. + * (So never use or check it if IS_ERR(@page) is TRUE) * - * If it returns a negative value, then the page_table_lock must be dropped - * by this function, if it is held. + * If it returns a negative value but got a valid page, then the + * page_table_lock must be dropped by this function. (This condition should be + * rather rare.) */ typedef int (*custom_page_walker_t)(struct vm_area_struct *vma, struct page *page, unsigned long virt_addr, void *customdata); diff -u linux-2.5.46-mm1-ioe/mm/memory.c linux-2.5.46-mm1-ioe/mm/memory.c --- linux-2.5.46-mm1-ioe/mm/memory.c Fri Nov 8 12:55:49 2002 +++ linux-2.5.46-mm1-ioe/mm/memory.c Sat Nov 9 18:15:06 2002 @@ -1158,8 +1158,6 @@ struct gup_add_pages *gup = customdata; - BUG_ON(!customdata); - if (!IS_ERR(page)) { gup->pages[gup->count++] = page; flush_dcache_page(page); @@ -1170,8 +1168,6 @@ return (gup->count == gup->max_pages) ? 1 : 0; } - if (!IS_ERR(vma)) - spin_unlock(&vma->vm_mm->page_table_lock); gup_pages_cleanup(gup); return -PTR_ERR(page); } @@ -1192,7 +1188,6 @@ spin_unlock(&mm->page_table_lock); fault = handle_mm_fault(mm, vma, start, write); - spin_lock(&mm->page_table_lock); switch (fault) { case VM_FAULT_MINOR: @@ -1210,8 +1205,13 @@ spin_unlock(&mm->page_table_lock); BUG(); } + spin_lock(&mm->page_table_lock); } - return get_page_map(map); + map=get_page_map(map); + if (IS_ERR(map)) + spin_unlock(&mm->page_table_lock); + + return map; } /* VMA contains already "start". @@ -1248,10 +1248,14 @@ spin_lock(&mm->page_table_lock); page = single_page_walk(tsk, mm, vma, start, write); - if (!(IS_ERR(page) || PageReserved(page))) + if (IS_ERR(page)) + goto out; + + if (!PageReserved(page)) page_cache_get(page); spin_unlock(&mm->page_table_lock); +out: return page; } @@ -2101,8 +2105,6 @@ return (*todo) ? 0 : 1; } - if (!IS_ERR(vma)) - spin_unlock(&vma->vm_mm->page_table_lock); return -PTR_ERR(page); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/