On Wednesday 10 October 2007 17:50, Ken Chen wrote:
> On 10/9/07, Ken Chen wrote:
> > That's what I figured. In that case, why don't we get rid of all the
> > spin locks in the fast path of follow_hugetlb_page?
> >
> > follow_hugetlb_page is called from get_user_pages, which should
> > already hold mm->mmap_sem in read mode. That means page table tear
> > down cannot happen. We do a racy read on the page table chain. If a
> > race happens with another thread, no big deal; it will just fall into
> > hugetlb_fault(), which will then serialize with
> > hugetlb_instantiation_mutex or mm->page_table_lock. And that's the
> > slow path anyway.
>
> Never mind. ftruncate can come through another path and remove the
> mapping without holding mm->mmap_sem. So much for the crazy idea.

Yeah, that's a killer...

Here is another crazy idea I've been mulling around. I was on the brink of
forgetting the whole thing until Suresh just now showed how much performance
there is to be had.

I don't suppose the mmap_sem avoidance from this patch matters so much if
your database isn't using threads. But at least it should be faster (unless
my crazy idea has some huge hole, and provided hugepages are implemented).

The basic idea is that architectures can override get_user_pages, or at
least provide a fast if not complete version of it, and fall back to the
regular get_user_pages when it encounters something difficult (eg. a
swapped out page). I *think* we can do this for x86-64 without taking
mmap_sem, or _any_ page table locks at all. Obviously the CPUs themselves
do a very similar lockless lookup for TLB fill.

[ We actually might even be able to go one better if we could have
  virt->phys instructions in the CPU that would look up and even fill the
  TLB for us. I don't know what the chances of that happening are,
  Suresh ;) ]

Attached is a really basic sketch of how it will work. Any party poopers
care to tell me why I'm an idiot? :)
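
To make the shape of the fast path concrete, a minimal sketch might look
something like the following. The function name fast_gup, the
local_irq_disable() trick for keeping page table pages from being freed
under us, and the fallback details are assumptions for illustration, not
the actual attachment:

/*
 * fast_gup(): hypothetical lockless get_user_pages() fast path for x86-64.
 * Walks the page tables with no mmap_sem and no page table locks, and
 * punts to the regular get_user_pages() on anything difficult (not
 * present, swapped out, write to a read-only pte, odd pmds, ...).
 */
#include <linux/mm.h>
#include <linux/sched.h>
#include <asm/pgtable.h>

int fast_gup(unsigned long start, int nr_pages, int write,
	     struct page **pages)
{
	struct mm_struct *mm = current->mm;
	unsigned long addr = start;
	int i;

	/*
	 * On x86-64, page table pages (and the pages they map) are only
	 * freed after the TLB flush IPI has been acknowledged by all CPUs.
	 * With interrupts disabled here, that IPI cannot complete, so
	 * nothing we walk below can be freed under us.
	 */
	local_irq_disable();
	for (i = 0; i < nr_pages; i++, addr += PAGE_SIZE) {
		pgd_t *pgd = pgd_offset(mm, addr);
		pud_t *pud;
		pmd_t *pmd;
		pte_t *ptep, pte;

		if (pgd_none(*pgd))
			goto slow;
		pud = pud_offset(pgd, addr);
		if (pud_none(*pud))
			goto slow;
		pmd = pmd_offset(pud, addr);
		if (pmd_none(*pmd) || pmd_bad(*pmd))	/* anything odd */
			goto slow;

		ptep = pte_offset_map(pmd, addr);
		pte = *ptep;
		pte_unmap(ptep);

		/* Not present, swapped out, or needs COW: too hard here. */
		if (!pte_present(pte) || (write && !pte_write(pte)))
			goto slow;

		pages[i] = pte_page(pte);
		get_page(pages[i]);
	}
	local_irq_enable();
	return i;

slow:
	local_irq_enable();

	/* Drop anything we pinned so far, then do it the slow way. */
	while (i--)
		put_page(pages[i]);

	down_read(&mm->mmap_sem);
	i = get_user_pages(current, mm, start, nr_pages, write, 0,
			   pages, NULL);
	up_read(&mm->mmap_sem);
	return i;
}

The point of the structure is that anything at all tricky just bails out to
the existing, fully locked path, so the only thing the lockless walk has to
get right is being safe against concurrent page table teardown.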