From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 26 Oct 2006 14:13:36 +1000
From: 'David Gibson'
Subject: Re: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
Message-ID: <20061026041336.GD6046@localhost.localdomain>
References: <20061025100929.GA11040@localhost.localdomain>
	<000001c6f8b3$1d4bd020$a389030a@amr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <000001c6f8b3$1d4bd020$a389030a@amr.corp.intel.com>
Sender: owner-linux-mm@kvack.org
Return-Path: 
To: "Chen, Kenneth W"
Cc: Hugh Dickins , Andrew Morton , Bill Irwin , Adam Litke , linux-mm@kvack.org
List-ID: 

On Wed, Oct 25, 2006 at 08:59:18PM -0700, Chen, Kenneth W wrote:
> David Gibson wrote on Wednesday, October 25, 2006 3:09 AM
> > > And almost(?) all the backtracking could be taken out if i_mutex
> > > were held; hugetlbfs_file_mmap is already taking i_mutex within
> > > mmap_sem (contrary to usual mm lock ordering, but probably okay
> > > since hugetlbfs has no read/write, though lockdep may need teaching).
> > > Though serializing these faults at all is regrettable.
> >
> > Um, yes.  Especially when I was in the middle of attempting to
> > de-serialize it.  Christoph Lameter has userspace stuff to do hugepage
> > initialization (clearing mostly), in parallel, which obviously won't
> > work with the serialization.  I have a tentative patch to address it,
>
> I used to argue dearly on how important it is to allow parallel hugetlb
> faults for scalability, but somehow lost my ground in the midst of a
> flurry of development.  Glad to see it is coming back.

Ok.  Fwiw, my experimental patch for this, which I wrote in response
to Christoph Lameter's concerns, is below.  It will need revision in
the face of whatever fix Hugh comes up with for the i_size problems.

> > which replaces the hugetlb_instantiation_mutex with a table of
> > mutexes, hashed on address_space and offset (or struct mm and address
> > for MAP_PRIVATE).  Originally I tried to simply remove the mutex, and
> > just retry faults when we got an OOM but a race was detected.  After
> > several variants each on 2 or 3 basic approaches, each of which turned
> > out to be less race-free than I originally thought, I gave up and went
> > with the hashed mutexes.  Either way though, there will still be
> > i_size issues to sort out.
>
> We are trying to do too much in the fault path.  One wild idea would be
> to zero out the page in free_huge_page().  E.g. all "free" pages sitting
> in the pool are ready to be handed out.  In the free_huge_page() path, we
> have a lot more freedom and can scale better because there is no owner
> yet, nor any tricky races to worry about.  It will make the fault path
> faster.

Hrm.. that's an interesting idea.  One of the first things I attempted
with this deserialization work was just to shrink the serialized area,
moving the hugepage clear out of the lock.  This turned out to be
impossible (or at least, much harder than I initially thought) due to
lock-ordering and various other constraints.  Doing the clear in
free_huge_page() addresses that quite neatly.

On the other hand, it won't help at all for COW faults - we still have
to copy a huge page.  And in addition, we will already have spent the
time to clear a page which we're just going to overwrite with a copy.
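To make that concrete, here is a rough, untested sketch of what
clearing in the free path might look like, against roughly 2.6.19-era
mm/hugetlb.c.  Purely illustrative, and separate from the patch below;
the surrounding function body is from memory and may not match the
current tree exactly:

	static void free_huge_page(struct page *page)
	{
		BUG_ON(page_count(page));

		INIT_LIST_HEAD(&page->lru);

		/*
		 * Pre-zero the page while it has no owner, so the fault
		 * path can hand it out without clearing it under the
		 * instantiation lock.  Passing 0 as the address is a
		 * simplification: clear_huge_page() normally gets the
		 * faulting virtual address for cache-aliasing reasons,
		 * which we don't have here.
		 */
		clear_huge_page(page, 0);

		spin_lock(&hugetlb_lock);
		enqueue_huge_page(page);
		spin_unlock(&hugetlb_lock);
	}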
hugepage: Allow parallelization of the hugepage fault path

At present, the page fault path for hugepages is serialized by a
single mutex.  This is used to avoid spurious out-of-memory conditions
when the hugepage pool is fully utilized (two processes or threads can
race to instantiate the same mapping with the last hugepage from the
pool, the race loser returning VM_FAULT_OOM).  This problem is
specific to hugepages, because it is normal to want to use every
single hugepage in the system - with normal pages we simply assume
there will always be a few spare pages which can be used temporarily
until the race is resolved.

Unfortunately this serialization also means that clearing of hugepages
cannot be parallelized across multiple CPUs, which can lead to very
long process startup times when using large numbers of hugepages.

This patch improves the situation by replacing the single mutex with a
table of mutexes, selected based on a hash of the address_space and
file offset being faulted (or mm and virtual address for MAP_PRIVATE
mappings).
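The selection is stable by construction: faults on the same logical
page always pick the same mutex, while faults on different pages are
usually spread across different mutexes.  A hash collision between two
unrelated pages only costs some unnecessary serialization, not
correctness.  Condensed, the selection amounts to something like the
following sketch, using the names from the patch below:

	if (vma->vm_flags & VM_SHARED)
		/* shared: key on the address_space and the huge page index in the file */
		hash = ((unsigned long)mapping + idx) % num_fault_mutexes;
	else
		/* private: key on the mm and the huge page index of the faulting address */
		hash = ((unsigned long)mm + (address >> HPAGE_SHIFT)) % num_fault_mutexes;

	mutex_lock(&htlb_fault_mutex_table[hash]);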
Index: working-2.6/mm/hugetlb.c
===================================================================
--- working-2.6.orig/mm/hugetlb.c	2006-10-23 13:50:31.000000000 +1000
+++ working-2.6/mm/hugetlb.c	2006-10-23 13:58:54.000000000 +1000
@@ -32,6 +32,13 @@ static unsigned int free_huge_pages_node
  */
 static DEFINE_SPINLOCK(hugetlb_lock);
 
+/*
+ * Serializes faults on the same logical page.  This is used to
+ * prevent spurious OOMs when the hugepage pool is fully utilized.
+ */
+static int num_fault_mutexes;
+static struct mutex *htlb_fault_mutex_table;
+
 static void clear_huge_page(struct page *page, unsigned long addr)
 {
 	int i;
@@ -160,6 +167,13 @@ static int __init hugetlb_init(void)
 	}
 	max_huge_pages = free_huge_pages = nr_huge_pages = i;
 	printk("Total HugeTLB memory allocated, %ld\n", free_huge_pages);
+
+	num_fault_mutexes = 2 * num_possible_cpus() - 1;
+	htlb_fault_mutex_table =
+		kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
+	for (i = 0; i < num_fault_mutexes; i++)
+		mutex_init(&htlb_fault_mutex_table[i]);
+
 	return 0;
 }
 module_init(hugetlb_init);
@@ -458,19 +472,14 @@ static int hugetlb_cow(struct mm_struct 
 }
 
 int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		    unsigned long address, pte_t *ptep, int write_access)
+		    struct address_space *mapping, unsigned long idx,
+		    unsigned long address, pte_t *ptep, int write_access)
 {
 	int ret = VM_FAULT_SIGBUS;
-	unsigned long idx;
 	unsigned long size;
 	struct page *page;
-	struct address_space *mapping;
 	pte_t new_pte;
 
-	mapping = vma->vm_file->f_mapping;
-	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
-		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
-
 	/*
 	 * Use page lock to guard against racing truncation
 	 * before we get page_table_lock.
@@ -535,28 +544,50 @@ backout:
 	goto out;
 }
 
+static int fault_mutex_hash(struct mm_struct *mm, struct vm_area_struct *vma,
+			    struct address_space *mapping,
+			    unsigned long pagenum, unsigned long address)
+{
+	void *p = mapping;
+
+	if (!(vma->vm_flags & VM_SHARED)) {
+		p = mm;
+		pagenum = address >> HPAGE_SHIFT;
+	}
+
+	return ((unsigned long)p + pagenum) % num_fault_mutexes;
+}
+
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		  unsigned long address, int write_access)
 {
 	pte_t *ptep;
 	pte_t entry;
 	int ret;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+	struct address_space *mapping;
+	unsigned long idx;
+	int hash;
 
 	ptep = huge_pte_alloc(mm, address);
 	if (!ptep)
 		return VM_FAULT_OOM;
 
+	mapping = vma->vm_file->f_mapping;
+	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
	 * the same page in the page cache.
 	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
+	hash = fault_mutex_hash(mm, vma, mapping, idx, address);
+	mutex_lock(&htlb_fault_mutex_table[hash]);
 	entry = *ptep;
 	if (pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
-		mutex_unlock(&hugetlb_instantiation_mutex);
+		ret = hugetlb_no_page(mm, vma, mapping, idx,
+				      address, ptep, write_access);
+		mutex_unlock(&htlb_fault_mutex_table[hash]);
 		return ret;
 	}
 
@@ -568,7 +599,7 @@ int hugetlb_fault(struct mm_struct *mm, 
 	if (write_access && !pte_write(entry))
 		ret = hugetlb_cow(mm, vma, address, ptep, entry);
 	spin_unlock(&mm->page_table_lock);
-	mutex_unlock(&hugetlb_instantiation_mutex);
+	mutex_unlock(&htlb_fault_mutex_table[hash]);
 
 	return ret;
 }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org