From: KOSAKI Motohiro
Subject: Re: [RFC -v2 PATCH -mm] change anon_vma linking to fix multi-process server scalability issue
In-Reply-To: <4B57E442.5060700@redhat.com>
References: <20100121133448.73BD.A69D9226@jp.fujitsu.com> <4B57E442.5060700@redhat.com>
Message-Id: <20100122135809.6C11.A69D9226@jp.fujitsu.com>
Date: Fri, 22 Jan 2010 15:57:52 +0900 (JST)
To: Rik van Riel
Cc: kosaki.motohiro@jp.fujitsu.com, linux-mm@kvack.org, linux-kernel@kvack.org, Lee Schermerhorn, minchan.kim@gmail.com, lwoodman@redhat.com, aarcange@redhat.com

> On 01/21/2010 12:05 AM, KOSAKI Motohiro wrote:
>
> >> In a workload with 1000 child processes and a VMA with 1000 anonymous
> >> pages per process that get COWed, this leads to a system with a million
> >> anonymous pages in the same anon_vma, each of which is mapped in just
> >> one of the 1000 processes. However, the current rmap code needs to
> >> walk them all, leading to O(N) scanning complexity for each page.
>
> >> This reduces rmap scanning complexity to O(1) for the pages of
> >> the 1000 child processes, with O(N) complexity for at most 1/N
> >> pages in the system. This reduces the average scanning cost in
> >> heavily forking workloads from O(N) to 2.
>
> > I've only roughly reviewed this patch. So, perhaps I missed something.
> > My first impression is, this is slightly large but benefit is only affected
> > corner case.
>
> At the moment it mostly triggers with artificial workloads, but
> having 1000 client connections to eg. an Oracle database is not
> unheard of.
>
> The reason for wanting to fix the corner case is because it is
> so incredibly bad.
>
> > If my remember is correct, you said you expect Nick's fair rwlock + Larry's rw-anon-lock
> > makes good result at some week ago. Why do you make alternative patch?
> > such way made bad result? or this patch have alternative benefit?
>
> After looking at the complexity figures (above), I suspect that
> making a factor 5-10 speedup is not going to fix a factor 1000
> increased complexity.
>
> > This patch seems to increase fork overhead instead decreasing vmscan overhead.
> > I'm not sure it is good deal.
>
> My hope is that the overhead of adding a few small objects per VMA
> will be unnoticable, compared to the overhead of refcounting pages,
> handling page tables, etc.
>
> The code looks like it could be a lot of anon_vma_chains, but in
> practice the depth is limited because exec() wipes them all out.
> Most of the time we will have just 0, 1 or 2 anon_vmas attached to
> a VMA - one for the current process and one for the parent.
>
> > Hmm...
> > Why can't we convert read side anon-vma walk to rcu? It need rcu aware vma
> > free, but anon_vma is alredy freed by rcu.
>
> Changing the locking to RCU does not reduce the amount of work
> that needs to be done in page_referenced_anon. If we have 1000
> siblings with 1000 pages each, we still end up scanning all
> 1000 processes for each of those 1000 pages in the pageout code.
>
> Adding parallelism to that with better locking may speed it up
> by the number of CPUs at most, which really may not help much
> in these workloads.
>
> Today having 1000 client connections to a forking server is
> considered a lot, but I suspect it could be more common in a
> few years. I would like Linux to be ready for those kinds of
> workloads.

Thanks, I think I now understand your intention. This patch needs a
performance comparison against the simple rw-spinlock approach, but I
won't worry about that here; perhaps someone else can do it.

A rough review follows.

[Generally, this patch has too few locking-related comments. I am not
sure I have understood its locking rules correctly.]
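To make sure we are talking about the same walk: as I understand it, the
O(N) cost being removed is the per-page loop over every vma hanging off
the anon_vma. A simplified sketch of the current code (memcg filtering,
mapcount bookkeeping and the exact function signatures elided):

static int page_referenced_anon(struct page *page)
{
	int referenced = 0;
	struct anon_vma *anon_vma;
	struct vm_area_struct *vma;

	anon_vma = page_lock_anon_vma(page);
	if (!anon_vma)
		return 0;

	/*
	 * O(N): this visits every vma ever attached to the anon_vma,
	 * even though a COWed page is mapped by only one of them.
	 */
	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
		referenced += page_referenced_one(page, vma);

	page_unlock_anon_vma(anon_vma);
	return referenced;
}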
> @@ -516,7 +517,8 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
>  	/*
>  	 * cover the whole range: [new_start, old_end)
>  	 */
> -	vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL);
> +	if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
> +		return -ENOMEM;

shift_arg_pages() has two vma_adjust() calls. Why don't we need to
change both?

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 84a524a..44cfb13 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -167,7 +167,7 @@ struct vm_area_struct {
>  	 * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
>  	 * or brk vma (with NULL file) can only be in an anon_vma list.
>  	 */
> -	struct list_head anon_vma_node;	/* Serialized by anon_vma->lock */
> +	struct list_head anon_vma_chain; /* Serialized by anon_vma->lock */
>  	struct anon_vma *anon_vma;	/* Serialized by page_table_lock */

Is this comment really correct? For example, the following
vma->anon_vma_chain manipulation happens outside the anon_vma lock:

static void anon_vma_chain_link(struct vm_area_struct *vma,
				struct anon_vma_chain *avc,
				struct anon_vma *anon_vma)
{
	avc->vma = vma;
	avc->anon_vma = anon_vma;
	list_add(&avc->same_vma, &vma->anon_vma_chain);

	spin_lock(&anon_vma->lock);
	list_add_tail(&avc->same_anon_vma, &anon_vma->head);
	spin_unlock(&anon_vma->lock);
}

I guess you intended to write /* locked by mmap_sem & friends */.
Note, however, that I don't think "& friends" makes a good comment ;-)

>  	/* Function pointers to deal with this struct. */
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b019ae6..0d1903a 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -37,7 +37,27 @@ struct anon_vma {
>  	 * is serialized by a system wide lock only visible to
>  	 * mm_take_all_locks() (mm_all_locks_mutex).
>  	 */
> -	struct list_head head;	/* List of private "related" vmas */
> +	struct list_head head;	/* Chain of private "related" vmas */
> +};

Hmm, this comment seems unclear. This list head no longer links
struct vm_area_struct; it links struct anon_vma_chain. So "related
vmas" is not a helpful description.

> +
> +/*
> + * The copy-on-write semantics of fork mean that an anon_vma
> + * can become associated with multiple processes. Furthermore,
> + * each child process will have its own anon_vma, where new
> + * pages for that process are instantiated.
> + *
> + * This structure allows us to find the anon_vmas associated
> + * with a VMA, or the VMAs associated with an anon_vma.
> + * The "same_vma" list contains the anon_vma_chains linking
> + * all the anon_vmas associated with this VMA.
> + * The "same_anon_vma" list contains the anon_vma_chains
> + * which link all the VMAs associated with this anon_vma.
> + */
> +struct anon_vma_chain {
> +	struct vm_area_struct *vma;
> +	struct anon_vma *anon_vma;
> +	struct list_head same_vma;	/* locked by mmap_sem & friends */
> +	struct list_head same_anon_vma;	/* locked by anon_vma->lock */
> };

This place probably needs many more comments. struct anon_vma_chain
creates a very complex relationship graph; an example or some good
ASCII art is needed. In particular, after fork() the parent and child
have different anon_vma_chains, which seems tricky.
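Something like this, perhaps (my own rough sketch of the state after
one fork(), based on my reading of this patch; the names are only
illustrative):

/*
 * After fork(), for one COW-shared anonymous mapping:
 *
 *	parent vma P			child vma C
 *	P->anon_vma_chain = { avc0 }	C->anon_vma_chain = { avc1, avc2 }
 *	P->anon_vma = A			C->anon_vma = B
 *
 *	A->head = { avc0, avc1 }	same_anon_vma: reaches both P and C
 *	B->head = { avc2 }		same_anon_vma: reaches only C
 *
 * Pages faulted in before the fork remain on anon_vma A, so rmap still
 * walks both P and C for them. New pages the child instantiates use
 * anon_vma B, so rmap walks only C; that is where the O(1) behaviour
 * comes from.
 */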
> +static inline void anon_vma_merge(struct vm_area_struct *vma,
> +				  struct vm_area_struct *next)
> +{
> +	BUG_ON(vma->anon_vma != next->anon_vma);
> +	unlink_anon_vmas(next);
> +}
> +

Probably VM_BUG_ON() is enough here?

> @@ -792,11 +809,13 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>  				is_mergeable_anon_vma(prev->anon_vma,
>  						      next->anon_vma)) {
>  							/* cases 1, 6 */
> -			vma_adjust(prev, prev->vm_start,
> +			err = vma_adjust(prev, prev->vm_start,
>  				next->vm_end, prev->vm_pgoff, NULL);
>  		} else					/* cases 2, 5, 7 */
> -			vma_adjust(prev, prev->vm_start,
> +			err = vma_adjust(prev, prev->vm_start,
>  				end, prev->vm_pgoff, NULL);
> +		if (err)
> +			return NULL;
>  		return prev;
>  	}

Currently, the callers of vma_merge() assume it cannot fail; IOW, they
do not treat a NULL return as a failure. We probably need to change all
the callers too.

> @@ -2454,7 +2506,8 @@ int mm_take_all_locks(struct mm_struct *mm)
>  		if (signal_pending(current))
>  			goto out_unlock;
>  		if (vma->anon_vma)
> -			vm_lock_anon_vma(mm, vma->anon_vma);
> +			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
> +				vm_lock_anon_vma(mm, avc->anon_vma);
>  	}

This function is not protected by mmap_sem, but if your comment is
correct, iterating over anon_vma_chain->same_vma requires mmap_sem.

> @@ -2509,13 +2562,15 @@ static void vm_unlock_mapping(struct address_space *mapping)
>  void mm_drop_all_locks(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
> +	struct anon_vma_chain *avc;
>
>  	BUG_ON(down_read_trylock(&mm->mmap_sem));
>  	BUG_ON(!mutex_is_locked(&mm_all_locks_mutex));
>
>  	for (vma = mm->mmap; vma; vma = vma->vm_next) {
>  		if (vma->anon_vma)
> -			vm_unlock_anon_vma(vma->anon_vma);
> +			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
> +				vm_unlock_anon_vma(avc->anon_vma);
>  		if (vma->vm_file && vma->vm_file->f_mapping)
>  			vm_unlock_mapping(vma->vm_file->f_mapping);
>  	}

Ditto.

> @@ -188,10 +276,21 @@ static void anon_vma_ctor(void *data)
>  	INIT_LIST_HEAD(&anon_vma->head);
>  }
>
>  void __init anon_vma_init(void)
>  {
>  	anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
>  			0, SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_ctor);
> +	anon_vma_chain_cachep = kmem_cache_create("anon_vma_chain",
> +			sizeof(struct anon_vma_chain), 0,
> +			SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_chain_ctor);
>  }

Why do we need SLAB_DESTROY_BY_RCU here? The anon_vma cache requires it
for page migration (oops, that should be commented, I think), but which
code requires it for anon_vma_chain?
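For reference, the existing requirement on the anon_vma cache comes
from the lockless lookup pattern, roughly (simplified from the current
page_lock_anon_vma()):

struct anon_vma *page_lock_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma;
	unsigned long anon_mapping;

	rcu_read_lock();
	anon_mapping = (unsigned long)page->mapping;
	if (!(anon_mapping & PAGE_MAPPING_ANON))
		goto out;
	if (!page_mapped(page))
		goto out;

	/*
	 * SLAB_DESTROY_BY_RCU guarantees only that this memory is
	 * still *an* anon_vma (possibly one already recycled for
	 * another mapping), not arbitrary reused memory, so taking
	 * the lock here is safe.
	 */
	anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
	spin_lock(&anon_vma->lock);
	return anon_vma;
out:
	rcu_read_unlock();
	return NULL;
}

As far as I can see, nothing in this patch dereferences a struct
anon_vma_chain under only rcu_read_lock() in this way.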
>  /*
> @@ -240,6 +339,14 @@ vma_address(struct page *page, struct vm_area_struct *vma)
>  		/* page should be within @vma mapping range */
>  		return -EFAULT;
>  	}
> +	if (unlikely(vma->vm_flags & VM_LOCK_RMAP))
> +		/*
> +		 * This VMA is being unlinked or not yet linked into the
> +		 * VMA tree. Do not try to follow this rmap. This race
> +		 * condition can result in page_referenced ignoring a
> +		 * reference or try_to_unmap failing to unmap a page.
> +		 */
> +		return -EFAULT;
>  	return address;
>  }

Here the task holds anon_vma->lock but does not hold mmap_sem, while
the place that changes VM_LOCK_RMAP (i.e. vma_adjust()) is protected by
mmap_sem. IOW, "if (vma->vm_flags & VM_LOCK_RMAP)" reads an unstable
value. How can we use an unstable value as a "lock"?
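To illustrate (a sketch of the interleaving I am worried about, not
code from the patch):

/*
 *	CPU 0 (rmap side,			CPU 1 (holds mmap_sem
 *	 holds anon_vma->lock)			 for write)
 *
 *	if (vma->vm_flags & VM_LOCK_RMAP)	vma->vm_flags |= VM_LOCK_RMAP;
 *		return -EFAULT;			... unlink the vma ...
 *
 * Nothing orders CPU 0's read against CPU 1's update unless
 * vma_adjust() also changes the flag while holding anon_vma->lock.
 */

If vma_adjust() does set and clear VM_LOCK_RMAP under anon_vma->lock,
that would answer my question, but I could not find a comment saying so.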