From: Greg Thelen
Date: Fri, 18 Mar 2011 00:57:09 -0700
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting
To: Johannes Weiner
Cc: Jan Kara, Vivek Goyal, Andrew Morton, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, containers@lists.osdl.org,
    linux-fsdevel@vger.kernel.org, Andrea Righi, Balbir Singh,
    KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
    David Rientjes, Wu Fengguang, Chad Talbott, Justin TerAvest,
    Curt Wohlgemuth

On Thu, Mar 17, 2011 at 5:43 AM, Johannes Weiner wrote:
> On Wed, Mar 16, 2011 at 09:41:48PM -0700, Greg Thelen wrote:
>> In '[PATCH v6 8/9] memcg: check memcg dirty limits in page writeback' Jan
>> and Vivek have had some discussion around how memcg and writeback mesh.
>> In my mind, the discussions in 8/9 are starting to blend with this thread.
>>
>> I have been thinking about Johannes' struct memcg_mapping.  I think the
>> idea may address several of the issues being discussed, especially the
>> interaction between IO-less balance_dirty_pages() and memcg writeback.
>>
>> Here is my thinking.  Feedback is most welcome!
>>
>> The data structures:
>> - struct memcg_mapping {
>>         struct address_space *mapping;
>>         struct mem_cgroup *memcg;
>>         int refcnt;
>>   };
>> - each memcg contains a (radix tree, hash table, etc.) mapping from bdi to
>>   memcg_bdi.
>> - each memcg_bdi contains a mapping from inode to memcg_mapping.  This may
>>   be a very large set representing many cached inodes.
>> - each memcg_mapping represents all pages within a bdi,inode,memcg.  All
>>   corresponding cached inode pages point to the same memcg_mapping via
>>   pc->mapping.  I assume that all pages of an inode belong to no more than
>>   one bdi.
>> - manage a global list of memcgs that are over their respective background
>>   dirty limit.
>> - i_mapping continues to point to a traditional non-memcg mapping (no
>>   change here).
>> - none of these memcg_* structures affect the root cgroup or kernels with
>>   memcg configured out.
>
> So structures roughly like this:
>
> struct mem_cgroup {
>         ...
>         /* key is struct backing_dev_info * */
>         struct rb_root memcg_bdis;
> };
>
> struct memcg_bdi {
>         /* key is struct address_space * */
>         struct rb_root memcg_mappings;
>         struct rb_node node;
> };
>
> struct memcg_mapping {
>         struct address_space *mapping;
>         struct mem_cgroup *memcg;
>         struct rb_node node;
>         atomic_t count;
> };
>
> struct page_cgroup {
>         ...
>         struct memcg_mapping *memcg_mapping;
> };
>
>> The routines under discussion:
>> - memcg charging a new inode page to a memcg: will use inode->mapping and
>>   the inode to walk memcg -> memcg_bdi -> memcg_mapping, lazily allocating
>>   missing levels in the data structure.
>>
>> - Uncharging an inode page from a memcg: will use pc->mapping->memcg to
>>   locate the memcg.  If the refcnt drops to zero, then remove the
>>   memcg_mapping from the memcg_bdi.  Also delete the memcg_bdi if the last
>>   memcg_mapping is removed.
>>
>> - account_page_dirtied(): nothing new here, continue to set the per-page
>>   flags and increment the memcg per-cpu dirty page counter.  Same goes for
>>   routines that mark pages in writeback and clean states.
>
> We may want to remember the dirty memcg_mappings so that on writeback
> we don't have to go through every single one that the memcg refers to?

I think this is a good idea: allow a per-memcg, per-bdi list of dirty
mappings.

It feels like some of this is starting to gel.  I've been sketching some of
the code to see how the memcg locking will work out.  The basic structures
I see are:

struct mem_cgroup {
        ...
        /*
         * All file pages cached by this memcg, sorted by bdi.
         * key is struct backing_dev_info *; value is struct memcg_bdi *.
         * Protected by bdis_lock.
         */
        struct rb_root bdis;
        spinlock_t bdis_lock;   /* or use an RCU structure; the memcg:bdi
                                   set could be fairly static */
};

struct memcg_bdi {
        struct backing_dev_info *bdi;
        /*
         * key is struct address_space *; value is struct memcg_mapping *.
         * memcg_mappings live within either the mappings or the
         * dirty_mappings set.
         */
        struct rb_root mappings;
        struct rb_root dirty_mappings;
        struct rb_node node;
        spinlock_t lock;        /* protects [dirty_]mappings */
};

struct memcg_mapping {
        struct address_space *mapping;
        struct memcg_bdi *memcg_bdi;
        struct rb_node node;
        atomic_t nr_pages;
        atomic_t nr_dirty;
};

struct page_cgroup {
        ...
        struct memcg_mapping *memcg_mapping;
};

- each memcg contains a mapping from bdi to memcg_bdi.
- each memcg_bdi contains two mappings:
    mappings: from address_space to memcg_mapping for clean pages
    dirty_mappings: from address_space to memcg_mapping when there are some
      dirty pages
- each memcg_mapping represents a set of cached pages within a
  (bdi, inode, memcg) tuple.  All corresponding cached inode pages point to
  the same memcg_mapping via pc->memcg_mapping.  I assume that all pages of
  an inode belong to no more than one bdi.
- manage a global list of memcgs that are over their respective background
  dirty limit.
- i_mapping continues to point to a traditional non-memcg mapping (no change
  here).
- none of these memcg_* structures affect the root cgroup or kernels with
  memcg configured out.
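To make the locking a bit more concrete, the charge path's lookup-or-create
step could look something like the sketch below.  This is only a sketch:
memcg_bdi_find_or_create() is an invented name, GFP flags and failure
handling are hand-waved, and it just leans on the structs above plus the
stock rbtree/spinlock/slab helpers.

/*
 * Sketch: find (or lazily create) the memcg_bdi for @bdi in @memcg->bdis.
 * Invented helper name; not from any posted patch.
 */
static struct memcg_bdi *memcg_bdi_find_or_create(struct mem_cgroup *memcg,
                                                  struct backing_dev_info *bdi)
{
        struct rb_node **p, *parent = NULL;
        struct memcg_bdi *mb = NULL;
        /* preallocate outside the spinlock; GFP choice is hand-waved */
        struct memcg_bdi *new = kzalloc(sizeof(*new), GFP_NOWAIT);

        spin_lock(&memcg->bdis_lock);
        p = &memcg->bdis.rb_node;
        while (*p) {
                parent = *p;
                mb = rb_entry(parent, struct memcg_bdi, node);
                if ((unsigned long)bdi < (unsigned long)mb->bdi) {
                        p = &parent->rb_left;
                        mb = NULL;
                } else if ((unsigned long)bdi > (unsigned long)mb->bdi) {
                        p = &parent->rb_right;
                        mb = NULL;
                } else {
                        goto unlock;            /* already present */
                }
        }
        if (new) {
                new->bdi = bdi;
                new->mappings = RB_ROOT;
                new->dirty_mappings = RB_ROOT;
                spin_lock_init(&new->lock);
                rb_link_node(&new->node, parent, p);
                rb_insert_color(&new->node, &memcg->bdis);
                mb = new;
                new = NULL;
        }
unlock:
        spin_unlock(&memcg->bdis_lock);
        kfree(new);     /* unused preallocation; kfree(NULL) is a no-op */
        return mb;      /* NULL means the caller must retry or fall back */
}

The same pattern would repeat one level down on memcg_bdi->mappings (keyed
by struct address_space *) to produce the memcg_mapping that
pc->memcg_mapping ends up pointing at.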
The routines under discussion:
- memcg charging a new inode page to a memcg: will use inode->mapping and
  the inode to walk memcg -> memcg_bdi -> mappings and lazily allocate
  missing levels in the data structure.

- Uncharging an inode page from a memcg: will use pc->memcg_mapping to
  locate the memcg_bdi (and memcg).  If the refcount drops to zero, then
  remove the memcg_mapping from memcg_bdi.[dirty_]mappings.  Also delete the
  memcg_bdi if the last memcg_mapping is removed.

- account_page_dirtied(): increment nr_dirty.  If this is the first dirty
  page, then move the memcg_mapping from memcg_bdi.mappings to
  memcg_bdi.dirty_mappings.  When marking a page clean, do the opposite.

- mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage is above the
  background limit, then add the memcg to the global memcg_over_bg_limit
  list and use the memcg's set of memcg_bdi to wake up each(?) corresponding
  bdi flusher.  If over the foreground limit, then use IO-less style
  foreground throttling with the per-memcg per-bdi (aka memcg_bdi)
  accounting structure.

- bdi writeback: will revert some of the mmotm memcg dirty limit changes to
  fs-writeback.c so that wb_do_writeback() returns to calling
  wb_check_background_flush() to check background limits, and remains
  interruptible if a sync flush occurs.  wb_check_background_flush() will
  check the global memcg_over_bg_limit list for memcgs that are over their
  dirty limit.  Within each such memcg, write inodes from the dirty_mappings
  list until a threshold page count (MAX_WRITEBACK_PAGES) has been reached,
  then move on to the next listed memcg.  (A rough sketch of this loop is at
  the end of this mail.)

- over_bground_thresh() will determine if a memcg is still over its bg
  limit.  If still over the limit, per-bdi per-memcg background flushing
  will continue.  If not, the memcg will be removed from the
  memcg_over_bg_limit list.

I'll post my resulting patches in RFC form, or (at the least) my
conclusions.
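In the meantime, here is very roughly the shape of the background flush loop
mentioned above.  Again just a sketch, not from the patches: the list and
lock names, over_bg_limit_node, mem_cgroup_over_bg_thresh(),
memcg_bdi_lookup() and writeback_memcg_bdi_inodes() are all made-up names,
memcg reference counting is ignored, and the list handling has no fairness.

/*
 * Sketch of a per-memcg background flush pass, called from
 * wb_check_background_flush().  All names below are placeholders.
 */
static long wb_memcg_background_flush(struct bdi_writeback *wb)
{
        long total_written = 0;

        for (;;) {
                struct mem_cgroup *memcg;
                struct memcg_bdi *mb;
                long written;

                spin_lock(&memcg_over_bg_limit_lock);
                if (list_empty(&memcg_over_bg_limit_list)) {
                        spin_unlock(&memcg_over_bg_limit_lock);
                        break;
                }
                memcg = list_first_entry(&memcg_over_bg_limit_list,
                                         struct mem_cgroup,
                                         over_bg_limit_node);
                /* over_bground_thresh(): drop the memcg once it is back
                 * under its background dirty limit. */
                if (!mem_cgroup_over_bg_thresh(memcg)) {
                        list_del_init(&memcg->over_bg_limit_node);
                        spin_unlock(&memcg_over_bg_limit_lock);
                        continue;
                }
                spin_unlock(&memcg_over_bg_limit_lock);

                /* Only this bdi's slice of the memcg's dirty inodes. */
                mb = memcg_bdi_lookup(memcg, wb->bdi);
                if (!mb)
                        break;  /* a real version would rotate to the next memcg */

                /* Write up to MAX_WRITEBACK_PAGES from mb->dirty_mappings,
                 * then re-evaluate, so one memcg cannot hog the flusher. */
                written = writeback_memcg_bdi_inodes(wb, mb,
                                                     MAX_WRITEBACK_PAGES);
                if (!written)
                        break;  /* no progress on this bdi, bail out */
                total_written += written;
        }

        return total_written;
}

mem_cgroup_balance_dirty_pages() would be the producer side of this: it puts
the memcg on memcg_over_bg_limit_list and wakes the relevant bdi flushers.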