From: Greg Thelen
Date: Wed, 16 Mar 2011 21:41:48 -0700
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting
To: Johannes Weiner, Jan Kara
Cc: Vivek Goyal, Andrew Morton, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, containers@lists.osdl.org,
    linux-fsdevel@vger.kernel.org, Andrea Righi, Balbir Singh,
    KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
    David Rientjes, Wu Fengguang, Chad Talbott, Justin TerAvest,
    Curt Wohlgemuth

On Wed, Mar 16, 2011 at 2:52 PM, Johannes Weiner wrote:
> On Wed, Mar 16, 2011 at 02:19:26PM -0700, Greg Thelen wrote:
>> On Wed, Mar 16, 2011 at 6:13 AM, Johannes Weiner wrote:
>> > On Tue, Mar 15, 2011 at 02:48:39PM -0400, Vivek Goyal wrote:
>> >> I think even for background writeback we shall have to implement
>> >> some kind of logic where inodes are selected by traversing the
>> >> memcg->lru list, so that for background write we don't end up
>> >> writing too many inodes from other groups in an attempt to meet
>> >> the low background ratio of the memcg.
>> >>
>> >> So to me it boils down to coming up with a new inode selection
>> >> logic for memcg which can be used both for background as well as
>> >> foreground writes. This will make sure we don't end up writing
>> >> pages from the inodes we don't want to.
>> >
>> > Originally, for struct page_cgroup reduction, I had the idea of
>> > introducing something like
>> >
>> >	struct memcg_mapping {
>> >		struct address_space *mapping;
>> >		struct mem_cgroup *memcg;
>> >	};
>> >
>> > hanging off page->mapping to make the memcg association no longer
>> > per-page and save the pc->memcg linkage (it's not completely
>> > per-inode either; multiple memcgs can still refer to a single
>> > inode).
>> >
>> > We could put these descriptors on a per-memcg list and write inodes
>> > from this list during memcg writeback.
>> >
>> > We would have the option of extending this structure to contain
>> > hints as to which subrange of the inode is actually owned by the
>> > cgroup, to further narrow writeback to the right pages - iff shared
>> > big files become a problem.
>> >
>> > Does that sound feasible?
>>
>> If I understand your memcg_mapping proposal, then each inode could
>> have a collection of memcg_mapping objects representing the set of
>> memcgs that were charged for caching pages of the inode's data. When
>> a new file page is charged to a memcg, the inode's set of
>> memcg_mappings would be scanned to determine if current's memcg is
>> already in the set. If this is the first page for the memcg within
>> the inode, then a new memcg_mapping would be allocated and attached
>> to the inode. The memcg_mapping may be reference counted and would
>> be deleted when the last inode page for a particular memcg is
>> uncharged.
>
> Dead-on. Well, on which side you put the list - a per-memcg list of
> inodes, or a per-inode list of memcgs - really depends on which way
> you want to do the lookups. But this is the idea, yes.
>
>>   page->mapping = &memcg_mapping
>>   inode->i_mapping = collection of memcg_mappings, grows/shrinks with [un]charge
>
> If the memcg_mapping list (or hash table for quick find-or-create?)
> were to be on the inode side, I'd put it in struct address_space,
> since this is all about page cache, not so much an fs thing.
>
> Still, correct in general.

In '[PATCH v6 8/9] memcg: check memcg dirty limits in page writeback'
Jan and Vivek have had some discussion around how memcg and writeback
mesh. In my mind, the discussions in 8/9 are starting to blend with
this thread.

I have been thinking about Johannes' struct memcg_mapping. I think the
idea may address several of the issues being discussed, especially the
interaction between IO-less balance_dirty_pages() and memcg writeback.

Here is my thinking; feedback is most welcome. Rough C sketches of the
pieces described below follow at the end of this mail.

The data structures:
- struct memcg_mapping {
	struct address_space *mapping;
	struct mem_cgroup *memcg;
	int refcnt;
  };
- Each memcg contains a mapping (radix tree, hash table, etc.) from
  bdi to memcg_bdi.
- Each memcg_bdi contains a mapping from inode to memcg_mapping. This
  may be a very large set representing many cached inodes.
- Each memcg_mapping represents all pages within a (bdi, inode, memcg)
  triple. All corresponding cached inode pages point to the same
  memcg_mapping via pc->mapping. I assume that all pages of an inode
  belong to no more than one bdi.
- Manage a global list of memcgs that are over their respective
  background dirty limit.
- i_mapping continues to point to a traditional non-memcg mapping (no
  change here).
- None of these memcg_* structures affect the root cgroup or kernels
  with memcg configured out.

The routines under discussion:
- Charging a new inode page to a memcg: will use inode->mapping and
  the inode to walk memcg -> memcg_bdi -> memcg_mapping, lazily
  allocating missing levels in the data structure.
- Uncharging an inode page from a memcg: will use pc->mapping->memcg
  to locate the memcg. If the refcnt drops to zero, then remove the
  memcg_mapping from the memcg_bdi; also delete the memcg_bdi if its
  last memcg_mapping is removed.
- account_page_dirtied(): nothing new here; continue to set the
  per-page flags and increment the memcg per-cpu dirty page counter.
  The same goes for the routines that mark pages in the writeback and
  clean states.
- mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage is
  above the background limit, then add the memcg to the global
  memcg_over_bg_limit list and use the memcg's set of memcg_bdis to
  wake up each(?) corresponding bdi flusher. If over the foreground
  limit, then use IO-less style foreground throttling with the
  per-memcg, per-bdi (aka memcg_bdi) accounting structure.
- bdi writeback: will revert some of the mmotm memcg dirty limit
  changes to fs-writeback.c so that wb_do_writeback() returns to using
  wb_check_background_flush() to check background limits, and remains
  interruptible if a sync flush occurs. wb_check_background_flush()
  will check the global memcg_over_bg_limit list for memcgs that are
  over their dirty limit. wb_writeback() will either (I am not sure):
  a) scan the memcg's memcg_bdi set of inodes (only some of them are
     dirty), or
  b) scan the bdi dirty inode list (only some of them are in the
     memcg) using inode_in_memcg() to identify inodes to write.
     inode_in_memcg(inode, memcg) would walk memcg -> memcg_bdi ->
     memcg_mapping to determine if the memcg is caching pages from
     the inode.
- over_bground_thresh() will determine if the memcg is still over the
  bg limit. If over the limit, then per-bdi, per-memcg background
  flushing will continue; if not, the memcg will be removed from the
  memcg_over_bg_limit list.
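First, the data structures. This is an illustrative sketch only: the
extra fields, the plain lists standing in for the radix tree / hash
table mentioned above, and the locking are placeholders, not a settled
design.

/* All pages of one inode charged to one memcg on one bdi. */
struct memcg_mapping {
	struct address_space	*mapping;
	struct mem_cgroup	*memcg;
	struct memcg_bdi	*memcg_bdi;	/* backpointer for uncharge */
	struct list_head	node;		/* on memcg_bdi->mappings */
	atomic_t		refcnt;		/* one reference per charged page */
};

/* Per-(memcg, bdi) anchor for the inodes a memcg caches on one bdi. */
struct memcg_bdi {
	struct backing_dev_info	*bdi;
	struct list_head	mappings;	/* of memcg_mapping.node */
	struct list_head	node;		/* on the memcg's bdi list */
	spinlock_t		lock;
};

/*
 * struct mem_cgroup would grow something like:
 *	struct list_head bdis;		(of memcg_bdi.node)
 *	struct list_head over_bg_node;	(on memcg_over_bg_limit)
 */

/* Global list of memcgs currently over their background dirty limit. */
static LIST_HEAD(memcg_over_bg_limit);
static DEFINE_SPINLOCK(memcg_over_bg_limit_lock);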
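The charge and uncharge paths then become a find-or-create and a
refcount drop. Again a sketch: memcg_bdi_find_or_create() is a
hypothetical helper for the memcg's bdi -> memcg_bdi level, and
allocation failure handling plus the interaction with the existing
page_cgroup locking are elided.

static struct memcg_mapping *memcg_charge_mapping(struct mem_cgroup *memcg,
						  struct address_space *mapping)
{
	struct backing_dev_info *bdi = mapping->backing_dev_info;
	struct memcg_bdi *mb = memcg_bdi_find_or_create(memcg, bdi);
	struct memcg_mapping *mm;

	spin_lock(&mb->lock);
	list_for_each_entry(mm, &mb->mappings, node)
		if (mm->mapping == mapping)
			goto found;
	/* First page of this inode charged to this memcg. */
	mm = kzalloc(sizeof(*mm), GFP_ATOMIC);
	mm->mapping = mapping;
	mm->memcg = memcg;
	mm->memcg_bdi = mb;
	atomic_set(&mm->refcnt, 0);
	list_add(&mm->node, &mb->mappings);
found:
	atomic_inc(&mm->refcnt);
	spin_unlock(&mb->lock);
	return mm;			/* stored in pc->mapping */
}

static void memcg_uncharge_mapping(struct memcg_mapping *mm)
{
	struct memcg_bdi *mb = mm->memcg_bdi;

	spin_lock(&mb->lock);
	if (atomic_dec_and_test(&mm->refcnt)) {
		list_del(&mm->node);
		kfree(mm);
		/* ...and tear down @mb too if mb->mappings is now empty. */
	}
	spin_unlock(&mb->lock);
}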
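mem_cgroup_balance_dirty_pages() could then look roughly like the
following. memcg_dirty_pages(), memcg_bg_thresh(), memcg_fg_thresh(),
memcg_throttle_on_bdi() and for_each_memcg_bdi() are all assumed
names; bdi_start_background_writeback() is the existing flusher
wakeup.

void mem_cgroup_balance_dirty_pages(struct mem_cgroup *memcg,
				    struct backing_dev_info *bdi)
{
	unsigned long dirty = memcg_dirty_pages(memcg);

	if (dirty > memcg_fg_thresh(memcg)) {
		/*
		 * Over the foreground limit: IO-less throttling of the
		 * dirtier against this (memcg, bdi)'s writeback
		 * completions (stubbed here).
		 */
		memcg_throttle_on_bdi(memcg, bdi);
		return;
	}

	if (dirty > memcg_bg_thresh(memcg)) {
		struct memcg_bdi *mb;

		/* Publish the memcg on the global over-limit list... */
		spin_lock(&memcg_over_bg_limit_lock);
		if (list_empty(&memcg->over_bg_node))
			list_add_tail(&memcg->over_bg_node,
				      &memcg_over_bg_limit);
		spin_unlock(&memcg_over_bg_limit_lock);

		/* ...and kick the flusher of each bdi the memcg uses. */
		for_each_memcg_bdi(mb, memcg)
			bdi_start_background_writeback(mb->bdi);
	}
}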
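Finally, the termination check for per-memcg background flushing: keep
writing on behalf of a memcg until it drops back under its background
limit, then take it off the global list. Same assumed accessors as
above.

static bool memcg_over_bground_thresh(struct mem_cgroup *memcg)
{
	if (memcg_dirty_pages(memcg) > memcg_bg_thresh(memcg))
		return true;

	/* Back under the limit: leave the global over-limit list. */
	spin_lock(&memcg_over_bg_limit_lock);
	list_del_init(&memcg->over_bg_node);
	spin_unlock(&memcg_over_bg_limit_lock);
	return false;
}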