From: Greg Thelen <gthelen@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jan Kara <jack@suse.cz>, Vivek Goyal <vgoyal@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
containers@lists.osdl.org, linux-fsdevel@vger.kernel.org,
Andrea Righi <arighi@develer.com>,
Balbir Singh <balbir@linux.vnet.ibm.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>,
Minchan Kim <minchan.kim@gmail.com>,
Ciju Rajan K <ciju@linux.vnet.ibm.com>,
David Rientjes <rientjes@google.com>,
Wu Fengguang <fengguang.wu@intel.com>,
Chad Talbott <ctalbott@google.com>,
Justin TerAvest <teravest@google.com>,
Curt Wohlgemuth <curtw@google.com>
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting
Date: Fri, 18 Mar 2011 00:57:09 -0700 [thread overview]
Message-ID: <AANLkTinPsfz1-2O9HNXE_ej-oUa+N5YOdN+cQQimOCBP@mail.gmail.com> (raw)
In-Reply-To: <20110317124350.GQ2140@cmpxchg.org>
On Thu, Mar 17, 2011 at 5:43 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Wed, Mar 16, 2011 at 09:41:48PM -0700, Greg Thelen wrote:
>> In '[PATCH v6 8/9] memcg: check memcg dirty limits in page writeback'
>> Jan and Vivek have had some discussion around how memcg and writeback
>> mesh.  In my mind, the discussions in 8/9 are starting to blend with
>> this thread.
>>
>> I have been thinking about Johannes' struct memcg_mapping.  I think
>> the idea may address several of the issues being discussed, especially
>> the interaction between IO-less balance_dirty_pages() and memcg
>> writeback.
>>
>> Here is my thinking.  Feedback is most welcome!
>>
>> The data structures:
>> - struct memcg_mapping {
>> struct address_space *mapping;
>> struct mem_cgroup *memcg;
>> int refcnt;
>> };
>> - each memcg contains a (radix, hash_table, etc.) mapping from bdi to memcg_bdi.
>> - each memcg_bdi contains a mapping from inode to memcg_mapping. This may be a
>> very large set representing many cached inodes.
>> - each memcg_mapping represents all pages within an bdi,inode,memcg. All
>> corresponding cached inode pages point to the same memcg_mapping via
>> pc->mapping. I assume that all pages of inode belong to no more than one bdi.
>> - manage a global list of memcg that are over their respective background dirty
>> limit.
>> - i_mapping continues to point to a traditional non-memcg mapping (no change
>> here).
>> - none of these memcg_* structures affect root cgroup or kernels with memcg
>> configured out.
>
> So structures roughly like this:
>
> struct mem_cgroup {
>         ...
>         /* key is struct backing_dev_info * */
>         struct rb_root memcg_bdis;
> };
>
> struct memcg_bdi {
>         /* key is struct address_space * */
>         struct rb_root memcg_mappings;
>         struct rb_node node;
> };
>
> struct memcg_mapping {
>         struct address_space *mapping;
>         struct mem_cgroup *memcg;
>         struct rb_node node;
>         atomic_t count;
> };
>
> struct page_cgroup {
>         ...
>         struct memcg_mapping *memcg_mapping;
> };
>
>> The routines under discussion:
>> - memcg charging a new inode page to a memcg: will use inode->mapping and inode
>> to walk memcg -> memcg_bdi -> memcg_mapping and lazily allocating missing
>> levels in data structure.
>>
>> - Uncharging a inode page from a memcg: will use pc->mapping->memcg to locate
>> memcg. If refcnt drops to zero, then remove memcg_mapping from the memcg_bdi.
>> Also delete memcg_bdi if last memcg_mapping is removed.
>>
>> - account_page_dirtied(): nothing new here, continue to set the per-page flags
>> and increment the memcg per-cpu dirty page counter. Same goes for routines
>> that mark pages in writeback and clean states.
>
> We may want to remember the dirty memcg_mappings so that on writeback
> we don't have to go through every single one that the memcg refers to?
I think this is a good idea: keep a per-memcg, per-bdi list of dirty
mappings.

It feels like some of this is starting to gel.  I've been sketching
some of the code to see how the memcg locking will work out.  The
basic structures I see are:
struct mem_cgroup {
        ...
        /*
         * For all file pages cached by this memcg, sorted by bdi.
         * key is struct backing_dev_info *; value is struct memcg_bdi *.
         * Protected by bdis_lock.
         */
        struct rb_root bdis;
        spinlock_t bdis_lock;   /* or use an rcu structure; the memcg:bdi
                                   set could be fairly static */
};

struct memcg_bdi {
        struct backing_dev_info *bdi;
        /*
         * key is struct address_space *; value is struct memcg_mapping *.
         * memcg_mappings live within either the mappings or the
         * dirty_mappings set, never both.
         */
        struct rb_root mappings;
        struct rb_root dirty_mappings;
        struct rb_node node;
        spinlock_t lock;        /* protects [dirty_]mappings */
};

struct memcg_mapping {
        struct address_space *mapping;
        struct memcg_bdi *memcg_bdi;
        struct rb_node node;
        atomic_t nr_pages;
        atomic_t nr_dirty;
};

struct page_cgroup {
        ...
        struct memcg_mapping *memcg_mapping;
};
- each memcg contains a mapping from bdi to memcg_bdi.
- each memcg_bdi contains two mappings:
    mappings: from address_space to memcg_mapping, for clean pages
    dirty_mappings: from address_space to memcg_mapping, when there are
      some dirty pages
- each memcg_mapping represents a set of cached pages within a
  (bdi, inode, memcg) tuple.  All corresponding cached inode pages point
  to the same memcg_mapping via pc->memcg_mapping.  I assume that all
  pages of an inode belong to no more than one bdi.
- manage a global list of memcgs that are over their respective
  background dirty limit.
- i_mapping continues to point to a traditional non-memcg mapping (no
  change here).
- none of these memcg_* structures affect the root cgroup or kernels
  with memcg configured out.
The routines under discussion:

- memcg charging a new inode page to a memcg: use inode->mapping and the
  inode to walk memcg -> memcg_bdi -> mappings, lazily allocating any
  missing levels of the data structure.

- Uncharging an inode page from a memcg: use pc->memcg_mapping (via its
  memcg_bdi) to locate the memcg.  If the refcount drops to zero, remove
  the memcg_mapping from memcg_bdi.[dirty_]mappings.  Also delete the
  memcg_bdi if its last memcg_mapping is removed.

- account_page_dirtied(): increment nr_dirty.  If this is the first
  dirty page, move the memcg_mapping from memcg_bdi.mappings to
  memcg_bdi.dirty_mappings.  When marking a page clean, do the opposite.

- mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage is above
  the background limit, add the memcg to the global memcg_over_bg_limit
  list and use the memcg's set of memcg_bdi to wake up each(?)
  corresponding bdi flusher.  If over the foreground limit, use IO-less
  style foreground throttling with the per-memcg per-bdi (aka memcg_bdi)
  accounting structure.

- bdi writeback: revert some of the mmotm memcg dirty limit changes to
  fs-writeback.c so that wb_do_writeback() returns to calling
  wb_check_background_flush() to check background limits, remaining
  interruptible if a sync flush occurs.  wb_check_background_flush()
  will check the global memcg_over_bg_limit list for memcgs that are
  over their dirty limit.  Within each memcg, write inodes from the
  dirty_mappings list until a threshold page count (MAX_WRITEBACK_PAGES)
  has been reached, then move to the next listed memcg.

- over_bground_thresh() will determine if the memcg is still over its
  background limit.  If so, per-bdi per-memcg background flushing
  continues; if not, the memcg is removed from the memcg_over_bg_limit
  list.
I'll post my resulting patches in RFC form, or (at the least) my conclusions.