From: Greg Thelen <gthelen@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	containers@lists.osdl.org, linux-fsdevel@vger.kernel.org,
	Andrea Righi <arighi@develer.com>,
	Balbir Singh <balbir@linux.vnet.ibm.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>,
	Minchan Kim <minchan.kim@gmail.com>,
	Ciju Rajan K <ciju@linux.vnet.ibm.com>,
	David Rientjes <rientjes@google.com>,
	Wu Fengguang <fengguang.wu@intel.com>,
	Chad Talbott <ctalbott@google.com>,
	Justin TerAvest <teravest@google.com>,
	Curt Wohlgemuth <curtw@google.com>
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting
Date: Wed, 16 Mar 2011 21:41:48 -0700
Message-ID: <AANLkTinCErw+0QGpXJ4+JyZ1O96BC7SJAyXaP4t5v17c@mail.gmail.com>
In-Reply-To: <20110316215214.GO2140@cmpxchg.org>

On Wed, Mar 16, 2011 at 2:52 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Wed, Mar 16, 2011 at 02:19:26PM -0700, Greg Thelen wrote:
>> On Wed, Mar 16, 2011 at 6:13 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > On Tue, Mar 15, 2011 at 02:48:39PM -0400, Vivek Goyal wrote:
>> >> I think even for background we shall have to implement some kind of logic
>> >> where inodes are selected by traversing memcg->lru list so that for
>> >> background write we don't end up writing too many inodes from other
>> >> root group in an attempt to meet the low background ratio of memcg.
>> >>
>> >> So to me it boils down to coming up a new inode selection logic for
>> >> memcg which can be used both for background as well as foreground
>> >> writes. This will make sure we don't end up writing pages from the
>> >> inodes we don't want to.
>> >
>> > Originally for struct page_cgroup reduction, I had the idea of
>> > introducing something like
>> >
>> >        struct memcg_mapping {
>> >                struct address_space *mapping;
>> >                struct mem_cgroup *memcg;
>> >        };
>> >
>> > hanging off page->mapping to make memcg association no longer per-page
>> > and save the pc->memcg linkage (it's not completely per-inode either,
>> > multiple memcgs can still refer to a single inode).
>> >
>> > We could put these descriptors on a per-memcg list and write inodes
>> > from this list during memcg-writeback.
>> >
>> > We would have the option of extending this structure to contain hints
>> > as to which subrange of the inode is actually owned by the cgroup, to
>> > further narrow writeback to the right pages - iff shared big files
>> > become a problem.
>> >
>> > Does that sound feasible?
>>
>> If I understand your memcg_mapping proposal, then each inode could
>> have a collection of memcg_mapping objects representing the set of
>> memcg that were charged for caching pages of the inode's data.  When a
>> new file page is charged to a memcg, then the inode's set of
>> memcg_mapping would be scanned to determine if current's memcg is
>> already in the memcg_mapping set.  If this is the first page for the
>> memcg within the inode, then a new memcg_mapping would be allocated
>> and attached to the inode.  The memcg_mapping may be reference counted
>> and would be deleted when the last inode page for a particular memcg
>> is uncharged.
>
> Dead-on.  Well, on which side you put the list - a per-memcg list of
> inodes, or a per-inode list of memcgs - really depends on which way
> you want to do the lookups.  But this is the idea, yes.
>
>>   page->mapping = &memcg_mapping
>>   inode->i_mapping = collection of memcg_mapping, grows/shrinks with [un]charge
>
> If the memcg_mapping list (or hash-table for quick find-or-create?)
> was to be on the inode side, I'd put it in struct address_space, since
> this is all about page cache, not so much an fs thing.
>
> Still, correct in general.
>

In '[PATCH v6 8/9] memcg: check memcg dirty limits in page writeback', Jan and
Vivek have had some discussion of how memcg and writeback mesh.  In my mind,
the discussion in 8/9 is starting to blend with this thread.

I have been thinking about Johannes' struct memcg_mapping.  I think the idea
may address several of the issues being discussed, especially the interaction
between IO-less balance_dirty_pages() and memcg writeback.

Here is my thinking.  Feedback is most welcome!

The data structures (a rough C sketch follows the list):
- struct memcg_mapping {
       struct address_space *mapping;
       struct mem_cgroup *memcg;
       int refcnt;
  };
- each memcg contains a (radix, hash_table, etc.) mapping from bdi to memcg_bdi.
- each memcg_bdi contains a mapping from inode to memcg_mapping.  This may be a
  very large set representing many cached inodes.
- each memcg_mapping represents all pages within a (bdi, inode, memcg) tuple.
  All corresponding cached inode pages point to the same memcg_mapping via
  pc->mapping.  I assume that all pages of an inode belong to no more than one
  bdi.
- manage a global list of memcg that are over their respective background dirty
  limit.
- i_mapping continues to point to a traditional non-memcg mapping (no change
  here).
- none of these memcg_* structures affect root cgroup or kernels with memcg
  configured out.
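
To make this concrete, here is a rough C sketch of the structures above.  All
names, fields, and locking are illustrative only; where the list says
"(radix, hash_table, etc.)" I just use plain lists for brevity:

    struct memcg_mapping {
            struct address_space *mapping;  /* the inode's page cache */
            struct mem_cgroup *memcg;       /* cgroup charged for these pages */
            struct memcg_bdi *memcg_bdi;    /* back-pointer for teardown */
            struct list_head node;          /* entry in memcg_bdi->mappings */
            atomic_t refcnt;                /* pages charged to this mapping */
    };

    /* per (memcg, bdi) pair: this memcg's cached inodes on one bdi */
    struct memcg_bdi {
            struct backing_dev_info *bdi;
            struct list_head mappings;      /* list of memcg_mapping */
            struct list_head node;          /* entry in mem_cgroup->bdis */
    };

    struct mem_cgroup {
            /* ... existing fields ... */
            struct list_head bdis;          /* a memcg_bdi per dirtied bdi */
            struct list_head bg_list;       /* on memcg_over_bg_limit if over */
    };

    /* global list of memcg over their background dirty limit */
    static LIST_HEAD(memcg_over_bg_limit);
    static DEFINE_SPINLOCK(memcg_over_bg_limit_lock);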

The routines under discussion:
- Charging a new inode page to a memcg: will use the inode's mapping and bdi
  to walk memcg -> memcg_bdi -> memcg_mapping, lazily allocating missing
  levels in the data structure.
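
A sketch of that walk, assuming a hypothetical memcg_bdi_find_or_create()
helper and with locking elided:

    /* charge time: find or create the memcg_mapping for (memcg, inode) */
    static struct memcg_mapping *memcg_mapping_get(struct mem_cgroup *memcg,
                                                   struct address_space *mapping)
    {
            struct backing_dev_info *bdi = mapping->backing_dev_info;
            struct memcg_bdi *mb;
            struct memcg_mapping *mm;

            mb = memcg_bdi_find_or_create(memcg, bdi);      /* hypothetical */
            list_for_each_entry(mm, &mb->mappings, node) {
                    if (mm->mapping == mapping) {
                            atomic_inc(&mm->refcnt);        /* existing entry */
                            return mm;
                    }
            }
            mm = kzalloc(sizeof(*mm), GFP_NOFS);
            if (!mm)
                    return NULL;
            mm->mapping = mapping;
            mm->memcg = memcg;
            mm->memcg_bdi = mb;
            atomic_set(&mm->refcnt, 1);
            list_add(&mm->node, &mb->mappings);
            return mm;
    }

The newly charged page's pc->mapping would then point at the returned
memcg_mapping.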

- Uncharging an inode page from a memcg: will use pc->mapping->memcg to locate
  the memcg.  If the refcnt drops to zero, then remove the memcg_mapping from
  the memcg_bdi; also delete the memcg_bdi if its last memcg_mapping was
  removed.
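
Roughly, again with locking elided:

    /* uncharge time: drop one page reference from pc->mapping */
    static void memcg_mapping_put(struct memcg_mapping *mm)
    {
            struct memcg_bdi *mb = mm->memcg_bdi;

            if (!atomic_dec_and_test(&mm->refcnt))
                    return;
            list_del(&mm->node);            /* last page gone: unlink inode */
            kfree(mm);
            if (list_empty(&mb->mappings)) {
                    list_del(&mb->node);    /* last inode on this bdi */
                    kfree(mb);
            }
    }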

- account_page_dirtied(): nothing new here, continue to set the per-page flags
  and increment the memcg per-cpu dirty page counter.  Same goes for routines
  that mark pages in writeback and clean states.
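
For reference, the hook would sit roughly where this series already puts it;
this sketch is based on the current account_page_dirtied(), and the memcg call
name is only indicative of the earlier patches in the series:

    void account_page_dirtied(struct page *page, struct address_space *mapping)
    {
            if (mapping_cap_account_dirty(mapping)) {
                    __inc_zone_page_state(page, NR_FILE_DIRTY);
                    __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
                    /* per-memcg per-cpu dirty page counter */
                    mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
                    task_io_account_write(PAGE_CACHE_SIZE);
            }
    }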

- mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage is above the
  background limit, then add the memcg to the global memcg_over_bg_limit list
  and use the memcg's set of memcg_bdi to wake up each(?) corresponding bdi
  flusher.  If over the foreground limit, then use IO-less style foreground
  throttling with the per-memcg per-bdi (aka memcg_bdi) accounting structure.
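
The background half might look like this (mem_cgroup_over_bg_dirty_thresh()
is a hypothetical helper; the foreground path and RCU locking around the
memcg lookup are elided):

    void mem_cgroup_balance_dirty_pages(struct address_space *mapping)
    {
            struct mem_cgroup *memcg = mem_cgroup_from_task(current);
            struct memcg_bdi *mb;

            if (!mem_cgroup_over_bg_dirty_thresh(memcg))
                    return;

            spin_lock(&memcg_over_bg_limit_lock);
            if (list_empty(&memcg->bg_list))        /* not yet on the list */
                    list_add_tail(&memcg->bg_list, &memcg_over_bg_limit);
            spin_unlock(&memcg_over_bg_limit_lock);

            /* wake the flusher of each bdi this memcg has dirtied */
            list_for_each_entry(mb, &memcg->bdis, node)
                    bdi_start_background_writeback(mb->bdi);
    }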

- bdi writeback: will revert some of the mmotm memcg dirty limit changes to
  fs-writeback.c so that wb_do_writeback() returns to calling
  wb_check_background_flush() to check background limits, and remains
  interruptible if a sync flush occurs.  wb_check_background_flush() will check
  the global memcg_over_bg_limit list for memcgs that are over their dirty
  limit.  wb_writeback() will either (I am not sure):
  a) scan the memcg's memcg_bdi list of inodes (only some of them are dirty), or
  b) scan the bdi dirty inode list (only some of them are in the memcg), using
     inode_in_memcg() to identify inodes to write.  inode_in_memcg(inode, memcg),
     sketched below, would walk memcg -> memcg_bdi -> memcg_mapping to determine
     if the memcg is caching pages from the inode.
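
A sketch of option (b)'s test, with memcg_bdi_find() as a hypothetical lookup
helper:

    /* does @memcg cache any pages of @inode? */
    static bool inode_in_memcg(struct inode *inode, struct mem_cgroup *memcg)
    {
            struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
            struct memcg_bdi *mb = memcg_bdi_find(memcg, bdi);
            struct memcg_mapping *mm;

            if (!mb)
                    return false;
            list_for_each_entry(mm, &mb->mappings, node)
                    if (mm->mapping == inode->i_mapping)
                            return true;
            return false;
    }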

- over_bground_thresh() will determine if the memcg is still over its bg
  limit.  If over the limit, then per-bdi per-memcg background flushing will
  continue; if not, the memcg will be removed from the memcg_over_bg_limit
  list.
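
Something like this (memcg_dirty_pages() and memcg_bg_dirty_thresh() are
hypothetical accessors):

    /* flusher side: keep writing while over; drop from list once under */
    static bool memcg_over_bground_thresh(struct mem_cgroup *memcg)
    {
            if (memcg_dirty_pages(memcg) > memcg_bg_dirty_thresh(memcg))
                    return true;

            spin_lock(&memcg_over_bg_limit_lock);
            list_del_init(&memcg->bg_list);
            spin_unlock(&memcg_over_bg_limit_lock);
            return false;
    }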



Thread overview: 61+ messages
2011-03-11 18:43 Greg Thelen
2011-03-11 18:43 ` [PATCH v6 1/9] memcg: document cgroup dirty memory interfaces Greg Thelen
2011-03-14 14:50   ` Minchan Kim
2011-03-11 18:43 ` [PATCH v6 2/9] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
2011-03-11 18:43 ` [PATCH v6 3/9] memcg: add dirty page accounting infrastructure Greg Thelen
2011-03-14 14:56   ` Minchan Kim
2011-03-11 18:43 ` [PATCH v6 4/9] memcg: add kernel calls for memcg dirty page stats Greg Thelen
2011-03-14 15:10   ` Minchan Kim
2011-03-15  6:32     ` Greg Thelen
2011-03-15 13:50       ` Ryusuke Konishi
2011-03-11 18:43 ` [PATCH v6 5/9] memcg: add dirty limits to mem_cgroup Greg Thelen
2011-03-11 18:43 ` [PATCH v6 6/9] memcg: add cgroupfs interface to memcg dirty limits Greg Thelen
2011-03-14 15:16   ` Minchan Kim
2011-03-15 14:01   ` Mike Heffner
2011-03-16  0:00     ` KAMEZAWA Hiroyuki
2011-03-16  0:50     ` Greg Thelen
2011-03-11 18:43 ` [PATCH v6 7/9] memcg: add dirty limiting routines Greg Thelen
2011-03-11 18:43 ` [PATCH v6 8/9] memcg: check memcg dirty limits in page writeback Greg Thelen
2011-03-14 17:54   ` Vivek Goyal
2011-03-14 17:59     ` Vivek Goyal
2011-03-14 21:10     ` Jan Kara
2011-03-15  3:27       ` Greg Thelen
2011-03-15 23:12         ` Jan Kara
2011-03-16  2:35           ` Greg Thelen
2011-03-16 12:35             ` Jan Kara
2011-03-16 18:07               ` Vivek Goyal
2011-03-15 16:20       ` Vivek Goyal
2011-03-11 18:43 ` [PATCH v6 9/9] memcg: make background writeback memcg aware Greg Thelen
2011-03-15 22:54   ` Vivek Goyal
2011-03-16  1:00     ` Greg Thelen
2011-03-12  1:10 ` [PATCH v6 0/9] memcg: per cgroup dirty page accounting Andrew Morton
2011-03-14 18:29   ` Greg Thelen
2011-03-14 20:23     ` Vivek Goyal
2011-03-15  2:41       ` Greg Thelen
2011-03-15 18:48         ` Vivek Goyal
2011-03-16 13:13           ` Johannes Weiner
2011-03-16 14:59             ` Vivek Goyal
2011-03-16 16:35               ` Johannes Weiner
2011-03-16 17:06                 ` Vivek Goyal
2011-03-16 21:19             ` Greg Thelen
2011-03-16 21:52               ` Johannes Weiner
2011-03-17  4:41                 ` Greg Thelen [this message]
2011-03-17 12:43                   ` Johannes Weiner
2011-03-17 14:49                     ` Vivek Goyal
2011-03-17 14:53                     ` Jan Kara
2011-03-17 15:42                       ` Curt Wohlgemuth
2011-03-18  7:57                     ` Greg Thelen
2011-03-18 14:50                       ` Vivek Goyal
2011-03-23  9:06                       ` KAMEZAWA Hiroyuki
2011-03-18 14:29                     ` Vivek Goyal
2011-03-18 14:46                       ` Johannes Weiner
2011-03-17 14:46                   ` Jan Kara
2011-03-17 17:12                     ` Vivek Goyal
2011-03-17 17:59                       ` Jan Kara
2011-03-17 18:15                         ` Vivek Goyal
2011-03-15 21:23         ` Vivek Goyal
2011-03-15 23:11           ` Vivek Goyal
2011-03-15  1:56     ` KAMEZAWA Hiroyuki
2011-03-15  2:51       ` Greg Thelen
2011-03-15  2:54         ` KAMEZAWA Hiroyuki
2011-03-16 12:45 ` Johannes Weiner
