From: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: Sha Zhengju <handai.szj@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
akpm@linux-foundation.org, gthelen@google.com,
fengguang.wu@intel.com, glommer@parallels.com,
dchinner@redhat.com, Sha Zhengju <handai.szj@taobao.com>
Subject: Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
Date: Mon, 07 Jan 2013 16:25:20 +0900 [thread overview]
Message-ID: <50EA7860.6030300@jp.fujitsu.com> (raw)
In-Reply-To: <CAFj3OHXKyMO3gwghiBAmbowvqko-JqLtKroX2kzin1rk=q9tZg@mail.gmail.com>
(2013/01/05 13:48), Sha Zhengju wrote:
> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>> From: Sha Zhengju <handai.szj@taobao.com>
>>>
>>> This patch adds memcg routines to count dirty pages, which allows the memory
>>> controller to maintain an accurate view of the amount of its dirty memory, and
>>> can provide some info for users while the cgroup's direct reclaim is working.
>>
>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>> right? It is true that this is direct reclaim but it is not clear to me
>
> Yes, I meant memcg hard/soft reclaim here which is triggered directly
> by allocation and is distinct from background kswapd reclaim (global).
>
>> why the usefulness should be limited to reclaim for users. I would
>> understand this if the users were in fact in-kernel users.
>>
>
> One of the reasons I'm trying to account the dirty pages is to get a
> broader overall view of memory usage, because memcg hard/soft
> reclaim may affect the response time of user applications.
> Yeah, the beneficiary can be the application administrator or kernel users. :P
>
>> [...]
>>> To prevent the AB/BA deadlock mentioned by Greg Thelen in the previous version
>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>>> So we need to take mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
>>> and __set_page_dirty_nobuffers(). But to avoid useless spinlock contention,
>>> a preliminary PageDirty() check is added.
>>
>> But there is another AA deadlock here I believe.
>> page_remove_rmap
>>   mem_cgroup_begin_update_page_stat          <<< 1
>>   set_page_dirty
>>     __set_page_dirty_buffers
>>       __set_page_dirty
>>         mem_cgroup_begin_update_page_stat    <<< 2
>>           move_lock_mem_cgroup
>>             spin_lock_irqsave(&memcg->move_lock, *flags);
>>
>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
>> because we might race with the moving charges:
>> CPU0                                        CPU1
>>
>> page_remove_rmap
>>                                             mem_cgroup_can_attach
>> mem_cgroup_begin_update_page_stat (1)
>>   rcu_read_lock
>>                                             mem_cgroup_start_move
>>                                               atomic_inc(&memcg_moving)
>>                                               atomic_inc(&memcg->moving_account)
>>                                               synchronize_rcu
>>   __mem_cgroup_begin_update_page_stat
>>     mem_cgroup_stolen             <<< TRUE
>>     move_lock_mem_cgroup
>> [...]
>> mem_cgroup_begin_update_page_stat (2)
>>   __mem_cgroup_begin_update_page_stat
>>     mem_cgroup_stolen             <<< still TRUE
>>     move_lock_mem_cgroup          <<< DEADLOCK
>> [...]
>> mem_cgroup_end_update_page_stat
>>   rcu_unlock
>>                                             # wake up from synchronize_rcu
>>                                             [...]
>>                                             mem_cgroup_move_task
>>                                               mem_cgroup_move_charge
>>                                                 walk_page_range
>>                                                   mem_cgroup_move_account
>>                                                     move_lock_mem_cgroup
>>
>>
>> Maybe I have missed some other locking which would prevent this from
>> happening but the locking relations are really complicated in this area
>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>> recursively then we need a fat comment which justifies that.
>>
>
> Ohhh... good catch! I didn't notice there is a recursive call of
> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
> The mem_cgroup_{begin,end}_update_page_stat() design has troubled
> me a lot recently, as the lock granularity is a little bigger than I thought.
> Not only the data but also some code logic falls within the locked range,
> which may be deadlock prone. The problem will still exist if we try to add
> stat accounting for other memcg pages later, so may I be so bold as to
> suggest that we dig into the locking again...
>
> But with regard to the current lock implementation, I wonder whether we can
> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
> take the move_lock once at the beginning. IMHO we can make
> mem_cgroup_{begin,end}_update_page_stat() recursion-aware, and what I'm
> thinking of now is changing memcg->move_lock from the original spinlock
> to a rw-spinlock:
> mem_cgroup_{begin,end}_update_page_stat() take the read lock, which makes
> them reentrant, and the task-moving side takes the write lock.
> Then the race may look like the following:
>
> CPU0                                        CPU1
>
> page_remove_rmap
>                                             mem_cgroup_can_attach
> mem_cgroup_begin_update_page_stat (1)
>   rcu_read_lock
>                                             mem_cgroup_start_move
>                                               atomic_inc(&memcg_moving)
>                                               atomic_inc(&memcg->moving_account)
>                                               synchronize_rcu
>   __mem_cgroup_begin_update_page_stat
>     mem_cgroup_stolen     <<< TRUE
>     move_lock_mem_cgroup  <<< read-spinlock success
> [...]
> mem_cgroup_begin_update_page_stat (2)
>   __mem_cgroup_begin_update_page_stat
>     mem_cgroup_stolen     <<< still TRUE
>     move_lock_mem_cgroup  <<< read-spinlock success
> [...]
> mem_cgroup_end_update_page_stat  <<< locked = true, unlock
>   rcu_unlock
>                                             # wake up from synchronize_rcu
>                                             [...]
>                                             mem_cgroup_move_task
>                                               mem_cgroup_move_charge
>                                                 walk_page_range
>                                                   mem_cgroup_move_account
>                                                     move_lock_mem_cgroup  <<< write-spinlock
>
> AFAICS, the deadlock seems to be avoided by both the RCU and the rwlock.
> Is there anything I missed?
>
An rwlock will work with the nesting, but it seems ugly to do updates under a read-lock.
How about this straightforward approach?
==
/*
 * Once a thread takes memcg_move_lock on a memcg, it can take the lock on
 * the same memcg again for nested calls.
 */
static void move_lock_mem_cgroup(struct mem_cgroup *memcg, unsigned long *flags)
{
	current->memcg_move_lock_nested += 1;
	if (current->memcg_move_lock_nested > 1) {
		VM_BUG_ON(current->move_locked_memcg != memcg);
		return;
	}
	spin_lock_irqsave(&memcg_move_lock, *flags);
	current->move_locked_memcg = memcg;
}

static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, unsigned long *flags)
{
	current->memcg_move_lock_nested -= 1;
	if (!current->memcg_move_lock_nested) {
		current->move_locked_memcg = NULL;
		spin_unlock_irqrestore(&memcg_move_lock, *flags);
	}
}
==
But, hmm, this kind of (ugly) hack may cause trouble as Hugh said.
Thanks,
-Kame