From: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: Sha Zhengju <handai.szj@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>, Hugh Dickins <hughd@google.com>,
Johannes Weiner <hannes@cmpxchg.org>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
akpm@linux-foundation.org, gthelen@google.com,
fengguang.wu@intel.com, glommer@parallels.com,
dchinner@redhat.com, Sha Zhengju <handai.szj@taobao.com>
Subject: Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
Date: Thu, 10 Jan 2013 11:16:27 +0900 [thread overview]
Message-ID: <50EE247B.6090405@jp.fujitsu.com> (raw)
In-Reply-To: <CAFj3OHXMgRG6u2YoM7y5WuPo2ZNA1yPmKRV29FYj9B6Wj_c6Lw@mail.gmail.com>
(2013/01/10 0:02), Sha Zhengju wrote:
> On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> (2013/01/05 13:48), Sha Zhengju wrote:
>>>
>>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
>>>>
>>>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>>>>
>>>>> From: Sha Zhengju <handai.szj@taobao.com>
>>>>>
>>>>> This patch adds memcg routines to count dirty pages, which allows memory
>>>>> controller
>>>>> to maintain an accurate view of the amount of its dirty memory and can
>>>>> provide some
>>>>> info for users while cgroup's direct reclaim is working.
>>>>
>>>>
>>>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>>>> right? It is true that this is direct reclaim but it is not clear to me
>>>
>>>
>>> Yes, I meant memcg hard/soft reclaim here which is triggered directly
>>> by allocation and is distinct from background kswapd reclaim (global).
>>>
>>>> why the usefulnes should be limitted to the reclaim for users. I would
>>>> understand this if the users was in fact in-kernel users.
>>>>
>>>
>>> One of the reasons I'm trying to accounting the dirty pages is to get a
>>> more board overall view of memory usages because memcg hard/soft
>>> reclaim may have effect on response time of user application.
>>> Yeah, the beneficiary can be application administrator or kernel users.
>>> :P
>>>
>>>> [...]
>>>>>
>>>>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>>>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>>>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>>>>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in
>>>>> __set_page_dirty()
>>>>> and __set_page_dirty_nobuffers(). But in order to avoiding useless
>>>>> spinlock contention,
>>>>> a prepare PageDirty() checking is added.
>>>>
>>>>
>>>> But there is another AA deadlock here I believe.
>>>> page_remove_rmap
>>>> mem_cgroup_begin_update_page_stat <<< 1
>>>> set_page_dirty
>>>> __set_page_dirty_buffers
>>>> __set_page_dirty
>>>> mem_cgroup_begin_update_page_stat <<< 2
>>>> move_lock_mem_cgroup
>>>> spin_lock_irqsave(&memcg->move_lock, *flags);
>>>>
>>>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
>>>> because we might race with the moving charges:
>>>> CPU0 CPU1
>>>> page_remove_rmap
>>>> mem_cgroup_can_attach
>>>> mem_cgroup_begin_update_page_stat (1)
>>>> rcu_read_lock
>>>> mem_cgroup_start_move
>>>>
>>>> atomic_inc(&memcg_moving)
>>>>
>>>> atomic_inc(&memcg->moving_account)
>>>> synchronize_rcu
>>>> __mem_cgroup_begin_update_page_stat
>>>> mem_cgroup_stolen <<< TRUE
>>>> move_lock_mem_cgroup
>>>> [...]
>>>> mem_cgroup_begin_update_page_stat (2)
>>>> __mem_cgroup_begin_update_page_stat
>>>> mem_cgroup_stolen <<< still TRUE
>>>> move_lock_mem_cgroup <<< DEADLOCK
>>>> [...]
>>>> mem_cgroup_end_update_page_stat
>>>> rcu_unlock
>>>> # wake up from
>>>> synchronize_rcu
>>>> [...]
>>>> mem_cgroup_move_task
>>>> mem_cgroup_move_charge
>>>> walk_page_range
>>>>
>>>> mem_cgroup_move_account
>>>>
>>>> move_lock_mem_cgroup
>>>>
>>>>
>>>> Maybe I have missed some other locking which would prevent this from
>>>> happening but the locking relations are really complicated in this area
>>>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>>>> recursively then we need a fat comment which justifies that.
>>>>
>>>
>>> Ohhh...good catching! I didn't notice there is a recursive call of
>>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>>> me a lot recently as the lock granularity is a little bigger than I
>>> thought.
>>> Not only the resource but also some code logic is in the range of locking
>>> which may be deadlock prone. The problem still exists if we are trying to
>>> add stat account of other memcg page later, may I make bold to suggest
>>> that we dig into the lock again...
>>>
>>> But with regard to the current lock implementation, I doubt if we can we
>>> can
>>> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
>>> try to get move_lock once in the beginning. IMHO we can make
>>> mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what I'm
>>> thinking now is changing memcg->move_lock to rw-spinlock from the
>>> original spinlock:
>>> mem_cgroup_{begin,end}_update_page_stat() try to get the read lock which
>>> make it
>>> reenterable and memcg moving task side try to get the write spinlock.
>>> Then the race may be following:
>>>
>>> CPU0 CPU1
>>> page_remove_rmap
>>> mem_cgroup_can_attach
>>> mem_cgroup_begin_update_page_stat (1)
>>> rcu_read_lock
>>> mem_cgroup_start_move
>>>
>>> atomic_inc(&memcg_moving)
>>>
>>> atomic_inc(&memcg->moving_account)
>>> synchronize_rcu
>>> __mem_cgroup_begin_update_page_stat
>>> mem_cgroup_stolen <<< TRUE
>>> move_lock_mem_cgroup <<<< read-spinlock success
>>> [...]
>>> mem_cgroup_begin_update_page_stat (2)
>>> __mem_cgroup_begin_update_page_stat
>>> mem_cgroup_stolen <<< still TRUE
>>> move_lock_mem_cgroup <<<< read-spinlock success
>>>
>>> [...]
>>> mem_cgroup_end_update_page_stat <<< locked = true, unlock
>>> rcu_unlock
>>> # wake up from
>>> synchronize_rcu
>>> [...]
>>> mem_cgroup_move_task
>>> mem_cgroup_move_charge
>>> walk_page_range
>>>
>>> mem_cgroup_move_account
>>>
>>> move_lock_mem_cgroup <<< write-spinlock
>>>
>>>
>>> AFAICS, the deadlock seems to be avoided by both the rcu and rwlock.
>>> Is there anything I lost?
>>>
>>
>> rwlock will work with the nest but it seems ugly do updates under read-lock.
>>
>> How about this straightforward ?
>> ==
>> /*
>> * Once a thread takes memcg_move_lock() on a memcg, it can take the lock on
>> * the memcg again for nesting calls
>> */
>> static void move_lock_mem_cgroup(memcg, flags);
>> {
>> current->memcg_move_lock_nested += 1;
>> if (current->memcg_move_lock_nested > 1) {
>> VM_BUG_ON(current->move_locked_memcg != memcg);
>> return;
>> }
>> spin_lock_irqsave(&memcg_move_lock, &flags);
>> current->move_lockdev_memcg = memcg;
>> }
>>
>> static void move_unlock_mem_cgroup(memcg, flags)
>> {
>> current->memcg_move_lock_nested -= 1;
>> if (!current->memcg_move_lock_nested) {
>> current->move_locked_memcg = NULL;
>> spin_unlock_irqrestore(&memcg_move_lock,flags);
>> }
>> }
>>
> Does we need to add two
> fields(current->memcg_move_lock_nested/move_locked_memcg) to 'struct
> task'? Is it feasible?
>
> Now I'm thinking about another synchronization proposal for memcg page
> stat updater and move_account, which seems to deal with recursion
> issue and deadlock:
>
> CPU A CPU B
>
> move_lock_mem_cgroup
> old_memcg = pc->mem_cgroup
> TestSetPageDirty(page)
> move_unlock_mem_cgroup
> move_lock_mem_cgroup
> if (PageDirty)
>
> old_memcg->nr_dirty --
>
> new_memcg->nr_dirty ++
>
> pc->mem_cgroup = new_memcgy
> move_unlock_mem_cgroup
>
> old_memcg->nr_dirty ++
>
I'm sorry I couldn't catch why you call TestSetPageDirty()....and what CPUA/CPUB is
doing ? CPUA calls move_account() and CPUB updates stat ? If so, why move_account()
is allowed to set PG_dirty ??
>
> So nr_dirty of old_memcg may be minus in a very short
> period('old_memcg->nr_dirty --' by CPU B), but it will be revised soon
> by CPU A. And the final figures of memcg->nr_dirty is correct.
It seems both of new_memcg and old_memcg has an account for a page. Is it correct ?
> Meanwhile the move_lock only protect saving old_memcg and
> TestSetPageDirty in its critical section and without any irrelevant
> logic, so the lock order or deadlock can be handled easily.
>
> But I'm not sure whether I've lost some race conditions, any comments
> are welcomed. : )
>
Sorry I couldn't understand.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2013-01-10 2:17 UTC|newest]
Thread overview: 49+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-12-25 17:18 [PATCH V3 0/8] Per-cgroup page stat accounting Sha Zhengju
2012-12-25 17:20 ` [PATCH V3 1/8] memcg: remove MEMCG_NR_FILE_MAPPED Sha Zhengju
2012-12-25 17:22 ` [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func Sha Zhengju
2012-12-28 0:39 ` Kamezawa Hiroyuki
2013-01-05 2:34 ` Sha Zhengju
2013-01-02 9:08 ` Michal Hocko
2013-01-05 2:49 ` Sha Zhengju
2013-01-05 10:45 ` Michal Hocko
2012-12-25 17:24 ` [PATCH V3 3/8] use vfs __set_page_dirty interface instead of doing it inside filesystem Sha Zhengju
2012-12-28 0:41 ` Kamezawa Hiroyuki
2012-12-25 17:26 ` [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting Sha Zhengju
2013-01-02 10:44 ` Michal Hocko
2013-01-05 4:48 ` Sha Zhengju
2013-01-06 20:02 ` Hugh Dickins
2013-01-07 7:49 ` Kamezawa Hiroyuki
2013-01-09 5:15 ` Hugh Dickins
2013-01-09 7:24 ` Kamezawa Hiroyuki
2013-01-09 14:35 ` Sha Zhengju
2013-01-09 14:47 ` Michal Hocko
2013-01-07 7:25 ` Kamezawa Hiroyuki
2013-01-09 15:02 ` Sha Zhengju
2013-01-10 2:16 ` Kamezawa Hiroyuki [this message]
2013-01-10 4:26 ` Sha Zhengju
2013-01-10 5:03 ` Kamezawa Hiroyuki
2013-01-10 8:28 ` Sha Zhengju
2013-05-03 9:11 ` Michal Hocko
2013-05-03 9:59 ` Sha Zhengju
2013-01-06 20:07 ` Greg Thelen
2013-01-09 9:45 ` Sha Zhengju
2012-12-25 17:26 ` [PATCH V3 5/8] memcg: add per cgroup writeback " Sha Zhengju
2012-12-28 0:52 ` Kamezawa Hiroyuki
2013-01-02 11:15 ` Michal Hocko
2013-01-06 20:07 ` Greg Thelen
2013-01-09 9:08 ` Sha Zhengju
2012-12-25 17:27 ` [PATCH V3 6/8] memcg: Don't account root_mem_cgroup page statistics Sha Zhengju
2012-12-28 1:04 ` Kamezawa Hiroyuki
2013-01-05 7:38 ` Sha Zhengju
2013-01-02 12:27 ` Michal Hocko
2013-01-05 10:52 ` Sha Zhengju
2013-01-09 12:57 ` Michal Hocko
2012-12-25 17:27 ` [PATCH V3 7/8] memcg: disable memcg page stat accounting code when not in use Sha Zhengju
2012-12-28 1:06 ` Kamezawa Hiroyuki
2012-12-28 1:45 ` Kamezawa Hiroyuki
2013-01-05 11:06 ` Sha Zhengju
2013-01-02 13:35 ` Michal Hocko
2012-12-25 17:28 ` [PATCH V3 8/8] memcg: Document cgroup dirty/writeback memory statistics Sha Zhengju
2012-12-28 1:10 ` Kamezawa Hiroyuki
2013-01-06 2:55 ` Sha Zhengju
2013-01-02 13:36 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=50EE247B.6090405@jp.fujitsu.com \
--to=kamezawa.hiroyu@jp.fujitsu.com \
--cc=akpm@linux-foundation.org \
--cc=cgroups@vger.kernel.org \
--cc=dchinner@redhat.com \
--cc=fengguang.wu@intel.com \
--cc=glommer@parallels.com \
--cc=gthelen@google.com \
--cc=handai.szj@gmail.com \
--cc=handai.szj@taobao.com \
--cc=hannes@cmpxchg.org \
--cc=hughd@google.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox