From: Sha Zhengju <handai.szj@gmail.com>
To: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>, Hugh Dickins <hughd@google.com>,
Johannes Weiner <hannes@cmpxchg.org>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
akpm@linux-foundation.org, gthelen@google.com,
fengguang.wu@intel.com, glommer@parallels.com,
dchinner@redhat.com, Sha Zhengju <handai.szj@taobao.com>
Subject: Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
Date: Wed, 9 Jan 2013 23:02:56 +0800
Message-ID: <CAFj3OHXMgRG6u2YoM7y5WuPo2ZNA1yPmKRV29FYj9B6Wj_c6Lw@mail.gmail.com>
In-Reply-To: <50EA7860.6030300@jp.fujitsu.com>
On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2013/01/05 13:48), Sha Zhengju wrote:
>>
>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
>>>
>>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>>>
>>>> From: Sha Zhengju <handai.szj@taobao.com>
>>>>
>>>> This patch adds memcg routines to count dirty pages, which allows the
>>>> memory controller to maintain an accurate view of the amount of its dirty
>>>> memory and can provide some info for users while a cgroup's direct
>>>> reclaim is working.
>>>
>>>
>>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>>> right? It is true that this is direct reclaim but it is not clear to me
>>
>>
>> Yes, I meant memcg hard/soft reclaim here which is triggered directly
>> by allocation and is distinct from background kswapd reclaim (global).
>>
>>> why the usefulness should be limited to reclaim for users. I would
>>> understand this if the users were in fact in-kernel users.
>>>
>>
>> One of the reasons I'm trying to account the dirty pages is to get a
>> broader overall view of memory usage, because memcg hard/soft reclaim
>> may affect the response time of user applications.
>> Yeah, the beneficiary can be the application administrator or kernel
>> users. :P
>>
>>> [...]
>>>>
>>>> To prevent the AB/BA deadlock mentioned by Greg Thelen in the previous
>>>> version (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>>>> So we need to take mapping->tree_lock ahead of TestSetPageDirty in
>>>> __set_page_dirty() and __set_page_dirty_nobuffers(). But in order to
>>>> avoid useless spinlock contention, a preliminary PageDirty() check is
>>>> added.
>>>
>>>
>>> But there is another AA deadlock here I believe.
>>> page_remove_rmap
>>>   mem_cgroup_begin_update_page_stat          <<< 1
>>>   set_page_dirty
>>>     __set_page_dirty_buffers
>>>       __set_page_dirty
>>>         mem_cgroup_begin_update_page_stat    <<< 2
>>>           move_lock_mem_cgroup
>>>             spin_lock_irqsave(&memcg->move_lock, *flags);
>>>
>>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
>>> because we might race with the moving charges:
>>>         CPU0                                CPU1
>>> page_remove_rmap
>>>                                             mem_cgroup_can_attach
>>>   mem_cgroup_begin_update_page_stat (1)
>>>     rcu_read_lock
>>>                                               mem_cgroup_start_move
>>>                                                 atomic_inc(&memcg_moving)
>>>                                                 atomic_inc(&memcg->moving_account)
>>>                                                 synchronize_rcu
>>>     __mem_cgroup_begin_update_page_stat
>>>       mem_cgroup_stolen <<< TRUE
>>>       move_lock_mem_cgroup
>>>   [...]
>>>   mem_cgroup_begin_update_page_stat (2)
>>>     __mem_cgroup_begin_update_page_stat
>>>       mem_cgroup_stolen <<< still TRUE
>>>       move_lock_mem_cgroup <<< DEADLOCK
>>>   [...]
>>>   mem_cgroup_end_update_page_stat
>>>     rcu_unlock
>>>                                                 # wake up from synchronize_rcu
>>>   [...]
>>>                                             mem_cgroup_move_task
>>>                                               mem_cgroup_move_charge
>>>                                                 walk_page_range
>>>                                                   mem_cgroup_move_account
>>>                                                     move_lock_mem_cgroup
>>>
>>>
>>> Maybe I have missed some other locking which would prevent this from
>>> happening, but the locking relations are really complicated in this area,
>>> so if mem_cgroup_{begin,end}_update_page_stat might be called recursively
>>> then we need a fat comment which justifies that.
>>>
>>
>> Ohhh... good catch! I didn't notice there was a recursive call of
>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed me
>> a lot recently, as the lock granularity is a little bigger than I thought.
>> Not only the data but also some code logic falls within the locked region,
>> which may be deadlock prone. The problem will still exist if we try to add
>> accounting of other memcg page stats later, so may I be so bold as to
>> suggest that we dig into the locking again...
>>
>> But with regard to the current lock implementation, I doubt whether we can
>> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just take
>> move_lock once at the beginning. IMHO we can make
>> mem_cgroup_{begin,end}_update_page_stat() recursion-aware, and what I'm
>> thinking of now is changing memcg->move_lock from the original spinlock to
>> a rw-spinlock: mem_cgroup_{begin,end}_update_page_stat() takes the read
>> lock, which makes it re-entrant, while the memcg task-moving side takes
>> the write lock. Then the race may look like the following:
>>
>>         CPU0                                CPU1
>> page_remove_rmap
>>                                             mem_cgroup_can_attach
>>   mem_cgroup_begin_update_page_stat (1)
>>     rcu_read_lock
>>                                               mem_cgroup_start_move
>>                                                 atomic_inc(&memcg_moving)
>>                                                 atomic_inc(&memcg->moving_account)
>>                                                 synchronize_rcu
>>     __mem_cgroup_begin_update_page_stat
>>       mem_cgroup_stolen <<< TRUE
>>       move_lock_mem_cgroup <<<< read-spinlock success
>>   [...]
>>   mem_cgroup_begin_update_page_stat (2)
>>     __mem_cgroup_begin_update_page_stat
>>       mem_cgroup_stolen <<< still TRUE
>>       move_lock_mem_cgroup <<<< read-spinlock success
>>   [...]
>>   mem_cgroup_end_update_page_stat <<< locked = true, unlock
>>     rcu_unlock
>>                                                 # wake up from synchronize_rcu
>>   [...]
>>                                             mem_cgroup_move_task
>>                                               mem_cgroup_move_charge
>>                                                 walk_page_range
>>                                                   mem_cgroup_move_account
>>                                                     move_lock_mem_cgroup <<< write-spinlock
>>
>>
>> AFAICS, the deadlock seems to be avoided by both the RCU and the rwlock.
>> Is there anything I've missed?
>>
>
> rwlock will work with the nesting, but it seems ugly to do updates under a
> read-lock.
>
> How about something straightforward like this?
> ==
> /*
>  * Once a thread takes move_lock_mem_cgroup() on a memcg, it can take the
>  * lock on the same memcg again for nested calls.
>  */
> static void move_lock_mem_cgroup(memcg, flags)
> {
>         current->memcg_move_lock_nested += 1;
>         if (current->memcg_move_lock_nested > 1) {
>                 VM_BUG_ON(current->move_locked_memcg != memcg);
>                 return;
>         }
>         spin_lock_irqsave(&memcg_move_lock, *flags);
>         current->move_locked_memcg = memcg;
> }
>
> static void move_unlock_mem_cgroup(memcg, flags)
> {
>         current->memcg_move_lock_nested -= 1;
>         if (!current->memcg_move_lock_nested) {
>                 current->move_locked_memcg = NULL;
>                 spin_unlock_irqrestore(&memcg_move_lock, *flags);
>         }
> }
>
Do we need to add two fields (current->memcg_move_lock_nested and
current->move_locked_memcg) to 'struct task_struct'? Is that feasible?
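(Just to spell out what that would mean, I imagine something like the
following two new members in the task struct; the names are only taken
from your sketch and are purely illustrative:)

#ifdef CONFIG_MEMCG
        /* nesting depth of move_lock_mem_cgroup() held by this task */
        int                     memcg_move_lock_nested;
        /* memcg whose move_lock this task currently holds, or NULL */
        struct mem_cgroup       *move_locked_memcg;
#endif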
Now I'm thinking about another synchronization proposal for the memcg page
stat updater and move_account, which seems to deal with both the recursion
issue and the deadlock:
        CPU A                                   CPU B

  move_lock_mem_cgroup
  old_memcg = pc->mem_cgroup
  TestSetPageDirty(page)
  move_unlock_mem_cgroup
                                        move_lock_mem_cgroup
                                        if (PageDirty)
                                                old_memcg->nr_dirty--
                                                new_memcg->nr_dirty++
                                        pc->mem_cgroup = new_memcg
                                        move_unlock_mem_cgroup

  old_memcg->nr_dirty++
So nr_dirty of old_memcg may go negative for a very short period (the
'old_memcg->nr_dirty--' done by CPU B), but it will be corrected soon by
CPU A, and the final figure of memcg->nr_dirty is correct.
Meanwhile the move_lock only protects saving old_memcg and
TestSetPageDirty within its critical section, without any irrelevant
logic, so the lock ordering and deadlock issues can be handled easily
(a rough sketch of both sides follows below).
But I'm not sure whether I've missed some race conditions; any comments
are welcome. :)
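To make it a bit more concrete, here is a rough sketch of what I have in
mind for the two sides under this scheme. The names here
(mem_cgroup_set_page_dirty, mem_cgroup_move_dirty_stat,
MEM_CGROUP_STAT_FILE_DIRTY) are only meant to illustrate what the patch
would add, not existing interfaces:

/* stat updater side, e.g. called from __set_page_dirty() */
static void mem_cgroup_set_page_dirty(struct page *page)
{
        struct page_cgroup *pc = lookup_page_cgroup(page);
        struct mem_cgroup *memcg;
        unsigned long flags;
        bool newly_dirty;

        memcg = pc->mem_cgroup;
        move_lock_mem_cgroup(memcg, &flags);
        /* only the old memcg snapshot and the page flag are under the lock */
        newly_dirty = !TestSetPageDirty(page);
        move_unlock_mem_cgroup(memcg, &flags);

        /*
         * Updated outside the lock: a racing move_account may already have
         * decremented this counter, so it can go negative for a short while,
         * but it converges to the correct value here.
         */
        if (newly_dirty)
                this_cpu_inc(memcg->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
}

/* move_account side (simplified), called with the move_lock held */
static void mem_cgroup_move_dirty_stat(struct page *page,
                                       struct page_cgroup *pc,
                                       struct mem_cgroup *old_memcg,
                                       struct mem_cgroup *new_memcg)
{
        if (PageDirty(page)) {
                __this_cpu_dec(old_memcg->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
                __this_cpu_inc(new_memcg->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
        }
        pc->mem_cgroup = new_memcg;
}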
--
Thanks,
Sha