Date: Mon, 07 Jan 2013 16:25:20 +0900
From: Kamezawa Hiroyuki
To: Sha Zhengju
Cc: Michal Hocko, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
    linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org,
    gthelen@google.com, fengguang.wu@intel.com, glommer@parallels.com,
    dchinner@redhat.com, Sha Zhengju
Subject: Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
Message-ID: <50EA7860.6030300@jp.fujitsu.com>
References: <1356455919-14445-1-git-send-email-handai.szj@taobao.com>
 <1356456367-14660-1-git-send-email-handai.szj@taobao.com>
 <20130102104421.GC22160@dhcp22.suse.cz>

(2013/01/05 13:48), Sha Zhengju wrote:
> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko wrote:
>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>> From: Sha Zhengju
>>>
>>> This patch adds memcg routines to count dirty pages, which allows the memory
>>> controller to maintain an accurate view of the amount of its dirty memory and
>>> can provide some info for users while a cgroup's direct reclaim is working.
>>
>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>> right? It is true that this is direct reclaim but it is not clear to me
>
> Yes, I meant memcg hard/soft reclaim here, which is triggered directly
> by allocation and is distinct from background kswapd reclaim (global).
>
>> why the usefulness should be limited to the reclaim for users. I would
>> understand this if the users were in fact in-kernel users.
>>
>
> One of the reasons I'm trying to account the dirty pages is to get a
> broader overall view of memory usage, because memcg hard/soft
> reclaim may affect the response time of user applications.
> Yeah, the beneficiary can be the application administrator or kernel users. :P
>
>> [...]
>>> To prevent the AB/BA deadlock mentioned by Greg Thelen in the previous version
>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order to:
>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>>> So we need to take mapping->tree_lock ahead of TestSetPageDirty in
>>> __set_page_dirty() and __set_page_dirty_nobuffers(). But in order to avoid
>>> useless spinlock contention, a preparatory PageDirty() check is added.
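(Side note: to make the ordering above concrete, here is a minimal, untested
sketch of what __set_page_dirty() could look like with the new lock order and
the unlocked PageDirty() pre-check. The function name and details below are
only illustrative, not the actual patch; mem_cgroup_{begin,end}_update_page_stat()
and account_page_dirtied() are the existing helpers.)

==
/*
 * Illustration only: mapping->tree_lock is taken before TestSetPageDirty(),
 * so memcg->move_lock (taken inside mem_cgroup_begin_update_page_stat())
 * always nests inside tree_lock, matching the stated lock order.  The
 * unlocked PageDirty() pre-check avoids taking any lock at all when the
 * page is already dirty.
 */
static int __set_page_dirty_sketch(struct page *page,
				   struct address_space *mapping, int warn)
{
	bool locked;
	unsigned long flags;
	int ret = 0;

	if (PageDirty(page))	/* cheap pre-check, no locks taken */
		return 0;

	spin_lock_irq(&mapping->tree_lock);
	mem_cgroup_begin_update_page_stat(page, &locked, &flags);

	if (!TestSetPageDirty(page) && page->mapping) {
		WARN_ON_ONCE(warn && !PageUptodate(page));
		/* with this series, also bumps the memcg dirty counter */
		account_page_dirtied(page, mapping);
		radix_tree_tag_set(&mapping->page_tree, page_index(page),
				   PAGECACHE_TAG_DIRTY);
		ret = 1;
	}

	mem_cgroup_end_update_page_stat(page, &locked, &flags);
	spin_unlock_irq(&mapping->tree_lock);

	if (ret)
		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
	return ret;
}
==

The same pattern would apply to __set_page_dirty_nobuffers().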
>>
>> But there is another AA deadlock here, I believe:
>>
>> page_remove_rmap
>>   mem_cgroup_begin_update_page_stat          <<< 1
>>   set_page_dirty
>>     __set_page_dirty_buffers
>>       __set_page_dirty
>>         mem_cgroup_begin_update_page_stat    <<< 2
>>           move_lock_mem_cgroup
>>             spin_lock_irqsave(&memcg->move_lock, *flags);
>>
>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
>> because we might race with the moving charges:
>>
>> CPU0                                          CPU1
>> page_remove_rmap
>>                                               mem_cgroup_can_attach
>> mem_cgroup_begin_update_page_stat (1)
>>   rcu_read_lock
>>                                                 mem_cgroup_start_move
>>                                                   atomic_inc(&memcg_moving)
>>                                                   atomic_inc(&memcg->moving_account)
>>                                                   synchronize_rcu
>>   __mem_cgroup_begin_update_page_stat
>>     mem_cgroup_stolen        <<< TRUE
>>     move_lock_mem_cgroup
>> [...]
>> mem_cgroup_begin_update_page_stat (2)
>>   __mem_cgroup_begin_update_page_stat
>>     mem_cgroup_stolen        <<< still TRUE
>>     move_lock_mem_cgroup     <<< DEADLOCK
>> [...]
>> mem_cgroup_end_update_page_stat
>>   rcu_unlock
>>                                               # wake up from synchronize_rcu
>>                                               [...]
>>                                               mem_cgroup_move_task
>>                                                 mem_cgroup_move_charge
>>                                                   walk_page_range
>>                                                     mem_cgroup_move_account
>>                                                       move_lock_mem_cgroup
>>
>> Maybe I have missed some other locking which would prevent this from
>> happening, but the locking relations are really complicated in this area,
>> so if mem_cgroup_{begin,end}_update_page_stat might be called recursively
>> then we need a fat comment which justifies that.
>>
>
> Ohhh... good catch! I didn't notice there is a recursive call of
> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
> The mem_cgroup_{begin,end}_update_page_stat() design has been bothering
> me a lot recently, as the lock granularity is a little bigger than I thought.
> Not only the resource but also some code logic falls inside the locked
> region, which may be deadlock prone. The problem will still exist if we try
> to add accounting of other memcg page stats later, so may I be so bold as to
> suggest that we dig into the locking again...
>
> But with regard to the current lock implementation, I doubt whether we can
> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just take the
> move_lock once at the beginning. IMHO we can instead make
> mem_cgroup_{begin,end}_update_page_stat() recursion-aware, and what I'm
> thinking of now is changing memcg->move_lock from the original spinlock to
> an rw-spinlock: mem_cgroup_{begin,end}_update_page_stat() takes the read
> lock, which makes it re-entrant, and the memcg task-moving side takes the
> write lock. Then the race may be as follows:
>
> CPU0                                          CPU1
> page_remove_rmap
>                                               mem_cgroup_can_attach
> mem_cgroup_begin_update_page_stat (1)
>   rcu_read_lock
>                                                 mem_cgroup_start_move
>                                                   atomic_inc(&memcg_moving)
>                                                   atomic_inc(&memcg->moving_account)
>                                                   synchronize_rcu
>   __mem_cgroup_begin_update_page_stat
>     mem_cgroup_stolen        <<< TRUE
>     move_lock_mem_cgroup     <<<< read-spinlock success
> [...]
> mem_cgroup_begin_update_page_stat (2)
>   __mem_cgroup_begin_update_page_stat
>     mem_cgroup_stolen        <<< still TRUE
>     move_lock_mem_cgroup     <<<< read-spinlock success
> [...]
> mem_cgroup_end_update_page_stat  <<< locked = true, unlock
>   rcu_unlock
>                                               # wake up from synchronize_rcu
>                                               [...]
>                                               mem_cgroup_move_task
>                                                 mem_cgroup_move_charge
>                                                   walk_page_range
>                                                     mem_cgroup_move_account
>                                                       move_lock_mem_cgroup  <<< write-spinlock
>
> AFAICS, the deadlock seems to be avoided by both the RCU and the rwlock.
> Is there anything I missed?
>
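(For what it's worth, a minimal sketch of that rw-spinlock idea, keeping the
existing helper names and assuming memcg->move_lock becomes an rwlock_t; this
is untested and only an illustration of the proposal above:)

==
/*
 * Illustration only: assume 'move_lock' in struct mem_cgroup is changed
 * from spinlock_t to rwlock_t.  Stat updaters take it for reading, so a
 * nested mem_cgroup_begin_update_page_stat() call in the dirty-accounting
 * path cannot self-deadlock; the move-charge path takes it for writing and
 * still excludes every updater.
 */
static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
				 unsigned long *flags)
{
	read_lock_irqsave(&memcg->move_lock, *flags);
}

static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
				   unsigned long *flags)
{
	read_unlock_irqrestore(&memcg->move_lock, *flags);
}

/*
 * Mover side, e.g. in mem_cgroup_move_account():
 *
 *	write_lock_irqsave(&from->move_lock, flags);
 *	... move the charge / page stats ...
 *	write_unlock_irqrestore(&from->move_lock, flags);
 */
==

This relies on rwlock_t readers being allowed to re-acquire a lock they
already hold (new readers are not blocked by waiting writers).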
An rwlock will work with the nesting, but it seems ugly to do updates under a
read-lock. How about something straightforward like this?

==
/*
 * Once a thread takes memcg_move_lock for a memcg, it can take the lock
 * again for nested calls.  (memcg_move_lock_nested and move_locked_memcg
 * would be new per-task fields in task_struct.)
 */
static void move_lock_mem_cgroup(struct mem_cgroup *memcg, unsigned long *flags)
{
	current->memcg_move_lock_nested += 1;
	if (current->memcg_move_lock_nested > 1) {
		/* nested call: we must already hold the lock for this memcg */
		VM_BUG_ON(current->move_locked_memcg != memcg);
		return;
	}
	spin_lock_irqsave(&memcg_move_lock, *flags);
	current->move_locked_memcg = memcg;
}

static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, unsigned long *flags)
{
	current->memcg_move_lock_nested -= 1;
	if (!current->memcg_move_lock_nested) {
		current->move_locked_memcg = NULL;
		spin_unlock_irqrestore(&memcg_move_lock, *flags);
	}
}
==

But, hmm, this kind of (ugly) hack may cause trouble, as Hugh said.

Thanks,
-Kame