From: Vladimir Davydov <vdavydov@parallels.com>
To: Dave Chinner <david@fromorbit.com>
Cc: dchinner@redhat.com, hannes@cmpxchg.org, mhocko@suse.cz,
akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, cgroups@vger.kernel.org, devel@openvz.org,
glommer@openvz.org, glommer@gmail.com,
Al Viro <viro@zeniv.linux.org.uk>,
Balbir Singh <bsingharora@gmail.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Subject: Re: [PATCH v13 11/16] mm: list_lru: add per-memcg lists
Date: Thu, 12 Dec 2013 13:50:10 +0400 [thread overview]
Message-ID: <52A986D2.6010606@parallels.com> (raw)
In-Reply-To: <20131212014023.GG31386@dastard>
On 12/12/2013 05:40 AM, Dave Chinner wrote:
>>>> +int list_lru_grow_memcg(struct list_lru *lru, size_t new_array_size)
>>>> +{
>>>> + int i;
>>>> + struct list_lru_one **memcg_lrus;
>>>> +
>>>> + memcg_lrus = kcalloc(new_array_size, sizeof(*memcg_lrus), GFP_KERNEL);
>>>> + if (!memcg_lrus)
>>>> + return -ENOMEM;
>>>> +
>>>> + if (lru->memcg) {
>>>> + for_each_memcg_cache_index(i) {
>>>> + if (lru->memcg[i])
>>>> + memcg_lrus[i] = lru->memcg[i];
>>>> + }
>>>> + }
>>> Um, krealloc()?
>> Not exactly. We have to keep the old version until we call sync_rcu.
> Ah, of course. Could you add a big comment explaining this so that
> the next reader doesn't suggest replacing it with krealloc(), too?
Sure.
>>>> +int memcg_list_lru_init(struct list_lru *lru)
>>>> +{
>>>> + int err = 0;
>>>> + int i;
>>>> + struct mem_cgroup *memcg;
>>>> +
>>>> + lru->memcg = NULL;
>>>> + lru->memcg_old = NULL;
>>>> +
>>>> + mutex_lock(&memcg_create_mutex);
>>>> + if (!memcg_kmem_enabled())
>>>> + goto out_list_add;
>>>> +
>>>> + lru->memcg = kcalloc(memcg_limited_groups_array_size,
>>>> + sizeof(*lru->memcg), GFP_KERNEL);
>>>> + if (!lru->memcg) {
>>>> + err = -ENOMEM;
>>>> + goto out;
>>>> + }
>>>> +
>>>> + for_each_mem_cgroup(memcg) {
>>>> + int memcg_id;
>>>> +
>>>> + memcg_id = memcg_cache_id(memcg);
>>>> + if (memcg_id < 0)
>>>> + continue;
>>>> +
>>>> + err = list_lru_memcg_alloc(lru, memcg_id);
>>>> + if (err) {
>>>> + mem_cgroup_iter_break(NULL, memcg);
>>>> + goto out_free_lru_memcg;
>>>> + }
>>>> + }
>>>> +out_list_add:
>>>> + list_add(&lru->list, &all_memcg_lrus);
>>>> +out:
>>>> + mutex_unlock(&memcg_create_mutex);
>>>> + return err;
>>>> +
>>>> +out_free_lru_memcg:
>>>> + for (i = 0; i < memcg_limited_groups_array_size; i++)
>>>> + list_lru_memcg_free(lru, i);
>>>> + kfree(lru->memcg);
>>>> + goto out;
>>>> +}
>>> That will probably scale even worse. Think about what happens when we
>>> try to mount a bunch of filesystems in parallel - they will now
>>> serialise completely on this memcg_create_mutex instantiating memcg
>>> lists inside list_lru_init().
>> Yes, the scalability seems to be the main problem here. I have a couple
>> of thoughts on how it could be improved. Here they go:
>>
>> 1) We can turn memcg_create_mutex to rw-semaphore (or introduce an
>> rw-semaphore, which we would take for modifying list_lru's) and take it
>> for reading in memcg_list_lru_init() and for writing when we create a
>> new memcg (memcg_init_all_lrus()).
>> This would remove the bottleneck from the mount path, but every memcg
>> creation would still iterate over all LRUs under a memcg mutex. So I
>> guess it is not an option, isn't it?
> Right - it's not so much that there is a mutex to protect the init,
> it's how long it's held that will be the issue. I mean, we don't
> need to hold the memcg_create_mutex until we've completely
> initialised the lru structure and are ready to add it to the
> all_memcg_lrus list, right?
>
> i.e. restructing it so that you don't need to hold the mutex until
> you make the LRU list globally visible would solve the problem just
> as well. if we can iterate the memcgs lists without holding a lock,
> then we can init the per-memcg lru lists without holding a lock
> because nobody will access them through the list_lru structure
> because it's not yet been published.
>
> That keeps the locking simple, and we get scalability because we've
> reduced the lock's scope to just a few instructures instead of a
> memcg iteration and a heap of memory allocation....
Unfortunately that's not that easy as it seems to be :-(
Currently I hold the memcg_create_mutex while initializing per-memcg
LRUs in memcg_list_lru_init() in order to be sure that I won't miss a
memcg that is added during initialization.
I mean, let's try to move per-memcg LRUs allocation outside the lock and
only register the LRU there:
memcg_list_lru_init():
1) allocate lru->memcg array
2) for_each_kmem_active_memcg(m)
allocate lru->memcg[m]
3) lock memcg_create_mutex
add lru to all_memcg_lrus_list
unlock memcg_create_mutex
Then if a new kmem-active memcg is added after step 2 and before step 3,
it won't see the new lru, because it has not been registered yet, and
thus won't initialize its list in this lru. As a result, we will end up
with a partially initialized list_lru. Note that this will happen even
if the whole memcg initialization proceeds under the memcg_create_mutex.
Provided we could freeze memcg_limited_groups_array_size, it would be
possible to fix this problem by swapping steps 2 and 3 and making step 2
initialize lru->memcg[m] using cmpxchg() only if it was not initialized.
However we still have to hold the memcg_create_mutex during the whole
kmemcg activation path (memcg_init_all_lrus()).
Let's see if we can get rid of the lock in memcg_init_all_lrus() by
making the all_memcg_lrus RCU-protected so that we could iterate over
all list_lrus w/o holding any locks and turn memcg_init_all_lrus() to
something like this:
memcg_init_all_lrus():
1) for_each_list_lru_rcu(lru)
allocate lru->memcg[new_memcg_id]
2) mark new_memcg as kmem-active
The problem is that if memcg_list_lru_init(new_lru) starts and completes
between steps 1 and 2, we will not initialize
new_lru->memcg[new_memcg_id] neither in memcg_init_all_lrus() nor in
memcg_list_lru_init().
The problem here is that on kmemcg creation (memcg_init_all_lrus()) we
have to iterate over all list_lrus while on list_lru creation
(memcg_list_lru_init()) we have to iterate over all memcgs. Currently I
can't figure out how we can do it w/o holding any mutexes at least while
calling one of these functions, but I'm keep thinking on it.
>
>> 2) We could use cmpxchg() instead of a mutex in list_lru_init_memcg()
>> and memcg_init_all_lrus() to assure a memcg LRU is initialized only
>> once. But again, this would not remove iteration over all LRUs from
>> memcg_init_all_lrus().
>>
>> 3) We can try to initialize per-memcg LRUs lazily only when we actually
>> need them, similar to how we now handle per-memcg kmem caches creation.
>> If list_lru_add() cannot find appropriate LRU, it will schedule a
>> background worker for its initialization.
> I'd prefer not to add complexity to the list_lru_add() path here.
> It's frequently called, so it's a code hot path and so we should
> keep it as simply as possible.
>
>> The benefits of this approach are clear: we do not introduce any
>> bottlenecks, and we lower memory consumption in case different memcgs
>> use different mounts exclusively.
>> However, there is one thing that bothers me. Some objects accounted to a
>> memcg will go into the global LRU, which will postpone actual memcg
>> destruction until global reclaim.
> Yeah, that's messy. best to avoid it by doing the work at list init
> time, IMO.
I also think so, because the benefits of this are rather doubtful:
1) We actually remove bottlenecks from slow paths (memcg creation and fs
mount) executed relatively rare.
2) In contrast to kmem_cache, list_lru_node is a very small structure so
that making per-memcg lists initialized lazily would not save us much
memory.
But currently I guess it would be the easiest way to get rid of the
memcg_create_mutex held in the initialization paths.
Thanks.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2013-12-12 9:50 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-12-09 8:05 [PATCH v13 00/16] kmemcg shrinkers Vladimir Davydov
2013-12-09 8:05 ` [PATCH v13 01/16] memcg: make cache index determination more robust Vladimir Davydov
2013-12-09 8:05 ` [PATCH v13 02/16] memcg: consolidate callers of memcg_cache_id Vladimir Davydov
2013-12-09 8:05 ` [PATCH v13 03/16] memcg: move initialization to memcg creation Vladimir Davydov
2013-12-09 8:05 ` [PATCH v13 04/16] memcg: move memcg_caches_array_size() function Vladimir Davydov
2013-12-10 8:04 ` Glauber Costa
2013-12-09 8:05 ` [PATCH v13 05/16] vmscan: move call to shrink_slab() to shrink_zones() Vladimir Davydov
2013-12-10 8:10 ` Glauber Costa
2013-12-09 8:05 ` [PATCH v13 06/16] vmscan: remove shrink_control arg from do_try_to_free_pages() Vladimir Davydov
2013-12-09 8:05 ` [PATCH v13 07/16] vmscan: call NUMA-unaware shrinkers irrespective of nodemask Vladimir Davydov
2013-12-09 8:05 ` [PATCH v13 08/16] mm: list_lru: require shrink_control in count, walk functions Vladimir Davydov
2013-12-10 1:36 ` Dave Chinner
2013-12-09 8:05 ` [PATCH v13 09/16] fs: consolidate {nr,free}_cached_objects args in shrink_control Vladimir Davydov
2013-12-10 1:38 ` Dave Chinner
2013-12-09 8:05 ` [PATCH v13 10/16] vmscan: shrink slab on memcg pressure Vladimir Davydov
2013-12-10 2:11 ` Dave Chinner
2013-12-09 8:05 ` [PATCH v13 11/16] mm: list_lru: add per-memcg lists Vladimir Davydov
2013-12-10 5:00 ` Dave Chinner
2013-12-10 10:05 ` Vladimir Davydov
2013-12-12 1:40 ` Dave Chinner
2013-12-12 9:50 ` Vladimir Davydov [this message]
2013-12-12 20:24 ` Vladimir Davydov
2013-12-14 20:03 ` Vladimir Davydov
2013-12-12 20:48 ` Glauber Costa
2013-12-09 8:05 ` [PATCH v13 12/16] fs: mark list_lru based shrinkers memcg aware Vladimir Davydov
2013-12-10 4:17 ` Dave Chinner
2013-12-11 11:08 ` Steven Whitehouse
2013-12-09 8:05 ` [PATCH v13 13/16] vmscan: take at least one pass with shrinkers Vladimir Davydov
2013-12-10 4:18 ` Dave Chinner
2013-12-10 11:50 ` Vladimir Davydov
2013-12-10 12:38 ` Glauber Costa
2013-12-09 8:05 ` [PATCH v13 14/16] vmpressure: in-kernel notifications Vladimir Davydov
2013-12-10 8:12 ` Glauber Costa
2013-12-09 8:05 ` [PATCH v13 15/16] memcg: reap dead memcgs upon global memory pressure Vladimir Davydov
2013-12-09 8:05 ` [PATCH v13 16/16] memcg: flush memcg items upon memcg destruction Vladimir Davydov
2013-12-10 8:02 ` [PATCH v13 00/16] kmemcg shrinkers Glauber Costa
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=52A986D2.6010606@parallels.com \
--to=vdavydov@parallels.com \
--cc=akpm@linux-foundation.org \
--cc=bsingharora@gmail.com \
--cc=cgroups@vger.kernel.org \
--cc=david@fromorbit.com \
--cc=dchinner@redhat.com \
--cc=devel@openvz.org \
--cc=glommer@gmail.com \
--cc=glommer@openvz.org \
--cc=hannes@cmpxchg.org \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.cz \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox