From: Suleiman Souhlal <suleiman@google.com>
To: Glauber Costa <glommer@parallels.com>
Cc: Suleiman Souhlal <ssouhlal@freebsd.org>,
cgroups@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com,
cl@linux.com, yinghan@google.com, hughd@google.com,
gthelen@google.com, peterz@infradead.org,
dan.magenheimer@oracle.com, hannes@cmpxchg.org, mgorman@suse.de,
James.Bottomley@hansenpartnership.com, linux-mm@kvack.org,
devel@openvz.org, linux-kernel@vger.kernel.org,
Pekka Enberg <penberg@kernel.org>
Subject: Re: [PATCH v2 07/13] memcg: Slab accounting.
Date: Tue, 13 Mar 2012 15:50:01 -0700 [thread overview]
Message-ID: <CABCjUKDsYyg4ONGTEeh1oen-L=OuBrP53qRdpHAT8AYYQ-JqWA@mail.gmail.com> (raw)
In-Reply-To: <4F5C7D82.7030904@parallels.com>
On Sun, Mar 11, 2012 at 3:25 AM, Glauber Costa <glommer@parallels.com> wrote:
> On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
>> +static inline void
>> +mem_cgroup_kmem_cache_prepare_sleep(struct kmem_cache *cachep)
>> +{
>> +	/*
>> +	 * Make sure the cache doesn't get freed while we have interrupts
>> +	 * enabled.
>> +	 */
>> +	kmem_cache_get_ref(cachep);
>> +	rcu_read_unlock();
>> +}
>
>
> Is this really needed? After this function call in slab.c, the slab code
> itself accesses cachep a thousand times. If it could be freed, it would
> already explode today for other reasons?
> Am I missing something here?
We need this because once we drop the rcu_read_lock and go to sleep,
the memcg could get deleted, which could lead to the cachep getting
deleted as well.
So, we need to grab a reference to the cache, to make sure that the
cache doesn't disappear from under us.
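For reference, the counterpart that runs once we are done sleeping
would be roughly the reverse (a sketch using the names from this
series; the helper name below is just my shorthand and may not match
the patch exactly):

static inline void
mem_cgroup_kmem_cache_finish_sleep(struct kmem_cache *cachep)
{
	/* Re-enter the RCU read section and drop the temporary reference. */
	rcu_read_lock();
	kmem_cache_drop_ref(cachep);
}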
>> diff --git a/init/Kconfig b/init/Kconfig
>> index 3f42cd6..e7eb652 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -705,7 +705,7 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>>  	  then swapaccount=0 does the trick).
>>
>>  config CGROUP_MEM_RES_CTLR_KMEM
>>  	bool "Memory Resource Controller Kernel Memory accounting (EXPERIMENTAL)"
>> -	depends on CGROUP_MEM_RES_CTLR && EXPERIMENTAL
>> +	depends on CGROUP_MEM_RES_CTLR && EXPERIMENTAL && !SLOB
>
> Orthogonal question: Will we ever want this (SLOB) ?
I honestly don't know why someone would want to use this and slob at
the same time.
It really doesn't seem like a required feature, in my opinion.
Especially at first.
>> +static struct kmem_cache *
>> +memcg_create_kmem_cache(struct mem_cgroup *memcg, struct kmem_cache *cachep)
>> +{
>> +	struct kmem_cache *new_cachep;
>> +	struct dentry *dentry;
>> +	char *name;
>> +	int idx;
>> +
>> +	idx = cachep->memcg_params.id;
>> +
>> +	dentry = memcg->css.cgroup->dentry;
>> +	BUG_ON(dentry == NULL);
>> +
>> +	/* Preallocate the space for "dead" at the end */
>> +	name = kasprintf(GFP_KERNEL, "%s(%d:%s)dead",
>> +	    cachep->name, css_id(&memcg->css), dentry->d_name.name);
>> +	if (name == NULL)
>> +		return cachep;
>> +	/* Remove "dead" */
>> +	name[strlen(name) - 4] = '\0';
>> +
>> +	new_cachep = kmem_cache_create_memcg(cachep, name);
>> +
>> +	/*
>> +	 * Another CPU is creating the same cache?
>> +	 * We'll use it next time.
>> +	 */
>
> This comment is a bit misleading. Is it really the only reason
> it can fail?
>
> The impression I got is that it can also fail under the normal conditions in
> which kmem_cache_create() fails.
kmem_cache_create() isn't expected to fail often.
I wasn't trying to make an exhaustive list of why this condition can
happen, just pointing out what I think is the most common cause.
>> +/*
>> + * Enqueue the creation of a per-memcg kmem_cache.
>> + * Called with rcu_read_lock.
>> + */
>> +static void
>> +memcg_create_cache_enqueue(struct mem_cgroup *memcg, struct kmem_cache *cachep)
>> +{
>> +	struct create_work *cw;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&create_queue_lock, flags);
>
> If we can sleep, why not just create the cache now?
>
> Maybe it would be better to split this in two, and create the cache if
> possible, and a worker if not possible. Then w
That's how I had it in my initial patch, but I was under the
impression that you preferred that we always kick off the creation to
the workqueue.
Which way do you prefer?
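If we do split it, I imagine it would look roughly like this (a sketch
only; memcg_get_or_create_cache is a hypothetical helper name):

static struct kmem_cache *
memcg_get_or_create_cache(struct mem_cgroup *memcg, struct kmem_cache *cachep,
    gfp_t flags)
{
	/*
	 * Caller can sleep: create the per-memcg cache synchronously
	 * (after dropping the RCU read lock, as in the prepare_sleep
	 * helper above).
	 */
	if (flags & __GFP_WAIT)
		return memcg_create_kmem_cache(memcg, cachep);

	/*
	 * Atomic context: defer the creation to the workqueue and fall
	 * back to the parent cache for this allocation.
	 */
	memcg_create_cache_enqueue(memcg, cachep);
	return cachep;
}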
>> @@ -1756,17 +1765,23 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
>>  	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
>>  		flags |= __GFP_RECLAIMABLE;
>>
>> +	nr_pages = (1 << cachep->gfporder);
>> +	if (!mem_cgroup_charge_slab(cachep, flags, nr_pages * PAGE_SIZE))
>> +		return NULL;
>> +
>>  	page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK,
>>  			cachep->gfporder);
>> -	if (!page)
>> +	if (!page) {
>> +		mem_cgroup_uncharge_slab(cachep, nr_pages * PAGE_SIZE);
>>  		return NULL;
>> +	}
>
>
>
> Can't the following happen:
>
> *) mem_cgroup_charge_slab() is the first one to touch the slab.
> Therefore, this first one is billed to root.
> *) A slab is queued for creation.
> *) alloc_pages sleep.
> *) our workers run, and create the cache, therefore filling
> cachep->memcg_param.memcg
> *) alloc_pages still can't allocate.
> *) uncharge tries to uncharge from cachep->memcg_param.memcg,
> which doesn't have any charges...
>
> Unless you have a strong opposition to this, to avoid this kind of
> corner cases, we could do what I was doing in the slub:
> Allocate the page first, and then account it.
> (freeing the page if it fails).
>
> I know it is not the way it is done for the user pages, but I believe it to
> be better suited for the slab.
I don't think the situation you're describing can happen, because the
memcg caches get created and selected at the very beginning of the
slab allocation, in mem_cgroup_get_kmem_cache(), and not in
mem_cgroup_charge_slab(), which runs much later.
Once we are in mem_cgroup_charge_slab() we know that the allocation
will be charged to the cgroup.
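In other words, the ordering on the allocation path is roughly this
(a sketch; signatures simplified):

void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
	/*
	 * 1. Right at the start, switch to the per-memcg cache (or
	 *    queue its creation and keep using the parent this time).
	 */
	cachep = mem_cgroup_get_kmem_cache(cachep, flags);

	/*
	 * 2. Much later, and only when a new slab page is needed,
	 *    kmem_getpages() calls mem_cgroup_charge_slab() on whatever
	 *    cachep we picked here, so the charge and any uncharge
	 *    always hit the same cache.
	 */
	return __cache_alloc(cachep, flags, __builtin_return_address(0));
}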
>> @@ -2269,10 +2288,12 @@ kmem_cache_create (const char *name, size_t size, size_t align,
>>  		}
>>
>>  		if (!strcmp(pc->name, name)) {
>> -			printk(KERN_ERR
>> -			       "kmem_cache_create: duplicate cache %s\n", name);
>> -			dump_stack();
>> -			goto oops;
>> +			if (!memcg) {
>> +				printk(KERN_ERR "kmem_cache_create: duplicate"
>> +				       " cache %s\n", name);
>> +				dump_stack();
>> +				goto oops;
>> +			}
>
> Why? Since we are appending the memcg name at the end anyway, duplicates
> still aren't expected.
Duplicates can happen if you have hierarchies, because we're only
appending the basename of the cgroup.
>> @@ -2703,12 +2787,74 @@ void kmem_cache_destroy(struct kmem_cache *cachep)
>>  	if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU))
>>  		rcu_barrier();
>>
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +	/* Not a memcg cache */
>> +	if (cachep->memcg_params.id != -1) {
>> +		__clear_bit(cachep->memcg_params.id, cache_types);
>> +		mem_cgroup_flush_cache_create_queue();
>> +	}
>> +#endif
>
>
> This will clear the id when a leaf cache is destroyed. It seems it is not
> what we want, right? We want this id to be cleared only when
> the parent cache is gone.
id is != -1 only for parent caches (that's what the comment is trying
to point out), so the id only gets cleared when the parent cache is
destroyed, not when a leaf cache goes away.
I will improve the comment.
>> +static void
>> +kmem_cache_destroy_work_func(struct work_struct *w)
>> +{
>> +	struct kmem_cache *cachep;
>> +	char *name;
>> +
>> +	spin_lock_irq(&destroy_lock);
>> +	while (!list_empty(&destroyed_caches)) {
>> +		cachep = container_of(list_first_entry(&destroyed_caches,
>> +		    struct mem_cgroup_cache_params, destroyed_list),
>> +		    struct kmem_cache, memcg_params);
>> +		name = (char *)cachep->name;
>> +		list_del(&cachep->memcg_params.destroyed_list);
>> +		spin_unlock_irq(&destroy_lock);
>> +		synchronize_rcu();
>> +		kmem_cache_destroy(cachep);
>
> ^^^^^^
> will destroy the id.
See my previous comment.
>> @@ -3866,9 +4030,35 @@ void kmem_cache_free(struct kmem_cache *cachep, void *objp)
>>
>>  	local_irq_save(flags);
>>  	debug_check_no_locks_freed(objp, obj_size(cachep));
>> +
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +	{
>> +		struct kmem_cache *actual_cachep;
>> +
>> +		actual_cachep = virt_to_cache(objp);
>> +		if (actual_cachep != cachep) {
>> +			VM_BUG_ON(actual_cachep->memcg_params.id != -1);
>> +			VM_BUG_ON(actual_cachep->memcg_params.orig_cache != cachep);
>> +			cachep = actual_cachep;
>> +		}
>> +		/*
>> +		 * Grab a reference so that the cache is guaranteed to stay
>> +		 * around.
>> +		 * If we are freeing the last object of a dead memcg cache,
>> +		 * the kmem_cache_drop_ref() at the end of this function
>> +		 * will end up freeing the cache.
>> +		 */
>> +		kmem_cache_get_ref(cachep);
>
> 1) Another obvious candidate to be wrapped by static_branch()...
> 2) I don't truly follow why we need those references here. Can you
> give us an example of a situation in which the cache can go away?
>
> Also note that we are making a function that used to operate mostly on
> local data now issue two atomic operations.
Yes, improving this is in my v3 TODO already.
The situation is very simple, and will happen every time we are
freeing the last object of a dead cache.
When we free the last object, kmem_freepages() will drop the last
reference, which will cause the kmem_cache to be destroyed right
there.
Grabbing an additional reference before freeing the page is just a
hack to avoid this situation.
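Concretely, the sequence being guarded against is roughly this
(a sketch, using the names from this series):

	kmem_cache_get_ref(cachep);	/* pin the cache across the free */

	/*
	 * ... the actual free happens here; if this was the last object
	 * of a dead memcg cache, kmem_freepages() would otherwise drop
	 * the last reference and destroy the kmem_cache while we are
	 * still touching it ...
	 */

	kmem_cache_drop_ref(cachep);	/* the last reference, if any, is
					 * dropped here instead, after we
					 * are done with cachep */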
It might be possible to just wrap the free path in rcu_read_lock(), or
if that isn't enough, to delay the destruction until the end. I still
have to think about this a bit more, to be sure.
Thanks for the detailed review,
-- Suleiman