From: Vladimir Davydov <vdavydov@parallels.com>
To: Greg Thelen <gthelen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@suse.cz>, Glauber Costa <glommer@gmail.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
devel@openvz.org, Christoph Lameter <cl@linux-foundation.org>,
Pekka Enberg <penberg@kernel.org>
Subject: Re: [PATCH -mm 1/4] sl[au]b: do not charge large allocations to memcg
Date: Fri, 28 Mar 2014 11:56:29 +0400
Message-ID: <53352B2D.4000306@parallels.com>
In-Reply-To: <CAHH2K0YFB9yXF_oyxhQt9EiD_kuBuK7py6ah8YEy2H70P8SC_A@mail.gmail.com>
On 03/28/2014 12:42 AM, Greg Thelen wrote:
> On Thu, Mar 27, 2014 at 12:37 AM, Vladimir Davydov
> <vdavydov@parallels.com> wrote:
>> On 03/27/2014 08:31 AM, Greg Thelen wrote:
>>> Before this change both of the following allocations are charged to
>>> memcg (assuming kmem accounting is enabled):
>>> a = kmalloc(KMALLOC_MAX_CACHE_SIZE, GFP_KERNEL)
>>> b = kmalloc(KMALLOC_MAX_CACHE_SIZE + 1, GFP_KERNEL)
>>>
>>> After this change only 'a' is charged; 'b' goes directly to page
>>> allocator which no longer does accounting.
>>
>> Why do we need to charge 'b' in the first place? Can the userspace
>> trigger such allocations massively? If there can only be one or two such
>> allocations from a cgroup, is there any point in charging them?
>
> Off the top of my head I don't know of any >8KiB kmalloc()s, so I can't
> say if they're directly triggerable by user space en masse. But we
> recently ran into some order-3 allocations in networking. The
> networking allocations used a non-generic kmem_cache (rather than
> kmalloc which started this discussion). For details, see ed98df3361f0
> ("net: use __GFP_NORETRY for high order allocations"). I can't say if
> such allocations exist in device drivers, but given the networking
> example, it's conceivable that they may (or will) exist.
Hmm, I'm also not sure about device drivers, but the sock frag pages you
mentioned are worth charging, I guess.
For such non-generic kmem allocations, we have two options: either go
with __GFP_KMEMCG, or introduce special alloc/free_kmem_pages methods,
which would be used instead of kmalloc for large allocations (e.g.
threadinfo, sock frags). I vote for the second, because I dislike having
kmemcg charging in the general allocation path.
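
Roughly something along these lines (just a sketch to illustrate the idea;
the charge/uncharge helpers below are placeholder names, not existing
functions):

/*
 * Sketch only: memcg_charge_kmem_pages()/memcg_uncharge_kmem_pages() are
 * placeholder names (assumed to return true on success) for whatever
 * charging helpers we would end up with.
 */
static inline struct page *alloc_kmem_pages(gfp_t gfp, unsigned int order)
{
	struct page *page = alloc_pages(gfp, order);

	if (page && !memcg_charge_kmem_pages(page, gfp, order)) {
		__free_pages(page, order);
		return NULL;
	}
	return page;
}

static inline void free_kmem_pages(struct page *page, unsigned int order)
{
	memcg_uncharge_kmem_pages(page, order);
	__free_pages(page, order);
}

Callers like the threadinfo allocator would then switch from
alloc_pages()/__free_pages() to these wrappers, and the generic page
allocator path would stay completely unaware of kmemcg.
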
Anyway, that brings us back to the need to reliably track arbitrary
pages in kmemcg so that we can reparent them.
> With slab this isn't a problem because slab has kmalloc kmem_caches for
> all supported allocation sizes. However, slub shows this issue for
> any kmalloc() allocations larger than 8KiB (at least on x86_64). It
> seems like a strange direction to take kmem accounting to say that
> kmalloc allocations are kmem limited, but only if they are either less
> than a threshold size or done with slab. Simply increasing the size
> of a data structure doesn't seem like it should automatically cause
> the memory to become exempt from kmem limits.
Sounds fair.
>> In fact, do we actually need to charge every random kmem allocation? I
>> guess not. For instance, filesystems often allocate data shared among
>> all the FS users. It's wrong to charge such allocations to a particular
>> memcg, IMO. That said the next step is going to be adding a per kmem
>> cache flag specifying if allocations from this cache should be charged
>> so that accounting will work only for those caches that are marked so
>> explicitly.
>
> It's a question of what direction to approach kmem slab accounting
> from: either opt-out (as the code currently is), or opt-in (with per
> kmem_cache flags as you suggest). I agree that some structures end up
> being shared (e.g. filesystem block bit map structures). In an
> opt-out system these are charged to a memcg initially and remain
> charged there until the memcg is deleted at which point the shared
> objects are reparented to a shared location. While this isn't
> perfect, it's unclear if it's better or worse than analyzing each
> class of allocation and deciding whether it should be opted in. One
> could (though I'm not) make the case that even dentries are easily
> shareable between containers and thus shouldn't be accounted to a
> single memcg. But given user space's ability to DoS a machine with
> dentries, they should be accounted.
Again, you're probably right. After a bit of thinking, I agree that
deciding which caches should be accounted and which shouldn't would be
cumbersome. Opt-out would be clearer, I guess.
>> There is one more argument for removing kmalloc_large accounting - we
>> don't have an easy way to track such allocations, which prevents us from
>> reparenting kmemcg charges on css offline. Of course, we could link
>> kmalloc_large pages in some sort of per-memcg list which would allow us
>> to find them on css offline, but I don't think such a complication is
>> justified.
>
> I assume that reparenting of such non kmem_cache allocations (e.g.
> large kmalloc) is difficult because such pages refer to the memcg,
> which we're trying to delete and the memcg has no index of such pages.
> If such zombie memcgs are undesirable, then an alternative to indexing
> the pages is to define a kmem context object which such large pages
> point to. The kmem context would be reparented without needing to
> adjust the individual large pages. But there are plenty of options.
I like the idea of a context object. For usual kmalloc'ed data, we
already have one - the kmem_cache itself. For non-generic kmem (e.g.
threadinfo pages), we could easily introduce a separate one carrying a
pointer to the owning memcg. Reparenting wouldn't be a problem at all
then.
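
Something like this, perhaps (pure sketch, all names invented for
illustration):

/*
 * A charged page would point at this context (wherever we currently
 * keep the page's memcg link) instead of at the memcg itself.
 */
struct kmem_context {
	struct mem_cgroup *memcg;	/* owning cgroup, swapped on reparent */
};

/*
 * Offlining a memcg then only has to update its contexts, not walk
 * every charged page (moving the res_counter charges is omitted here).
 */
static void kmem_context_reparent(struct kmem_context *ctx,
				  struct mem_cgroup *parent)
{
	ctx->memcg = parent;
}

One such context per memcg would be enough to cover threadinfo, sock
frags and similar non-generic allocations.
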
I guess I'll give it a try in the next iteration. Thank you!