From: Bharata B Rao <bharata@linux.ibm.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl@linux.com,
	rientjes@google.com, iamjoonsoo.kim@lge.com,
	akpm@linux-foundation.org, guro@fb.com, shakeelb@google.com,
	hannes@cmpxchg.org, aneesh.kumar@linux.ibm.com
Subject: Re: Higher slub memory consumption on 64K page-size systems?
Date: Wed, 11 Nov 2020 14:32:46 +0530
Message-ID: <20201111090246.GA1006690@in.ibm.com>
In-Reply-To: <5150e942-516b-83c8-8e52-e0f294138a71@suse.cz>

On Thu, Nov 05, 2020 at 05:47:03PM +0100, Vlastimil Babka wrote:
> On 10/28/20 6:50 AM, Bharata B Rao wrote:
> > slub_max_order
> > --------------
> > The most promising tunable that shows consistent reduction in slab memory
> > is slub_max_order. Here is a table that shows the number of slabs that
> > end up with different orders and the total slab consumption at boot
> > for different values of slub_max_order:
> > -------------------------------------------
> > slub_max_order	Order	NrSlabs	Slab memory
> > -------------------------------------------
> > 		0	276
> > 	3	1	16	207488 kB
> >      (default)	2	4
> > 		3	11
> > -------------------------------------------
> > 		0	276
> > 	2	1	16	166656 kB
> > 		2	4
> > -------------------------------------------
> > 		0	276
> > 	1	1	31	144128 kB
> > -------------------------------------------
> > 
> > Though only a few bigger sized caches fall into order-2 or order-3, they
> > seem to make a considerable difference to the overall slab consumption.
> > If we take task_struct cache as an example, this is how it ends up when
> > slub_max_order is varied:
> > 
> > task_struct, objsize=9856
> > --------------------------------------------
> > slub_max_order	objperslab	pagesperslab
> > --------------------------------------------
> > 3		53		8
> > 2		26		4
> > 1		13		2
> > --------------------------------------------
> > 
> > The slab page-order, and hence the number of objects in a slab, has a
> > bearing on performance, but I wonder if some caches like task_struct
> > above can be auto-tuned to fall into a conservative order and do well
> > wrt both memory and performance?
> 
> Hmm, ideally this should be based on objperslab, so that with larger page
> sizes the calculated order becomes smaller, even 0?

It is indeed based on the number of objects that can optimally fit
within a slab. As I explain below, currently we start with a minimum
objects value that ends up pushing the page order higher for some slab
size and page size combinations. The question is: can we start with a
more conservative (lower) value for min_objects in calculate_order()?
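
To make this concrete, below is a small user-space sketch (only a
simplified model of the starting point of the order search; the real
calculate_order() additionally caps the result at slub_max_order and
retries with smaller min_objects and leftover fractions) that shows how
the starting min_objects value alone maps task_struct-sized objects to a
slab order on 64K pages. The value 40 is an assumption corresponding to
nr_cpu_ids = 256:

#include <stdio.h>

#define PAGE_SHIFT	16			/* 64K pages */
#define PAGE_SIZE	(1U << PAGE_SHIFT)

/* Smallest order whose slab can hold min_objects objects of 'size' */
static unsigned int start_order(unsigned int size, unsigned int min_objects)
{
	unsigned int order = 0;

	while ((PAGE_SIZE << order) < min_objects * size)
		order++;
	return order;
}

int main(void)
{
	unsigned int size = 9856;	/* task_struct objsize from above */
	unsigned int min_objects;

	/* 40 corresponds to 4 * (fls(256) + 1), i.e. nr_cpu_ids = 256 */
	for (min_objects = 40; min_objects >= 10; min_objects -= 10)
		printf("min_objects=%2u -> order %u (%u pages/slab)\n",
		       min_objects, start_order(size, min_objects),
		       1U << start_order(size, min_objects));
	return 0;
}

With min_objects=40 this starts at order 3 (8 pages per slab), matching
the default task_struct numbers above, while lower starting values drop
it to order 2 or 1.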

> 
> > mm/slub.c:calculate_order() has the logic which determines the
> > page-order for the slab. It starts with min_objects and attempts
> > to arrive at the best configuration for the slab. min_objects
> > starts like this:
> > 
> > min_objects = 4 * (fls(nr_cpu_ids) + 1);
> > 
> > Here nr_cpu_ids depends on maxcpus, and hence this can have a
> > significant effect on those systems which define maxcpus. Slab numbers
> > post-boot for a KVM pseries guest that has 16 boot-time CPUs and varying
> > maxcpus values look like this:
> > -------------------------------
> > maxcpus		Slab memory(kB)
> > -------------------------------
> > 64		209280
> > 256		253824
> > 512		293824
> > -------------------------------
> 
> Yeah, IIRC nr_cpu_ids is related to the number of possible cpus, which is
> rather excessive on some systems, so a relation to the actually online cpus
> would make more sense.

Maybe I can send a patch to change the above calculation of
min_objects to be based on the number of online cpus and see how it is
received.
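
On the 16-CPU guest above that would give min_objects = 4 * (fls(16) + 1) = 24,
as against 44 when maxcpus=512 makes nr_cpu_ids 512. Something along these
lines, untested and only a sketch (and the value is still computed once at
cache creation, so CPUs onlined later would not be reflected):

--- a/mm/slub.c
+++ b/mm/slub.c
@@ ... @@ calculate_order
 	min_objects = slub_min_objects;
 	if (!min_objects)
-		min_objects = 4 * (fls(nr_cpu_ids) + 1);
+		min_objects = 4 * (fls(num_online_cpus()) + 1);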

> 
> > Page-order is a one-time setting and obviously can't be tweaked dynamically
> > on CPU hotplug, but I just wanted to bring out its effect.
> > 
> > And that constant multiplicative factor of 4 was in fact added by commit
> > 9b2cd506e5f2 ("slub: Calculate min_objects based on number of processors").
> > 
> > Reducing that to, say, 2 does give some reduction in slab memory while
> > retaining the same hackbench performance, but I am not sure if that could
> > be assumed to be beneficial for all scenarios.
> > 
> > MIN_PARTIAL
> > -----------
> > This determines the number of slabs left on the partial list even if they
> > are empty. My initial thought was that the default MIN_PARTIAL value of 5
> > is on the higher side and we are accumulating MIN_PARTIAL number of
> > empty slabs in all caches without freeing them. However, I hardly ever see
> > a case where an empty slab is retained during freeing on account of
> > partial slabs being fewer than MIN_PARTIAL.
> > 
> > However, what I find in practice is that we accumulate a lot of partial
> > slabs with just one in-use object in the whole slab. A high number of such
> > partial slabs does indeed contribute to the increased slab memory consumption.
> > 
> > For example, after a hackbench run, I find the distribution of objects
> > like this for kmalloc-2k cache:
> > 
> > total_objects		3168
> > objects			1611
> > Nr partial slabs	54
> > Nr partial slabs with
> > just 1 inuse object	38
> > 
> > With 64K page-size, so many partial slabs with just 1 in-use object can
> > result in high memory usage. Is there any workaround possible to prevent
> > this kind of situation?
> 
> Probably not, this is just the fundamental internal fragmentation problem:
> we can't predict which objects will have similar lifetimes and thus put
> them together. Larger pages just make the effect more pronounced. It
> would be wrong if we allocated new pages instead of reusing the partial
> ones, but that's not the case, IIUC?

Correct, that shouldn't be the case. I will check by adding some
instrumentation and ascertain whether that is indeed the case.
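
Just to put a number on the fragmentation in the kmalloc-2k example:
assuming 32 objects per order-0 64K slab (which matches total_objects =
3168 = 99 slabs * 32 above), the 38 partial slabs with a single in-use
object pin about 38 * 64 KB ~= 2.4 MB of slab memory while holding only
38 * 2 KB = 76 KB of live data.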

> 
> But you are measuring "after a hackbench run", so is that an important data
> point? If the system was in some kind of steady state workload, the pages
> would be better used I'd expect.

Maybe, I am not sure; we will have to check. I measured at two points:
immediately after boot as the initial state, and after a hackbench run as an
extreme state. I chose hackbench because I see that earlier changes to some of
this slab code and these tunables have been supported by hackbench numbers.

> 
> > cpu_partial
> > -----------
> > Here is how the slab consumption post-boot varies when all the slab
> > caches are forced to a fixed cpu_partial value:
> > ---------------------------
> > cpu_partial	Slab Memory
> > ---------------------------
> > 0		175872 kB
> > 2		187136 kB
> > 4		191616 kB
> > default		204864 kB
> > ---------------------------
> > 
> > It has been suggested earlier that reducing cpu_partial and/or making
> > cpu_partial 64K page-size aware would help. In set_cpu_partial(),
> > for bigger sized slabs (size > PAGE_SIZE), cpu_partial is already set
> > to 2. A bit of tweaking there to introduce cpu_partial=1 for certain
> > slabs does give some benefit.
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index a28ed9b8fc61..e09eff1199bf 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -3626,7 +3626,9 @@ static void set_cpu_partial(struct kmem_cache *s)
> >           */
> >          if (!kmem_cache_has_cpu_partial(s))
> >                  slub_set_cpu_partial(s, 0);
> > -       else if (s->size >= PAGE_SIZE)
> > +       else if (s->size >= 8192)
> > +               slub_set_cpu_partial(s, 1);
> > +       else if (s->size >= 4096)
> >                  slub_set_cpu_partial(s, 2);
> >          else if (s->size >= 1024)
> >                  slub_set_cpu_partial(s, 6);
> > 
> > With the above change, the slab consumption post-boot reduces to 186048 kB.
> 
> Yeah, making it agnostic to PAGE_SIZE makes sense.

Ok, let me send a separate patch for this.
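
For completeness, my understanding is that the resulting ladder in
set_cpu_partial() would then look roughly like this (only a sketch; the
two lowest branches keep their existing values, 13 and 30, if I am reading
the current code correctly):

	if (!kmem_cache_has_cpu_partial(s))
		slub_set_cpu_partial(s, 0);
	else if (s->size >= 8192)	/* was: s->size >= PAGE_SIZE */
		slub_set_cpu_partial(s, 1);
	else if (s->size >= 4096)
		slub_set_cpu_partial(s, 2);
	else if (s->size >= 1024)
		slub_set_cpu_partial(s, 6);
	else if (s->size >= 256)
		slub_set_cpu_partial(s, 13);
	else
		slub_set_cpu_partial(s, 30);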

Thanks for your inputs.

Regards,
Bharata.


