From: Vlastimil Babka
To: bharata@linux.ibm.com, linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cl@linux.com, rientjes@google.com,
 iamjoonsoo.kim@lge.com, akpm@linux-foundation.org, guro@fb.com,
 shakeelb@google.com, hannes@cmpxchg.org, aneesh.kumar@linux.ibm.com
Subject: Re: Higher slub memory consumption on 64K page-size systems?
Date: Thu, 5 Nov 2020 17:47:03 +0100
Message-ID: <5150e942-516b-83c8-8e52-e0f294138a71@suse.cz>
In-Reply-To: <20201028055030.GA362097@in.ibm.com>

On 10/28/20 6:50 AM, Bharata B Rao wrote:
> slub_max_order
> --------------
> The most promising tunable that shows consistent reduction in slab
> memory is slub_max_order.
> Here is a table that shows the number of slabs that end up with
> different orders, and the total slab consumption at boot, for
> different values of slub_max_order:
>
> -----------------------------------------------
> slub_max_order   Order   NrSlabs   Slab memory
> -----------------------------------------------
>                    0       276
>       3            1        16      207488 kB
>  (default)         2         4
>                    3        11
> -----------------------------------------------
>                    0       276
>       2            1        16      166656 kB
>                    2         4
> -----------------------------------------------
>                    0       276      144128 kB
>       1            1        31
> -----------------------------------------------
>
> Though only a few bigger-sized caches fall into order-2 or order-3,
> they seem to make a considerable difference to the overall slab
> consumption. If we take the task_struct cache as an example, this is
> how it ends up when slub_max_order is varied:
>
> task_struct, objsize=9856
> --------------------------------------------
> slub_max_order   objperslab   pagesperslab
> --------------------------------------------
>       3              53             8
>       2              26             4
>       1              13             2
> --------------------------------------------
>
> The slab page-order, and hence the number of objects in a slab, has a
> bearing on performance, but I wonder if some caches like task_struct
> above could be auto-tuned to fall into a conservative order and do
> well wrt both memory and performance?

Hmm, ideally this should be based on objperslab, so that with larger
page sizes the calculated order becomes smaller, even 0? (A quick
back-of-the-envelope check of the quoted task_struct numbers is
sketched at the end of this mail.)

> mm/slub.c:calculate_order() has the logic which determines the
> page-order for the slab. It starts with min_objects and attempts to
> arrive at the best configuration for the slab. min_objects starts
> out like this:
>
> min_objects = 4 * (fls(nr_cpu_ids) + 1);
>
> Here nr_cpu_ids depends on maxcpus, and hence this can have a
> significant effect on those systems which define maxcpus. Slab
> numbers post-boot for a KVM pseries guest that has 16 boot-time CPUs
> and a varying number of maxcpus look like this:
>
> -------------------------------
> maxcpus   Slab memory (kB)
> -------------------------------
>    64          209280
>   256          253824
>   512          293824
> -------------------------------

Yeah, IIRC nr_cpu_ids is related to the number of possible CPUs, which
is rather excessive on some systems, so a relation to the actually
online CPUs would make more sense. (A second sketch at the end of this
mail shows how min_objects scales with nr_cpu_ids.)

> Page-order is a one-time setting and obviously can't be tweaked
> dynamically on CPU hotplug, but I just wanted to bring out its
> effect.
>
> And that constant multiplicative factor of 4 was in fact added by
> commit 9b2cd506e5f2 ("slub: Calculate min_objects based on number of
> processors.").
>
> Reducing that to, say, 2 does give some reduction in slab memory
> while keeping the same hackbench performance, but I am not sure that
> can be assumed to be beneficial for all scenarios.
>
> MIN_PARTIAL
> -----------
> This determines the number of slabs left on the partial list even if
> they are empty. My initial thought was that the default MIN_PARTIAL
> value of 5 is on the higher side and that we are accumulating
> MIN_PARTIAL empty slabs in all caches without freeing them. However,
> I hardly find a case where an empty slab is retained during freeing
> on account of the number of partial slabs being below MIN_PARTIAL.
>
> What I do find in practice is that we are accumulating a lot of
> partial slabs with just one in-use object in the whole slab. A high
> number of such partial slabs is indeed contributing to the increased
> slab memory consumption.
>
> For example, after a hackbench run, I find the distribution of
> objects like this for the kmalloc-2k cache:
>
> total_objects             3168
> objects                   1611
> Nr partial slabs            54
> Nr partial slabs with
>  just 1 in-use object       38
>
> With 64K page-size, so many partial slabs with just one in-use object
> can result in high memory usage. Is there any workaround possible to
> prevent this kind of situation?

Probably not, this is just the fundamental internal fragmentation
problem: we can't predict which objects will have a similar lifetime
and thus put them together. Larger pages just make the effect more
pronounced. It would be wrong if we allocated new pages instead of
reusing the partial ones, but that's not the case, IIUC?

But you are measuring "after a hackbench run", so is that an important
data point? If the system was running some kind of steady-state
workload, I'd expect the pages to be better used. (A third sketch at
the end of this mail puts a rough number on the memory pinned by those
single-object slabs.)

> cpu_partial
> -----------
> Here is how the slab consumption post-boot varies when all the slab
> caches are forced with a fixed cpu_partial value:
>
> ---------------------------
> cpu_partial   Slab Memory
> ---------------------------
>      0         175872 kB
>      2         187136 kB
>      4         191616 kB
>   default      204864 kB
> ---------------------------
>
> It has been suggested earlier that reducing cpu_partial and/or making
> cpu_partial 64K page-size aware would benefit. In set_cpu_partial(),
> for bigger-sized slabs (size > PAGE_SIZE), cpu_partial is already set
> to 2. A bit of tweaking there to introduce cpu_partial=1 for certain
> slabs does give some benefit:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index a28ed9b8fc61..e09eff1199bf 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3626,7 +3626,9 @@ static void set_cpu_partial(struct kmem_cache *s)
>  	 */
>  	if (!kmem_cache_has_cpu_partial(s))
>  		slub_set_cpu_partial(s, 0);
> -	else if (s->size >= PAGE_SIZE)
> +	else if (s->size >= 8192)
> +		slub_set_cpu_partial(s, 1);
> +	else if (s->size >= 4096)
>  		slub_set_cpu_partial(s, 2);
>  	else if (s->size >= 1024)
>  		slub_set_cpu_partial(s, 6);
>
> With the above change, the slab consumption post-boot reduces to
> 186048 kB.

Yeah, making it agnostic to PAGE_SIZE makes sense.

> Also, here are the hackbench numbers with and w/o the above change:
>
> Average of 10 runs of 'hackbench -s 1024 -l 200 -g 200 -f 25 -P',
> with slab consumption captured at the end of each run:
> --------------------------------------------------------------
>             Time       Slab memory
> --------------------------------------------------------------
> Default    11.124s      645580 kB
> Patched    11.032s      584352 kB
> --------------------------------------------------------------
>
> I have mostly looked at reducing the slab memory consumption here,
> but I do understand that the default tunable values have been arrived
> at based on some benchmark numbers. What I would like to understand
> and explore is whether there are ways to reduce the slub memory
> consumption while keeping the existing level of performance.
>
> Regards,
> Bharata.
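
A few back-of-the-envelope sketches, as referenced above. First, the
task_struct objperslab numbers quoted earlier follow directly from the
slab size implied by the order: with 64K base pages, an order-N slab
holds (64K << N) / objsize objects. A minimal userspace sketch (object
size and page size taken from the quoted table; SLUB's real calculation
also accounts for alignment and metadata, which this ignores):

/* Objects per slab for task_struct (objsize 9856) on a 64K page-size
 * system, for the orders quoted above. Ignores SLUB's alignment and
 * metadata details. */
#include <stdio.h>

int main(void)
{
	const unsigned long page_size = 64 * 1024;	/* 64K base page */
	const unsigned long objsize = 9856;		/* task_struct */

	for (unsigned int order = 1; order <= 3; order++) {
		unsigned long slab_bytes = page_size << order;

		printf("order %u: %2lu pages, %2lu objects, %4lu bytes left over\n",
		       order, slab_bytes / page_size,
		       slab_bytes / objsize, slab_bytes % objsize);
	}
	return 0;
}

This reproduces the quoted 13/26/53 objperslab and 2/4/8 pagesperslab
figures for slub_max_order 1/2/3.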
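
Second, the effect of maxcpus on the starting min_objects can be seen
by evaluating the quoted formula directly; fls() is the kernel's "find
last set" (the 1-based position of the most significant set bit). A
sketch, assuming nr_cpu_ids simply follows maxcpus:

/* How SLUB's starting min_objects grows with nr_cpu_ids, per the
 * quoted formula: min_objects = 4 * (fls(nr_cpu_ids) + 1). */
#include <stdio.h>

static int fls(unsigned int x)
{
	return x ? 32 - __builtin_clz(x) : 0;	/* 1-based, as in the kernel */
}

int main(void)
{
	const unsigned int nr_cpu_ids[] = { 16, 64, 256, 512 };

	for (int i = 0; i < 4; i++)
		printf("nr_cpu_ids=%3u -> min_objects=%d\n",
		       nr_cpu_ids[i], 4 * (fls(nr_cpu_ids[i]) + 1));
	return 0;
}

Going from maxcpus=64 to maxcpus=512 only raises min_objects from 32 to
44, but with 64K pages even a small bump like that can be enough to
push some caches into the next page-order.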
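
Third, a rough number on what the quoted kmalloc-2k figures mean in
pinned memory. This assumes kmalloc-2k uses order-0 slabs on a 64K
page-size system, i.e. one 64K page holding 32 objects per slab; that
order is an assumption, since the mail doesn't state it:

/* Memory pinned by the 38 partial slabs holding a single in-use
 * object, from the kmalloc-2k numbers quoted above.
 * ASSUMPTION: order-0 slabs, one 64K page each. */
#include <stdio.h>

int main(void)
{
	const unsigned long slab_bytes = 64 * 1024;	/* assumed order-0 slab */
	const unsigned long objsize = 2048;
	const unsigned long nr_single_object_slabs = 38;

	unsigned long pinned = nr_single_object_slabs * slab_bytes;
	unsigned long used = nr_single_object_slabs * objsize;

	printf("pinned %lu kB for %lu kB of live objects (%.1f%% utilized)\n",
	       pinned / 1024, used / 1024, 100.0 * used / pinned);
	return 0;
}

That is roughly 2.4 MB pinned for 76 kB of live data, which illustrates
why larger base pages make this kind of fragmentation so much more
visible.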