From: Vlastimil Babka
To: linux-mm@kvack.org, Christoph Lameter, David Rientjes, Joonsoo Kim,
    Pekka Enberg, Jann Horn
Cc: linux-kernel@vger.kernel.org, Roman Gushchin, Vlastimil Babka
Subject: [RFC PATCH] mm, slub: change percpu partial accounting from objects to pages
Date: Mon, 13 Sep 2021 19:01:48 +0200
Message-Id: <20210913170148.10992-1-vbabka@suse.cz>
X-Mailer: git-send-email 2.33.0
MIME-Version: 1.0

With CONFIG_SLUB_CPU_PARTIAL enabled, SLUB keeps a percpu list of partial
slabs that can be promoted to the cpu slab when the previous one is depleted,
without accessing the shared partial list. A slab can be added to this list
by 1) refill of an empty list from get_partial_node() - once we really have
to access the shared partial list, we acquire multiple slabs to amortize the
cost of locking, and 2) first free to a previously full slab - instead of
putting the slab on a shared partial list, we can more cheaply freeze it and
put it on the per-cpu list.

To control how large a percpu partial list can grow for a kmem cache,
set_cpu_partial() calculates a target number of free objects on each cpu's
percpu partial list, and this can also be set via the sysfs file cpu_partial.
However, the tracking of the actual number of objects is imprecise, in order
to limit the overhead of cpu X freeing an object to a slab on the percpu
partial list of cpu Y. Basically, the percpu partial slabs form a singly
linked list, and when we add a new slab to the list with current head
"oldpage", we set in the struct page of the slab we're adding:

  page->pages = oldpage->pages + 1; // this is precise
  page->pobjects = oldpage->pobjects + (page->objects - page->inuse);
  page->next = oldpage;

Thus the real number of free objects in the slab (objects - inuse) is only
determined at the moment of adding the slab to the percpu partial list, and
further freeing doesn't update the pobjects counter nor propagate it to the
current list head.
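To make the staleness concrete, here is a minimal stand-alone sketch (plain
userspace C, not kernel code; the fake_slab struct and the 32-object slab are
illustrative assumptions, not the real struct page or any particular cache)
that replays case 2) above and shows the recorded pobjects never catching up
with later frees:

  /* Stand-alone illustration of the stale pobjects count (not kernel code). */
  #include <stdio.h>

  struct fake_slab {              /* stands in for the relevant struct page fields */
          int objects;            /* total objects in the slab */
          int inuse;              /* allocated objects */
          int pages;              /* precise: length of the percpu partial list */
          int pobjects;           /* approximate: free objects on the list */
          struct fake_slab *next;
  };

  /* Mimics the accounting quoted above from put_cpu_partial(). */
  static void push_partial(struct fake_slab **head, struct fake_slab *slab)
  {
          struct fake_slab *oldpage = *head;

          slab->pages = (oldpage ? oldpage->pages : 0) + 1;
          slab->pobjects = (oldpage ? oldpage->pobjects : 0) +
                           (slab->objects - slab->inuse);
          slab->next = oldpage;
          *head = slab;
  }

  int main(void)
  {
          /* Case 2): first free to a previously full 32-object slab. */
          struct fake_slab slab = { .objects = 32, .inuse = 31 };
          struct fake_slab *partial = NULL;

          push_partial(&partial, &slab);
          printf("recorded: pages=%d pobjects=%d\n",
                 partial->pages, partial->pobjects);

          /* All remaining objects are freed later; pobjects never learns. */
          slab.inuse = 0;
          printf("recorded pobjects=%d, real free objects=%d\n",
                 partial->pobjects, slab.objects - slab.inuse);
          return 0;
  }

With these assumed numbers the list records 1 free object while the slab
actually holds 32, which is exactly the drift described next.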
As Jann reports [1], this can easily lead to large inaccuracies, where the
target number of objects (up to 30 by default) can translate to the same
number of (empty) slab pages on the list. In case 2) above, we put a slab
with 1 free object on the list, thus only increasing page->pobjects by 1,
even if there are subsequent frees on the same slab. Jann noticed this in
practice, and so did we [2], when investigating a significant increase of
kmemcg usage after switching from SLAB to SLUB.

While this is no longer a problem in the kmemcg context thanks to the
accounting rewrite in 5.9, the memory waste is still not ideal, and it's
questionable whether it makes sense to base the control on free object
counts when those counts can so easily become inaccurate. So this patch
converts the accounting to be based on the number of pages only (which is
precise) and removes the page->pobjects field completely. This is also
ultimately simpler.

To retain the existing set_cpu_partial() heuristic, first calculate the
target number of objects as previously, but then convert it to a target
number of pages by assuming the pages will be half-filled on average. This
assumption might obviously also be inaccurate in practice, but it cannot
degrade to the actual number of pages being equal to the target number of
objects. We could also skip the intermediate step with the target number of
objects and rewrite the heuristic in terms of pages. However, we still have
the sysfs file cpu_partial, which uses a number of objects and could break
existing users if it suddenly became a number of pages, so this patch
doesn't do that.

In practice, after this patch the heuristics limit the size of the percpu
partial list to at most 2 pages. In case of a reported regression (which
would mean some workload has benefited from the previous imprecise
object-based counting), we can tune the heuristics to get a better
compromise within the new scheme, while still avoiding unexpectedly long
percpu partial lists.
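To make the conversion concrete, the following stand-alone sketch (plain
userspace C; DIV_ROUND_UP is redefined locally and the objects-per-slab
values are assumed examples, not taken from any particular cache) runs the
new objects-to-pages formula over the set_cpu_partial() targets:

  /* Stand-alone sketch of the objects -> pages conversion with assumed values. */
  #include <stdio.h>

  #define DIV_ROUND_UP(n, d)      (((n) + (d) - 1) / (d))

  int main(void)
  {
          static const struct {
                  unsigned int nr_objects;    /* target from set_cpu_partial() */
                  unsigned int objs_per_slab; /* assumed oo_objects(s->oo) */
          } caches[] = {
                  { 30, 32 },     /* e.g. small objects, 32 per slab (assumed) */
                  { 13, 16 },     /* e.g. ~256B objects (assumed) */
                  {  6,  8 },     /* e.g. ~1KB objects (assumed) */
                  {  2,  8 },     /* e.g. objects >= PAGE_SIZE (assumed) */
          };

          for (unsigned int i = 0; i < sizeof(caches) / sizeof(caches[0]); i++) {
                  /* Assume each listed page is on average half-full. */
                  unsigned int nr_pages = DIV_ROUND_UP(caches[i].nr_objects * 2,
                                                       caches[i].objs_per_slab);

                  printf("target %2u objects, %2u objs/slab -> limit %u page(s)\n",
                         caches[i].nr_objects, caches[i].objs_per_slab, nr_pages);
          }
          return 0;
  }

With these assumed per-slab object counts the resulting limits come out at
1-2 pages per cpu, matching the "at most 2 pages" figure above.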
[1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/
[2] https://lore.kernel.org/all/2f0f46e8-2535-410a-1859-e9cfa4e57c18@suse.cz/

Reported-by: Jann Horn
Signed-off-by: Vlastimil Babka
---
 include/linux/mm_types.h |  2 -
 include/linux/slub_def.h | 13 +-----
 mm/slub.c                | 89 ++++++++++++++++++++++++++--------------
 3 files changed, 61 insertions(+), 43 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7f8ee09c711f..68ffa064b7a8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -124,10 +124,8 @@ struct page {
 			struct page *next;
 #ifdef CONFIG_64BIT
 			int pages;	/* Nr of pages left */
-			int pobjects;	/* Approximate count */
 #else
 			short int pages;
-			short int pobjects;
 #endif
 		};
 	};
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 85499f0586b0..0fa751b946fa 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -99,6 +99,8 @@ struct kmem_cache {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	/* Number of per cpu partial objects to keep around */
 	unsigned int cpu_partial;
+	/* Number of per cpu partial pages to keep around */
+	unsigned int cpu_partial_pages;
 #endif
 	struct kmem_cache_order_objects oo;
 
@@ -141,17 +143,6 @@ struct kmem_cache {
 	struct kmem_cache_node *node[MAX_NUMNODES];
 };
 
-#ifdef CONFIG_SLUB_CPU_PARTIAL
-#define slub_cpu_partial(s)		((s)->cpu_partial)
-#define slub_set_cpu_partial(s, n)	\
-({					\
-	slub_cpu_partial(s) = (n);	\
-})
-#else
-#define slub_cpu_partial(s)		(0)
-#define slub_set_cpu_partial(s, n)
-#endif /* CONFIG_SLUB_CPU_PARTIAL */
-
 #ifdef CONFIG_SYSFS
 #define SLAB_SUPPORTS_SYSFS
 void sysfs_slab_unlink(struct kmem_cache *);
diff --git a/mm/slub.c b/mm/slub.c
index 3d2025f7163b..3757f31c5d97 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -414,6 +414,29 @@ static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
 	return x.x & OO_MASK;
 }
 
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
+{
+	unsigned int nr_pages;
+
+	s->cpu_partial = nr_objects;
+
+	/*
+	 * We take the number of objects but actually limit the number of
+	 * pages on the per cpu partial list, in order to limit excessive
+	 * growth of the list. For simplicity we assume that the pages will
+	 * be half-full.
+	 */
+	nr_pages = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
+	s->cpu_partial_pages = nr_pages;
+}
+#else
+static inline void
+slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
+{
+}
+#endif /* CONFIG_SLUB_CPU_PARTIAL */
+
 /*
  * Per slab locking using the pagelock
  */
@@ -2045,7 +2068,7 @@ static inline void remove_partial(struct kmem_cache_node *n,
  */
 static inline void *acquire_slab(struct kmem_cache *s,
 		struct kmem_cache_node *n, struct page *page,
-		int mode, int *objects)
+		int mode)
 {
 	void *freelist;
 	unsigned long counters;
@@ -2061,7 +2084,6 @@ static inline void *acquire_slab(struct kmem_cache *s,
 	freelist = page->freelist;
 	counters = page->counters;
 	new.counters = counters;
-	*objects = new.objects - new.inuse;
 	if (mode) {
 		new.inuse = page->objects;
 		new.freelist = NULL;
@@ -2099,9 +2121,8 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 {
 	struct page *page, *page2;
 	void *object = NULL;
-	unsigned int available = 0;
 	unsigned long flags;
-	int objects;
+	unsigned int partial_pages = 0;
 
 	/*
	 * Racy check. If we mistakenly see no partial slabs then we
@@ -2119,11 +2140,10 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 		if (!pfmemalloc_match(page, gfpflags))
 			continue;
 
-		t = acquire_slab(s, n, page, object == NULL, &objects);
+		t = acquire_slab(s, n, page, object == NULL);
 		if (!t)
 			break;
 
-		available += objects;
 		if (!object) {
 			*ret_page = page;
 			stat(s, ALLOC_FROM_PARTIAL);
@@ -2131,10 +2151,15 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 		} else {
 			put_cpu_partial(s, page, 0);
 			stat(s, CPU_PARTIAL_NODE);
+			partial_pages++;
 		}
+#ifdef CONFIG_SLUB_CPU_PARTIAL
 		if (!kmem_cache_has_cpu_partial(s)
-			|| available > slub_cpu_partial(s) / 2)
+			|| partial_pages > s->cpu_partial_pages / 2)
 			break;
+#else
+		break;
+#endif
 
 	}
 	spin_unlock_irqrestore(&n->list_lock, flags);
@@ -2539,14 +2564,13 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 	struct page *page_to_unfreeze = NULL;
 	unsigned long flags;
 	int pages = 0;
-	int pobjects = 0;
 
 	local_lock_irqsave(&s->cpu_slab->lock, flags);
 
 	oldpage = this_cpu_read(s->cpu_slab->partial);
 
 	if (oldpage) {
-		if (drain && oldpage->pobjects > slub_cpu_partial(s)) {
+		if (drain && oldpage->pages >= s->cpu_partial_pages) {
 			/*
			 * Partial array is full. Move the existing set to the
			 * per node partial list. Postpone the actual unfreezing
@@ -2555,16 +2579,13 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 			page_to_unfreeze = oldpage;
 			oldpage = NULL;
 		} else {
-			pobjects = oldpage->pobjects;
 			pages = oldpage->pages;
 		}
 	}
 
 	pages++;
-	pobjects += page->objects - page->inuse;
 
 	page->pages = pages;
-	page->pobjects = pobjects;
 	page->next = oldpage;
 
 	this_cpu_write(s->cpu_slab->partial, page);
@@ -3980,6 +4001,8 @@ static void set_min_partial(struct kmem_cache *s, unsigned long min)
 static void set_cpu_partial(struct kmem_cache *s)
 {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
+	unsigned int nr_objects;
+
 	/*
	 * cpu_partial determined the maximum number of objects kept in the
	 * per cpu partial lists of a processor.
@@ -3989,24 +4012,22 @@ static void set_cpu_partial(struct kmem_cache *s)
	 * filled up again with minimal effort. The slab will never hit the
	 * per node partial lists and therefore no locking will be required.
	 *
-	 * This setting also determines
-	 *
-	 * A) The number of objects from per cpu partial slabs dumped to the
-	 *    per node list when we reach the limit.
-	 * B) The number of objects in cpu partial slabs to extract from the
-	 *    per node list when we run out of per cpu objects. We only fetch
-	 *    50% to keep some capacity around for frees.
+	 * For backwards compatibility reasons, this is determined as number
+	 * of objects, even though we now limit maximum number of pages, see
+	 * slub_set_cpu_partial()
	 */
	if (!kmem_cache_has_cpu_partial(s))
-		slub_set_cpu_partial(s, 0);
+		nr_objects = 0;
	else if (s->size >= PAGE_SIZE)
-		slub_set_cpu_partial(s, 2);
+		nr_objects = 2;
	else if (s->size >= 1024)
-		slub_set_cpu_partial(s, 6);
+		nr_objects = 6;
	else if (s->size >= 256)
-		slub_set_cpu_partial(s, 13);
+		nr_objects = 13;
	else
-		slub_set_cpu_partial(s, 30);
+		nr_objects = 30;
+
+	slub_set_cpu_partial(s, nr_objects);
 #endif
 }
 
@@ -5379,7 +5400,12 @@ SLAB_ATTR(min_partial);
 
 static ssize_t cpu_partial_show(struct kmem_cache *s, char *buf)
 {
-	return sysfs_emit(buf, "%u\n", slub_cpu_partial(s));
+	unsigned int nr_partial = 0;
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+	nr_partial = s->cpu_partial;
+#endif
+
+	return sysfs_emit(buf, "%u\n", nr_partial);
 }
 
 static ssize_t cpu_partial_store(struct kmem_cache *s, const char *buf,
@@ -5450,12 +5476,12 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
 
 		page = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
 
-		if (page) {
+		if (page)
 			pages += page->pages;
-			objects += page->pobjects;
-		}
 	}
 
+	/* Approximate half-full pages , see slub_set_cpu_partial() */
+	objects = (pages * oo_objects(s->oo)) / 2;
 	len += sysfs_emit_at(buf, len, "%d(%d)", objects, pages);
 
 #ifdef CONFIG_SMP
@@ -5463,9 +5489,12 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
 		struct page *page;
 
 		page = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
-		if (page)
+		if (page) {
+			pages = READ_ONCE(page->pages);
+			objects = (pages * oo_objects(s->oo)) / 2;
 			len += sysfs_emit_at(buf, len, " C%d=%d(%d)",
-				     cpu, page->pobjects, page->pages);
+				     cpu, objects, pages);
+		}
 	}
 #endif
 	len += sysfs_emit_at(buf, len, "\n");
-- 
2.33.0