From: Vlastimil Babka
To: Jann Horn, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton
Cc: Linux-MM, kernel list, Thomas Gleixner, Sebastian Andrzej Siewior, Roman Gushchin, Johannes Weiner, Shakeel Butt, Suren Baghdasaryan, Minchan Kim, Michal Hocko
Subject: Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?
Message-ID: <2f0f46e8-2535-410a-1859-e9cfa4e57c18@suse.cz>
Date: Wed, 13 Jan 2021 20:14:11 +0100

On 1/12/21 12:12 AM, Jann Horn wrote:
> [This is not something I intend to work on myself. But since I
> stumbled over this issue, I figured I should at least document/report
> it, in case anyone is willing to pick it up.]
>
> Hi!

Hi, thanks for saving me a lot of typing!

...
> This means that in practice, SLUB actually ends up keeping as many
> **pages** on the percpu partial lists as it intends to keep **free
> objects** there.

Yes, I concluded the same thing.

...

> I suspect that this may have also contributed to the memory wastage
> problem with memory cgroups that was fixed in v5.9
> (https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@fb.com/);
> meaning that servers with lots of CPU cores running pre-5.9 kernels
> with memcg and systemd (which tends to stick every service into its
> own memcg) might be even worse off.

Very much yes. Investigating an increase in kmemcg usage of a workload between
an older kernel with SLAB and a 5.3-based kernel with SLUB led us to the same
issue you found. It doesn't help that slabinfo (global or per-memcg) is also
inaccurate, as it cannot count free objects on per-cpu partial slabs and thus
reports them as active. I was aware that some empty slab pages might linger on
per-cpu lists, but only after seeing how many were freed by "echo 1 >
.../shrink" did I realize the extent of this.

> It also seems unsurprising to me that flushing ~30 pages out of the
> percpu partial caches at once with IRQs disabled would cause tail
> latency spikes (as noted by Joonsoo Kim and Christoph Lameter in
> commit 345c905d13a4e "slub: Make cpu partial slab support
> configurable").
>
> At first I thought that this wasn't a significant issue because SLUB
> has a reclaim path that can trim the percpu partial lists; but as it
> turns out, that reclaim path is not actually wired up to the page
> allocator's reclaim logic. The SLUB reclaim stuff is only triggered by
> (very rare) subsystem-specific calls into SLUB for specific slabs and
> by sysfs entries. So in userland, processes will OOM even if SLUB still
> has megabytes of entirely unused pages lying around.

Yeah, we considered wiring the shrinking to memcg OOM, but it's a poor
solution. I'm considering introducing a proper shrinker that would be
registered and work like other shrinkers for reclaimable caches. Then we would
make it memcg-aware in our backport - upstream after v5.9 doesn't need that
obviously.

> It might be a good idea to figure out whether it is possible to
> efficiently keep track of a more accurate count of the free objects on

As long as there are some inuse objects, it shouldn't matter much whether the
slab is sitting on a per-cpu partial list or the per-node list, as it can't be
freed anyway. It becomes a real problem only after the slab becomes fully free.
If we detected that in __slab_free() also for already-frozen slabs, we would
need to know which CPU the slab belongs to (currently that's not tracked
afaik), and send it an IPI to do some light version of unfreeze_partials()
that would only remove empty slabs. The trick would be not to cause too many
IPIs by this, obviously :/

Actually I'm somewhat wrong above. If a CPU's partial list and the per-node
partial list run out of free objects, it's wasteful to allocate new slabs
while almost-empty slabs sit on another CPU's per-cpu partial list.

> percpu partial lists; and if not, maybe change the accounting to
> explicitly track the number of partial pages, and use limits that are

That would probably be the simplest solution. Maybe sufficient upstream, where
the wastage only depends on the number of caches and not on memcgs.
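To illustrate what I mean, and only as an untested sketch against roughly
v5.10 mm/slub.c (not a real patch): put_cpu_partial() could compare a page
count against a page-based limit instead of the stale pobjects estimate.
s->cpu_partial_pages below is a made-up field; the rest follows the existing
function:

/*
 * Untested sketch: limit the per-cpu partial list by number of pages
 * instead of the free-object estimate. s->cpu_partial_pages is
 * hypothetical; other names follow mm/slub.c around v5.10.
 */
static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
{
#ifdef CONFIG_SLUB_CPU_PARTIAL
	struct page *oldpage;
	int pages;

	preempt_disable();
	do {
		pages = 0;
		oldpage = this_cpu_read(s->cpu_slab->partial);

		if (oldpage) {
			pages = oldpage->pages;
			/* hypothetical page-based limit instead of pobjects */
			if (drain && pages >= s->cpu_partial_pages) {
				unsigned long flags;
				/*
				 * Partial list is full. Move the existing
				 * set to the per-node partial list.
				 */
				local_irq_save(flags);
				unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
				local_irq_restore(flags);
				oldpage = NULL;
				pages = 0;
				stat(s, CPU_PARTIAL_DRAIN);
			}
		}

		pages++;
		page->pages = pages;
		page->next = oldpage;

	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
							!= oldpage);
	preempt_enable();
#endif	/* CONFIG_SLUB_CPU_PARTIAL */
}

The limit could perhaps be derived from the current object-based one, e.g.
cpu_partial divided by oo_objects(s->oo), so the sysfs interface wouldn't
have to change.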
For pre-5.9 I also considered limiting the number of pages only for the
per-memcg clones :/ Currently writing to the /sys/...//cpu_partial file is
propagated to all the clones and the root cache.

> more appropriate for that? And perhaps the page allocator reclaim path
> should also occasionally rip unused pages out of the percpu partial
> lists?

That would be best done by a shrinker? A rough idea of what I mean is sketched
at the end of this mail.

BTW, SLAB does this by reaping its per-cpu and shared arrays with timers
(which works, but is not ideal). Those arrays also can't grow as large as
this.
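For the shrinker direction, a very rough and untested sketch, ignoring
memcg-awareness: register a shrinker whose scan path walks slab_caches and
reuses the existing flushing in kmem_cache_shrink(). The count heuristic is
entirely made up, since as discussed we don't have an accurate free-object
count to report, and a real version would be far more selective:

/*
 * Rough sketch of hooking SLUB's existing flushing into the generic
 * shrinker API; would live in mm/slub.c or mm/slab_common.c where
 * slab_mutex/slab_caches are visible. The heuristics are hypothetical.
 */
#include <linux/shrinker.h>
#include <linux/slab.h>

static unsigned long slub_partial_count(struct shrinker *shrink,
					struct shrink_control *sc)
{
	/*
	 * A real implementation would estimate the free objects sitting
	 * on per-cpu partial lists; that isn't accurately tracked today,
	 * so return a token value just to get scan_objects() called under
	 * pressure (made-up heuristic).
	 */
	return 1024;
}

static unsigned long slub_partial_scan(struct shrinker *shrink,
				       struct shrink_control *sc)
{
	struct kmem_cache *s;

	/* Don't block reclaim on slab_mutex; bail out instead. */
	if (!mutex_trylock(&slab_mutex))
		return SHRINK_STOP;

	list_for_each_entry(s, &slab_caches, list) {
		/*
		 * kmem_cache_shrink() unfreezes per-cpu partial slabs and
		 * discards the empty ones. Heavy-handed; a real shrinker
		 * would also check sc->gfp_mask and report what it
		 * actually freed.
		 */
		kmem_cache_shrink(s);
	}
	mutex_unlock(&slab_mutex);

	return sc->nr_to_scan;
}

static struct shrinker slub_partial_shrinker = {
	.count_objects	= slub_partial_count,
	.scan_objects	= slub_partial_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* register_shrinker(&slub_partial_shrinker) from slab init code */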