From: Vlastimil Babka
To: Jann Horn, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton
Cc: Linux-MM, kernel list, Thomas Gleixner, Sebastian Andrzej Siewior, Roman Gushchin, Johannes Weiner, Shakeel Butt, Suren Baghdasaryan, Minchan Kim, Michal Hocko
Subject: Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?
Message-ID: <2f0f46e8-2535-410a-1859-e9cfa4e57c18@suse.cz>
Date: Wed, 13 Jan 2021 20:14:11 +0100

On 1/12/21 12:12 AM, Jann Horn wrote:
> [This is not something I intend to work on myself. But since I
> stumbled over this issue, I figured I should at least document/report
> it, in case anyone is willing to pick it up.]
>
> Hi!

Hi, thanks for saving me a lot of typing!

...
> This means that in practice, SLUB actually ends up keeping as many
> **pages** on the percpu partial lists as it intends to keep **free
> objects** there.

Yes, I concluded the same thing.

...

> I suspect that this may have also contributed to the memory wastage
> problem with memory cgroups that was fixed in v5.9
> (https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@fb.com/);
> meaning that servers with lots of CPU cores running pre-5.9 kernels
> with memcg and systemd (which tends to stick every service into its
> own memcg) might be even worse off.

Very much yes. Investigating an increase in kmemcg usage of a workload between
an older kernel with SLAB and a 5.3-based kernel with SLUB led us to the same
issue you found. It doesn't help that slabinfo (global or per-memcg) is also
inaccurate, as it cannot count free objects on per-cpu partial slabs and thus
reports them as active. I was aware that some empty slab pages might linger on
per-cpu lists, but only after seeing how many were freed by "echo 1 >
.../shrink" did I realize the extent of this.

> It also seems unsurprising to me that flushing ~30 pages out of the
> percpu partial caches at once with IRQs disabled would cause tail
> latency spikes (as noted by Joonsoo Kim and Christoph Lameter in
> commit 345c905d13a4e "slub: Make cpu partial slab support
> configurable").
>
> At first I thought that this wasn't a significant issue because SLUB
> has a reclaim path that can trim the percpu partial lists; but as it
> turns out, that reclaim path is not actually wired up to the page
> allocator's reclaim logic. The SLUB reclaim stuff is only triggered by
> (very rare) subsystem-specific calls into SLUB for specific slabs and
> by sysfs entries. So in userland, processes will OOM even if SLUB still
> has megabytes of entirely unused pages lying around.

Yeah, we considered wiring the shrinking to memcg OOM, but it's a poor
solution. I'm considering introducing a proper shrinker that would be
registered and work like other shrinkers for reclaimable caches. Then we would
make it memcg-aware in our backport - upstream after v5.9 doesn't need that
obviously.

> It might be a good idea to figure out whether it is possible to
> efficiently keep track of a more accurate count of the free objects on

As long as there are some inuse objects, it shouldn't matter much whether the
slab is sitting on a per-cpu partial list or the per-node list, as it can't be
freed anyway. It becomes a real problem only after the slab becomes fully free.
If we detected that in __slab_free() also for already-frozen slabs, we would
need to know which CPU the slab belongs to (currently that's not tracked
afaik), and send it an IPI to do some light version of unfreeze_partials()
that would only remove empty slabs. The trick would be not to cause too many
IPIs by this, obviously :/

Actually I'm somewhat wrong above. If a CPU's partial list and the per-node
partial list run out of free objects, it's wasteful to allocate new slabs
while almost-empty slabs sit on another CPU's per-cpu partial list.

> percpu partial lists; and if not, maybe change the accounting to
> explicitly track the number of partial pages, and use limits that are

That would probably be the simplest solution. Maybe sufficient upstream, where
the wastage only depends on the number of caches and not on memcgs.
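To illustrate what I mean, and only as an untested sketch against roughly
v5.10 mm/slub.c (not a real patch): put_cpu_partial() could compare a page
count against a page-based limit instead of the stale pobjects estimate.
s->cpu_partial_pages below is a made-up field; the rest follows the existing
function:

/*
 * Untested sketch: limit the per-cpu partial list by number of pages
 * instead of the free-object estimate. s->cpu_partial_pages is
 * hypothetical; other names follow mm/slub.c around v5.10.
 */
static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
{
#ifdef CONFIG_SLUB_CPU_PARTIAL
	struct page *oldpage;
	int pages;

	preempt_disable();
	do {
		pages = 0;
		oldpage = this_cpu_read(s->cpu_slab->partial);

		if (oldpage) {
			pages = oldpage->pages;
			/* hypothetical page-based limit instead of pobjects */
			if (drain && pages >= s->cpu_partial_pages) {
				unsigned long flags;
				/*
				 * Partial list is full. Move the existing
				 * set to the per-node partial list.
				 */
				local_irq_save(flags);
				unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
				local_irq_restore(flags);
				oldpage = NULL;
				pages = 0;
				stat(s, CPU_PARTIAL_DRAIN);
			}
		}

		pages++;
		page->pages = pages;
		page->next = oldpage;

	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
							!= oldpage);
	preempt_enable();
#endif	/* CONFIG_SLUB_CPU_PARTIAL */
}

The limit could perhaps be derived from the current object-based one, e.g.
cpu_partial divided by oo_objects(s->oo), so the sysfs interface wouldn't
have to change.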
For pre-5.9 I also considered limiting the number of pages only for the
per-memcg clones :/ Currently writing to the /sys/...//cpu_partial file is
propagated to all the clones and the root cache.

> more appropriate for that? And perhaps the page allocator reclaim path
> should also occasionally rip unused pages out of the percpu partial
> lists?

That would be best done by a shrinker? A rough idea of what I mean is sketched
at the end of this mail.

BTW, SLAB does this by reaping its per-cpu and shared arrays with timers
(which works, but is not ideal). Those arrays also can't grow as large as
this.
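For the shrinker direction, a very rough and untested sketch, ignoring
memcg-awareness: register a shrinker whose scan path walks slab_caches and
reuses the existing flushing in kmem_cache_shrink(). The count heuristic is
entirely made up, since as discussed we don't have an accurate free-object
count to report, and a real version would be far more selective:

/*
 * Rough sketch of hooking SLUB's existing flushing into the generic
 * shrinker API; would live in mm/slub.c or mm/slab_common.c where
 * slab_mutex/slab_caches are visible. The heuristics are hypothetical.
 */
#include <linux/shrinker.h>
#include <linux/slab.h>

static unsigned long slub_partial_count(struct shrinker *shrink,
					struct shrink_control *sc)
{
	/*
	 * A real implementation would estimate the free objects sitting
	 * on per-cpu partial lists; that isn't accurately tracked today,
	 * so return a token value just to get scan_objects() called under
	 * pressure (made-up heuristic).
	 */
	return 1024;
}

static unsigned long slub_partial_scan(struct shrinker *shrink,
				       struct shrink_control *sc)
{
	struct kmem_cache *s;

	/* Don't block reclaim on slab_mutex; bail out instead. */
	if (!mutex_trylock(&slab_mutex))
		return SHRINK_STOP;

	list_for_each_entry(s, &slab_caches, list) {
		/*
		 * kmem_cache_shrink() unfreezes per-cpu partial slabs and
		 * discards the empty ones. Heavy-handed; a real shrinker
		 * would also check sc->gfp_mask and report what it
		 * actually freed.
		 */
		kmem_cache_shrink(s);
	}
	mutex_unlock(&slab_mutex);

	return sc->nr_to_scan;
}

static struct shrinker slub_partial_shrinker = {
	.count_objects	= slub_partial_count,
	.scan_objects	= slub_partial_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* register_shrinker(&slub_partial_shrinker) from slab init code */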