From: Vlastimil Babka <vbabka@suse.cz>
Date: Mon, 23 Oct 2023 17:46:13 +0200
Subject: Re: [RFC PATCH v2 0/6] slub: Delay freezing of CPU partial slabs
To: chengming.zhou@linux.dev, cl@linux.com, penberg@kernel.org
Cc: rientjes@google.com, iamjoonsoo.kim@lge.com, akpm@linux-foundation.org,
 roman.gushchin@linux.dev, 42.hyeyoo@gmail.com, willy@infradead.org,
 pcc@google.com, tytso@mit.edu, maz@kernel.org, ruansy.fnst@fujitsu.com,
 vishal.moola@gmail.com, lrh2000@pku.edu.cn, hughd@google.com,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chengming Zhou
Message-ID: <4134b039-fa99-70cd-3486-3d0c7632e4a3@suse.cz>
In-Reply-To: <20231021144317.3400916-1-chengming.zhou@linux.dev>
References: <20231021144317.3400916-1-chengming.zhou@linux.dev>
Content-Type: text/plain; charset=UTF-8
On 10/21/23 16:43, chengming.zhou@linux.dev wrote:
> From: Chengming Zhou

Hi!

> Changes in RFC v2:
> - Reuse PG_workingset bit to keep track of whether slub is on the
>   per-node partial list, as suggested by Matthew Wilcox.
> - Fix OOM problem on kernels without CONFIG_SLUB_CPU_PARTIAL, which
>   was caused by a leak of partial slabs in get_partial_node().
> - Add a patch to simplify acquire_slab().
> - Reorder patches a little.
> - v1: https://lore.kernel.org/all/20231017154439.3036608-1-chengming.zhou@linux.dev/
>
> 1. Problem
> ==========
> Currently we have to freeze a slab when taking it from the node partial
> list, and unfreeze it when putting it back, because we rely on the node
> list_lock to synchronize changes of the "frozen" bit.
>
> This implementation has some drawbacks:
>
> - Alloc path: two cmpxchg_double operations.
>   When the allocator has used up its CPU partial slabs, it has to get
>   some partial slabs from the node. So it freezes each slab (one
>   cmpxchg_double) with the node list_lock held, and puts those frozen
>   slabs on its CPU partial list. Later, ___slab_alloc() has to run the
>   cmpxchg_double try-loop again if that slab is picked for use.
>
> - Alloc path: amplified contention on the node list_lock.
>   Since the "frozen" bit changes have to be synchronized under the node
>   list_lock, contention on a slab (struct page) can be transferred to
>   the node list_lock. On a machine with many CPUs in one node, the
>   list_lock contention is amplified by every CPU's alloc path.
>
>   The current code works around this problem by avoiding the
>   cmpxchg_double try-loop: it just breaks and returns when contention
>   on the page is encountered and the first cmpxchg_double fails. But
>   this workaround has its own problem.

I'd note here: For more context, see 9b1ea29bc0d7 ("Revert "mm, slub:
consider rest of partial list if acquire_slab() fails"")

> - Free path: redundant unfreeze.
>   __slab_free() will freeze and cache some slabs on its partial list,
>   and flush them to the node partial list when their number exceeds the
>   limit, which has to unfreeze those slabs again under the node
>   list_lock. Actually, we don't need to freeze slabs on the CPU partial
>   list at all, in which case we can save the unfreeze cmpxchg_double
>   operations in the flush path.
>
> 2. Solution
> ===========
> We solve these problems by leaving slabs unfrozen when they are moved
> off the node partial list and onto a CPU partial list, so the "frozen"
> bit stays 0.
>
> These partial slabs won't be manipulated concurrently by the alloc
> path; the only racer is the free path, which may manipulate the slab's
> list when !inuse. So we need to introduce another way to synchronize
> against it: we reuse PG_workingset to keep track of whether the slab is
> on the node partial list or not, and only in that case may the slab's
> list be manipulated.
>
> The slab will now be frozen lazily, when it's picked for active use by
> the CPU and becomes full at the same time, in which case we still need
> the "frozen" bit to prevent its list from being manipulated. So a slab
> is frozen only when it's activated for use and unfrozen only when it's
> deactivated.

Interesting solution! I wonder if we could go a bit further and remove
acquire_slab() completely. Because AFAICS, even after your changes,
acquire_slab() is still attempted, including freezing the slab, which
means still doing a cmpxchg_double under the list_lock, and now also
handling the special case when it fails but we have at least filled the
percpu partial lists. What if we only filled the partial list without
freezing, and then froze the first slab outside of the list_lock? Or
more precisely, instead of returning the acquired "object" we would
return the first slab removed from the partial list. I think it would
simplify the code a bit, and further reduce list_lock holding times.

I'll also point out a few more details, but it's not a full detailed
review, as the suggestion above (and another one for 4/5) could mean a
rather significant change for v3. Thanks!

> 3. Testing
> ==========
> We just did some simple testing on a server with 128 CPUs (2 nodes) to
> compare performance for now.
>
> - perf bench sched messaging -g 5 -t -l 100000
>       baseline        RFC
>       7.042s          6.966s
>       7.022s          7.045s
>       7.054s          6.985s
>
> - stress-ng --rawpkt 128 --rawpkt-ops 100000000
>       baseline        RFC
>       2.42s           2.15s
>       2.45s           2.16s
>       2.44s           2.17s
>
> The results above show about a 10% improvement on the stress-ng rawpkt
> testcase, although not much improvement on the perf sched bench
> testcase.
>
> Thanks for any comment and code review!
>
> Chengming Zhou (6):
>   slub: Keep track of whether slub is on the per-node partial list
>   slub: Prepare __slab_free() for unfrozen partial slab out of node
>     partial list
>   slub: Don't freeze slabs for cpu partial
>   slub: Simplify acquire_slab()
>   slub: Introduce get_cpu_partial()
>   slub: Optimize deactivate_slab()
>
>  include/linux/page-flags.h |   2 +
>  mm/slab.h                  |  19 +++
>  mm/slub.c                  | 245 +++++++++++++++++++------------------
>  3 files changed, 150 insertions(+), 116 deletions(-)
>
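To make the acquire_slab() removal idea above a bit more concrete, here's a
rough pseudo-C sketch of what get_partial_node() could look like. This is
only an illustration of the shape I have in mind, not actual code: helper
names like freeze_slab() and have_enough_partials() are made up, and error
handling, counters etc. are omitted.

```
/* Pseudo-code sketch only -- helper names are illustrative. */
static struct slab *get_partial_node(struct kmem_cache *s,
				     struct kmem_cache_node *n)
{
	struct slab *slab, *tmp, *first = NULL;
	unsigned long flags;

	spin_lock_irqsave(&n->list_lock, flags);
	list_for_each_entry_safe(slab, tmp, &n->partial, slab_list) {
		/*
		 * Only remove the slab from the node partial list (which
		 * clears its PG_workingset flag); no freezing here, so no
		 * cmpxchg_double under the list_lock.
		 */
		remove_partial(n, slab);
		if (!first)
			first = slab;
		else
			put_cpu_partial(s, slab, 0);	/* stays unfrozen */
		if (have_enough_partials(s))
			break;
	}
	spin_unlock_irqrestore(&n->list_lock, flags);

	/* Freeze only the slab we'll actually use, outside the list_lock. */
	if (first)
		freeze_slab(s, first);	/* cmpxchg_double try-loop */

	/* Return the removed slab itself, not an acquired "object". */
	return first;
}
```

With something like this, the cmpxchg_double failure case no longer needs
special handling under the lock, since the list manipulation and the
freezing are fully decoupled.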