From: Vlastimil Babka
Date: Tue, 2 Apr 2024 18:13:22 +0200
Subject: Re: [PATCH] slub: fix slub segmentation
To: Chengming Zhou, Ming Yang, cl@linux.com, penberg@kernel.org, rientjes@google.com, iamjoonsoo.kim@lge.com, akpm@linux-foundation.org, roman.gushchin@linux.dev, 42.hyeyoo@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: zhangliang5@huawei.com, wangzhigang17@huawei.com, liushixin2@huawei.com, alex.chen@huawei.com, pengyi.pengyi@huawei.com, xiqi2@huawei.com
References: <20240402031025.1097-1-yangming73@huawei.com>

On 4/2/24 5:45 AM, Chengming Zhou wrote:
> On 2024/4/2 11:10, Ming Yang wrote:
>> When one of the numa nodes runs out of memory and lots of processes are
>> still booting, slabinfo shows that much slub segmentation exists. The following

You mean fragmentation not segmentation, right?

>> shows some of them:
>>
>> tunables : slabdata
>>
>> kmalloc-512 84309 380800 1024 32 8 :
>> tunables 0 0 0 : slabdata 11900 11900 0
>> kmalloc-256 65869 365408 512 32 4 :
>> tunables 0 0 0 : slabdata 11419 11419 0
>>
>> 365408 "kmalloc-256" objects are allocated but only 65869 of them are
>> used, while 380800 "kmalloc-512" objects are allocated but only 84309
>> of them are used.
>>
>> This problem exists in the following scenario:
>> 1. Multiple numa nodes, e.g. four nodes.
>> 2. Lack of memory in any one node.
>> 3. Functions which allocate a lot of slub memory on certain numa nodes,
>> like alloc_fair_sched_group.
>>
>> The slub segmentation is generated for the following reason:
>> In function "___slab_alloc" an attempt is made to get a slab via
>> function "get_partial". If the argument 'node' is assigned but there
>> is neither partial memory nor buddy memory on that assigned node, no
>> slab can be obtained. The program then attempts to allocate a new slab
>> from the buddy system. As mentioned before, there is no buddy memory
>> left on the assigned node, so a new slab might be allocated from the
>> buddy system of another node directly, no matter whether there is free
>> partial memory left on another node. As a result, slub segmentation is
>> generated.
>>
>> The key point of the above allocation flow is: the slab should be
>> allocated from the partial list of another node first, instead of from
>> the buddy system of another node directly.
>>
>> In this commit a new slub allocation flow is proposed:
>> 1. Attempt to get a slab via function get_partial (first step in
>> new_objects label).
>> 2. If no slab is obtained and 'node' is assigned, try to allocate a new
>> slab just from the assigned node instead of from all nodes.
>> 3. If no slab could be allocated from the assigned node, try to get a
>> slab from the partial lists of other nodes.
>> 4. If the allocation in step 3 fails, allocate a new slab from the buddy
>> system of all nodes.
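
For anyone following the thread, the four steps above can be modelled in plain userspace C. This is only a toy sketch of the ordering as the changelog describes it, not kernel code: struct toy_node, enum toy_source, toy_alloc_slab() and TOY_NR_NODES are hypothetical names made up for illustration, each node is reduced to two counters, and the model only covers the case where 'node' is assigned.

```c
#include <assert.h>

#define TOY_NR_NODES 4	/* hypothetical: four numa nodes, as in the scenario */

/* Hypothetical per-node state; not the kernel's struct kmem_cache_node. */
struct toy_node {
	int partial_slabs;	/* partially used slabs already on this node */
	int free_pages;		/* pages left in this node's buddy allocator */
};

/* Where the slab for the request came from. */
enum toy_source {
	TOY_FAIL,
	TOY_PARTIAL_LOCAL,	/* step 1 */
	TOY_NEW_LOCAL,		/* step 2 */
	TOY_PARTIAL_REMOTE,	/* step 3 */
	TOY_NEW_REMOTE		/* step 4 */
};

/* Model of the proposed order for a request with an assigned node. */
static enum toy_source toy_alloc_slab(struct toy_node *nodes, int node)
{
	int i;

	assert(node >= 0 && node < TOY_NR_NODES);

	/* Step 1: reuse a partial slab of the assigned node. */
	if (nodes[node].partial_slabs > 0) {
		nodes[node].partial_slabs--;
		return TOY_PARTIAL_LOCAL;
	}

	/* Step 2: new slab from the assigned node only. */
	if (nodes[node].free_pages > 0) {
		nodes[node].free_pages--;
		return TOY_NEW_LOCAL;
	}

	/* Step 3: reuse a partial slab of any other node. */
	for (i = 0; i < TOY_NR_NODES; i++) {
		if (i != node && nodes[i].partial_slabs > 0) {
			nodes[i].partial_slabs--;
			return TOY_PARTIAL_REMOTE;
		}
	}

	/* Step 4: new slab from the buddy allocator of any other node. */
	for (i = 0; i < TOY_NR_NODES; i++) {
		if (i != node && nodes[i].free_pages > 0) {
			nodes[i].free_pages--;
			return TOY_NEW_REMOTE;
		}
	}

	return TOY_FAIL;
}
```

With node 0 exhausted (no partial slabs, no free pages) and node 1 holding partial slabs, toy_alloc_slab(nodes, 0) returns TOY_PARTIAL_REMOTE. If steps 3 and 4 were swapped, which is how the changelog describes the current behaviour, the same request would consume a fresh page on another node while that node's partial slabs stay underused.
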
>
> FYI, there is another patch to the very same problem:
>
> https://lore.kernel.org/all/20240330082335.29710-1-chenjun102@huawei.com/

Yeah, and I have just taken that one into slab/for-6.10.

>>
>> Signed-off-by: Ming Yang
>> Signed-off-by: Liang Zhang
>> Signed-off-by: Zhigang Wang
>> Reviewed-by: Shixin Liu
>> ---
>> This patch can be tested and verified by the following steps:
>> 1. First, exhaust the memory on node0: echo 1000(depending on your memory) >
>> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages.
>> 2. Second, start 10000(depending on your memory) processes which use the setsid
>> system call, as the setsid system call is likely to call function
>> alloc_fair_sched_group.
>> 3. Last, check slabinfo: cat /proc/slabinfo.
>>
>> Hardware info:
>> Memory : 8GiB
>> CPU (total #): 120
>> numa node: 4
>>
>> Test C code example:
>> #include <stdlib.h>
>> #include <unistd.h>
>>
>> int main(void) {
>> 	void *p = malloc(1024);
>> 	setsid();
>> 	while (1);
>> }
>>
>>  mm/slub.c | 11 +++++++++++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 1bb2a93cf7..3eb2e7d386 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3522,7 +3522,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>  	}
>>  
>>  	slub_put_cpu_ptr(s->cpu_slab);
>> +	if (node != NUMA_NO_NODE) {
>> +		slab = new_slab(s, gfpflags | __GFP_THISNODE, node);
>> +		if (slab)
>> +			goto slab_alloced;
>> +
>> +		slab = get_any_partial(s, &pc);
>> +		if (slab)
>> +			goto slab_alloced;
>> +	}
>>  	slab = new_slab(s, gfpflags, node);
>> +
>> +slab_alloced:
>>  	c = slub_get_cpu_ptr(s->cpu_slab);
>>  
>>  	if (unlikely(!slab)) {