Message-ID: <5cf40e33-d1ae-4ac9-9d01-559b86f853a8@bytedance.com>
Date: Mon, 19 Feb 2024 17:29:02 +0800
Subject: Re: [PATCH] slub: avoid scanning all partial slabs in get_slabinfo()
From: Chengming Zhou <zhouchengming@bytedance.com>
To: Vlastimil Babka, David Rientjes, Jianfeng Wang
Cc: cl@linux.com, penberg@kernel.org, iamjoonsoo.kim@lge.com, akpm@linux-foundation.org, roman.gushchin@linux.dev, 42.hyeyoo@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240215211457.32172-1-jianfeng.w.wang@oracle.com> <6b58d81f-8e8f-3732-a5d4-40eece75013b@google.com>
On 2024/2/19 16:30, Vlastimil Babka wrote:
> On 2/18/24 20:25, David Rientjes wrote:
>> On Thu, 15 Feb 2024, Jianfeng Wang wrote:
>>
>>> When reading "/proc/slabinfo", the kernel needs to report the number of
>>> free objects for each kmem_cache. The current implementation relies on
>>> count_partial(), which counts the number of free objects by scanning each
>>> kmem_cache_node's partial slab list and summing the free objects of all
>>> partial slabs in the list. This process must hold the per-kmem_cache_node
>>> spinlock and disable IRQs. Consequently, it can block slab allocation
>>> requests on other CPU cores and cause timeouts for network devices etc.,
>>> if the partial slab list is long. In production, even the NMI watchdog can
>>> be triggered because some slab caches have a long partial list: e.g.,
>>> for "buffer_head", the number of partial slabs was observed to be ~1M
>>> in one kmem_cache_node. This problem was also observed by several
>>> others [1-2] in the past.

I'm not sure whether such a long partial list is a normal situation. The
list may just be very fragmented, right?

SLUB depends entirely on timing order when placing partial slabs on the node
list, which may be suboptimal in some cases. Maybe we could introduce an
anti-fragmentation mechanism like the fullness grouping in zsmalloc, keeping
multiple lists based on fullness? Just some random thoughts... :)

>>>
>>> The fix is to maintain a counter of free objects for each kmem_cache.
>>> Then, in get_slabinfo(), use the counter rather than count_partial()
>>> when reporting the number of free objects for a slab cache. A per-cpu
>>> counter is used to minimize atomic and lock operations.
>>>
>>> Benchmark: run hackbench on a dual-socket 72-CPU bare metal machine
>>> with 256 GB memory and an Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.3 GHz.
>>> The command is "hackbench 18 thread 20000". Each group gets 10 runs.
>>>
>>
>> This seems particularly intrusive for the common path to optimize for
>> reading of /proc/slabinfo, and that's shown in the benchmark result.
>>
>> Could you discuss the /proc/slabinfo usage model a bit? It's not clear if
>> this is being continuously read, or whether even a single read in
>> isolation is problematic.
>>
>> That said, optimizing for reading /proc/slabinfo at the cost of runtime
>> performance degradation doesn't sound like the right trade-off.
>
> It should be possible to make this overhead smaller by restricting the
> counter only to partial list slabs, as [2] did. This would keep it out of
> the fast paths, where it's really not acceptable.
> Note [2] used atomic_long_t, and the percpu counters used here should have
> lower overhead. So basically try to get the best of both attempts.

Right, the current count_partial() also only iterates over slabs on the node
partial list; it doesn't include slabs on the cpu partial lists.
So this new percpu counter should also only include slabs on the node partial
list. Then the overhead should be lower.
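
Something like the sketch below is what I mean (illustrative only, not the
actual patch: the partial_free_objs field and the helper names are made up
here). The counter is only touched where slabs enter or leave the node
partial list, i.e. code that already runs under n->list_lock, so the
allocation/free fast paths stay untouched, and get_slabinfo() reads the
counter instead of walking the list. An exact count would also need updates
wherever slab->inuse changes for a slab already on the partial list (e.g. a
free into a partial slab), which is omitted here for brevity.

	struct kmem_cache_node {
		...
		/* hypothetical field; percpu_counter_init() at node init omitted */
		struct percpu_counter partial_free_objs;
	};

	/* called where a slab is added to n->partial, under n->list_lock */
	static inline void partial_free_objs_add(struct kmem_cache_node *n,
						 struct slab *slab)
	{
		lockdep_assert_held(&n->list_lock);
		percpu_counter_add(&n->partial_free_objs,
				   slab->objects - slab->inuse);
	}

	/* called where a slab is removed from n->partial, under n->list_lock */
	static inline void partial_free_objs_del(struct kmem_cache_node *n,
						 struct slab *slab)
	{
		lockdep_assert_held(&n->list_lock);
		percpu_counter_sub(&n->partial_free_objs,
				   slab->objects - slab->inuse);
	}

	/* in get_slabinfo(), instead of count_partial(n, count_free): */
		nr_free += percpu_counter_sum_positive(&n->partial_free_objs);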