From: Jianfeng Wang
To: linux-mm@kvack.org
Cc: vbabka@suse.cz, cl@linux.com, rientjes@google.com, akpm@linux-foundation.org, jianfeng.w.wang@oracle.com
Subject: [PATCH v4 1/2] slub: introduce count_partial_free_approx()
Date: Mon, 22 Apr 2024 21:55:53 -0700
Message-ID: <20240423045554.15045-2-jianfeng.w.wang@oracle.com>
In-Reply-To: <20240423045554.15045-1-jianfeng.w.wang@oracle.com>
References: <20240423045554.15045-1-jianfeng.w.wang@oracle.com>

When reading "/proc/slabinfo", the kernel needs to report the number of
free objects for each kmem_cache. The current implementation uses
count_partial() to get it by scanning each kmem_cache_node's partial slab
list and summing free objects from every partial slab.
This process must hold the per-kmem_cache_node spinlock with IRQs
disabled, and may take a long time. Consequently, when the partial list
is long, it can block slab allocations on other CPUs and cause timeouts
for network devices. In production, even the NMI watchdog can be
triggered by this: e.g., for "buffer_head", the number of partial slabs
was observed to be ~1M in one kmem_cache_node. This problem was also
confirmed by others [1-3].

Iterating a partial list to get the exact count of objects can cause soft
lockups for a long list with or without the lock (e.g., if preemption is
disabled), and may not be very useful: the object count can change after
the lock is released. The alternative of maintaining free-object counters
requires atomic operations on the fast path [3].

So, the fix is to introduce count_partial_free_approx(). This function
returns the free object count of a kmem_cache_node's partial list. It
limits the number of slabs to scan, giving an approximation for a long
list instead of scanning the whole list. Suppose the limit is N. If the
list's length is not greater than N, output the exact count by traversing
the list; if its length is greater than N, output an approximated count
by traversing a subset of the list and scaling the sum by the ratio of
the list's length to the number of slabs scanned, capped at the node's
total object count. The proposed method is to scan N/2 slabs from the
list's head and N/2 slabs from the tail. For a partial list with ~280K
slabs, benchmarks show that this performs better than just counting from
the list's head after slabs get sorted by kmem_cache_shrink(). Default
the limit to 10000, as it produces an approximation within 1% of the
exact count for both scenarios.

Then, use count_partial_free_approx() in get_slabinfo().
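
For illustration only, the head+tail estimate can be modelled in user
space roughly as follows. This is a sketch, not the kernel code: the
array free_objs stands in for the partial list, total_objs for
node_nr_objs(n), limit for the scan limit N, and scale_frac() is a
hypothetical stand-in that mirrors the arithmetic of the kernel's
mult_frac(). The names approx_free() and scale_frac() are assumptions
made for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical user-space stand-in for mult_frac(x, n, d): computes
 * x * n / d as (x / d) * n + (x % d) * n / d, which lowers the risk of
 * overflow in the intermediate product.
 */
static unsigned long scale_frac(unsigned long x, unsigned long n,
				unsigned long d)
{
	return (x / d) * n + (x % d) * n / d;
}

/*
 * free_objs[i] models slab->objects - slab->inuse for the i-th slab of a
 * partial list holding nr_partial slabs. Returns the exact sum for a
 * short list, or a scaled head+tail estimate for a long one.
 */
static unsigned long approx_free(const unsigned long *free_objs,
				 size_t nr_partial,
				 unsigned long total_objs, size_t limit)
{
	unsigned long x = 0;
	size_t scanned = 0;
	size_t i;

	/* limit / 2 must be at least 1 for the head scan to stop early. */
	assert(limit >= 2);

	if (nr_partial <= limit) {
		/* Short list: walk everything and return the exact count. */
		for (i = 0; i < nr_partial; i++)
			x += free_objs[i];
		return x;
	}

	/* Long list: first limit / 2 slabs from the head ... */
	for (i = 0; i < nr_partial; i++) {
		x += free_objs[i];
		if (++scanned == limit / 2)
			break;
	}
	/* ... then slabs from the tail until the limit is reached. */
	for (i = nr_partial; i-- > 0; ) {
		x += free_objs[i];
		if (++scanned == limit)
			break;
	}

	/* Extrapolate to the full list and cap at the total object count. */
	x = scale_frac(x, nr_partial, scanned);
	return x < total_objs ? x : total_objs;
}
```

With ten slabs holding 1..10 free objects and a limit of 4, the sketch
sums 1+2 from the head and 10+9 from the tail, then scales 22 by 10/4.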

Benchmarks: Diff = (exact - approximated) / exact
* Normal case (w/o kmem_cache_shrink()):
| MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)|
| 1000        |  0.43  %              |  1.09  %              |
| 5000        |  0.06  %              |  0.37  %              |
| 10000       |  0.02  %              |  0.16  %              |
| 20000       |  0.009 %              | -0.003 %              |

* Skewed case (w/ kmem_cache_shrink()):
| MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)|
| 1000        |  12.46 %              |  6.75  %              |
| 5000        |  5.38  %              |  1.27  %              |
| 10000       |  4.99  %              |  0.22  %              |
| 20000       |  4.86  %              | -0.06  %              |

[1] https://lore.kernel.org/linux-mm/alpine.DEB.2.21.2003031602460.1537@www.lameter.com/T/
[2] https://lore.kernel.org/lkml/alpine.DEB.2.22.394.2008071258020.55871@www.lameter.com/T/
[3] https://lore.kernel.org/lkml/1e01092b-140d-2bab-aeba-321a74a194ee@linux.com/T/

Signed-off-by: Jianfeng Wang
Acked-by: David Rientjes
---
 mm/slub.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1bb2a93cf7b6..6d8ecad07daf 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3229,6 +3229,43 @@ static unsigned long count_partial(struct kmem_cache_node *n,
 #endif /* CONFIG_SLUB_DEBUG || SLAB_SUPPORTS_SYSFS */
 
 #ifdef CONFIG_SLUB_DEBUG
+#define MAX_PARTIAL_TO_SCAN 10000
+
+static unsigned long count_partial_free_approx(struct kmem_cache_node *n)
+{
+	unsigned long flags;
+	unsigned long x = 0;
+	struct slab *slab;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) {
+		list_for_each_entry(slab, &n->partial, slab_list)
+			x += slab->objects - slab->inuse;
+	} else {
+		/*
+		 * For a long list, approximate the total count of objects in
+		 * it to meet the limit on the number of slabs to scan.
+		 * Scan from both the list's head and tail for better accuracy.
+		 */
+		unsigned long scanned = 0;
+
+		list_for_each_entry(slab, &n->partial, slab_list) {
+			x += slab->objects - slab->inuse;
+			if (++scanned == MAX_PARTIAL_TO_SCAN / 2)
+				break;
+		}
+		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
+			x += slab->objects - slab->inuse;
+			if (++scanned == MAX_PARTIAL_TO_SCAN)
+				break;
+		}
+		x = mult_frac(x, n->nr_partial, scanned);
+		x = min(x, node_nr_objs(n));
+	}
+	spin_unlock_irqrestore(&n->list_lock, flags);
+	return x;
+}
+
 static noinline void
 slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags, int nid)
 {
@@ -7089,7 +7126,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
 	for_each_kmem_cache_node(s, node, n) {
 		nr_slabs += node_nr_slabs(n);
 		nr_objs += node_nr_objs(n);
-		nr_free += count_partial(n, count_free);
+		nr_free += count_partial_free_approx(n);
 	}
 
 	sinfo->active_objs = nr_objs - nr_free;
-- 
2.42.1