From: Yongqiang Liu <liuyongqiang13@huawei.com>
To: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Subject: Re: [PATCH] mm, slub: prefetch freelist in ___slab_alloc()
Date: Wed, 21 Aug 2024 14:58:23 +0800
Message-ID: <6e744d2b-bbb3-4e1f-bd61-e0e971f974db@huawei.com>
References: <20240819070204.753179-1-liuyongqiang13@huawei.com>
On 2024/8/19 17:33, Hyeonggon Yoo wrote:
> On Mon, Aug 19, 2024 at 4:02 PM Yongqiang Liu wrote:
>> commit 0ad9500e16fe ("slub: prefetch next freelist pointer in
>> slab_alloc()") introduced prefetch_freepointer() for fastpath
>> allocation. Using it on the first freelist load as well gives a
>> small improvement in some workloads.
>> Here are hackbench results on an arm64 machine (about a 3.8%
>> improvement):
>>
>> Before:
>> average time cost of 'hackbench -g 100 -l 1000': 17.068
>>
>> After:
>> average time cost of 'hackbench -g 100 -l 1000': 16.416
>>
>> There is also about a 5% improvement on an x86_64 machine
>> for hackbench.
>
> I think adding more prefetch might not be a good idea unless we have
> more real-world data supporting it, because prefetch might help when
> slab is frequently used, but it will end up unnecessarily using more
> cache lines when slab is not frequently used.

Yes, prefetching unnecessary objects is a bad idea. But once an
allocation has entered the slowpath, it is more likely that more
objects will be needed soon. I've tested the cases from commit
0ad9500e16fe ("slub: prefetch next freelist pointer in slab_alloc()").
Here is the result:

Before:

 Performance counter stats for './hackbench 50 process 4000' (32 runs):

        2545.28 msec task-clock        #    6.938 CPUs utilized     ( +-  1.75% )
           6166      context-switches  #    0.002 M/sec             ( +-  1.58% )
           1129      cpu-migrations    #    0.444 K/sec             ( +-  2.16% )
          13298      page-faults       #    0.005 M/sec             ( +-  0.38% )
     4435113150      cycles            #    1.742 GHz               ( +-  1.22% )
     2259717630      instructions      #    0.51 insn per cycle     ( +-  0.05% )
      385847392      branches          #  151.593 M/sec             ( +-  0.06% )
        6205369      branch-misses     #    1.61% of all branches   ( +-  0.56% )

        0.36688 +- 0.00595 seconds time elapsed  ( +-  1.62% )

After:

 Performance counter stats for './hackbench 50 process 4000' (32 runs):

        2277.61 msec task-clock        #    6.855 CPUs utilized     ( +-  0.98% )
           5653      context-switches  #    0.002 M/sec             ( +-  1.62% )
           1081      cpu-migrations    #    0.475 K/sec             ( +-  1.89% )
          13217      page-faults       #    0.006 M/sec             ( +-  0.48% )
     3751509945      cycles            #    1.647 GHz               ( +-  1.14% )
     2253177626      instructions      #    0.60 insn per cycle     ( +-  0.06% )
      384509166      branches          #  168.821 M/sec             ( +-  0.07% )
        6045031      branch-misses     #    1.57% of all branches   ( +-  0.58% )

        0.33225 +- 0.00321 seconds time elapsed  ( +-  0.97% )

> Also I don't understand how adding prefetch in the slowpath affects
> the performance, because most allocs/frees should be done in the
> fastpath. Could you please explain?

I added some debug info to count slowpath allocations for
'hackbench -g 100 -l 1000': slab alloc total: 80416886, slowpath:
7184236. That is about 9% of all allocations taking the slowpath.
The perf stats on arm64 are as follows:

Before:

 Performance counter stats for './hackbench -g 100 -l 1000' (32 runs):

    34766611220      branches                                             ( +-  0.01% )
      382593804      branch-misses          #  1.10% of all branches      ( +-  0.14% )
     1120091414      cache-misses                                         ( +-  0.08% )
    76810485402      L1-dcache-loads                                      ( +-  0.03% )
     1120091414      L1-dcache-load-misses  #  1.46% of all L1-dcache hits  ( +-  0.08% )

        23.8854 +- 0.0804 seconds time elapsed  ( +-  0.34% )

After:

 Performance counter stats for './hackbench -g 100 -l 1000' (32 runs):

    34812735277      branches                                             ( +-  0.01% )
      393449644      branch-misses          #  1.13% of all branches      ( +-  0.15% )
     1095185949      cache-misses                                         ( +-  0.15% )
    76995789602      L1-dcache-loads                                      ( +-  0.03% )
     1095185949      L1-dcache-load-misses  #  1.42% of all L1-dcache hits  ( +-  0.15% )

         23.341 +- 0.104 seconds time elapsed  ( +-  0.45% )

It shows fewer L1-dcache-load-misses with the patch applied.

>> Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
>> ---
>>  mm/slub.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index c9d8a2497fd6..f9daaff10c6a 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3630,6 +3630,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>  	VM_BUG_ON(!c->slab->frozen);
>>  	c->freelist = get_freepointer(s, freelist);
>>  	c->tid = next_tid(c->tid);
>> +	prefetch_freepointer(s, c->freelist);
>>  	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>>  	return freelist;
>>
>> --
>> 2.25.1