Re: [PATCH] mm, slub: prefetch freelist in ___slab_alloc()

linux-mm.kvack.org archive mirror

From: Yongqiang Liu <liuyongqiang13@huawei.com>
To: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	<zhangxiaoxu5@huawei.com>, <cl@linux.com>,
	<wangkefeng.wang@huawei.com>, <penberg@kernel.org>,
	<rientjes@google.com>, <iamjoonsoo.kim@lge.com>,
	<akpm@linux-foundation.org>, <vbabka@suse.cz>,
	<roman.gushchin@linux.dev>
Subject: Re: [PATCH] mm, slub: prefetch freelist in ___slab_alloc()
Date: Wed, 21 Aug 2024 14:58:23 +0800
Message-ID: <6e744d2b-bbb3-4e1f-bd61-e0e971f974db@huawei.com>
In-Reply-To: <CAB=+i9TxYRcr+ZRMD31SDay+899RXOwTvQevC=8sv7b27ZO1Vg@mail.gmail.com>


On 2024/8/19 17:33, Hyeonggon Yoo wrote:
> On Mon, Aug 19, 2024 at 4:02 PM Yongqiang Liu <liuyongqiang13@huawei.com> wrote:
>> commit 0ad9500e16fe ("slub: prefetch next freelist pointer in
>> slab_alloc()") introduced prefetch_freepointer() for fastpath
>> allocation. Using it at the first load of the freelist could give
>> a small improvement in some workloads. Here are hackbench results
>> on an arm64 machine (about 3.8%):
>>
>> Before:
>>    average time cost of 'hackbench -g 100 -l 1000': 17.068
>>
>> After:
>>    average time cost of 'hackbench -g 100 -l 1000': 16.416
>>
>> There is also about a 5% improvement on an x86_64 machine
>> for hackbench.
> I think adding more prefetch might not be a good idea unless we have
> more real-world data supporting it because prefetch might help when slab
> is frequently used, but it will end up unnecessarily using more cache
> lines when slab is not frequently used.

Yes, prefetching unnecessary objects is a bad idea. But I think that once
a slab has entered the slowpath, it is more likely to need more objects.
I've tested the cases from commit 0ad9500e16fe ("slub: prefetch next
freelist pointer in slab_alloc()"). Here is the result:

Before:

 Performance counter stats for './hackbench 50 process 4000' (32 runs):

        2545.28 msec task-clock         #    6.938 CPUs utilized      ( +-  1.75% )
           6166      context-switches   #    0.002 M/sec              ( +-  1.58% )
           1129      cpu-migrations     #    0.444 K/sec              ( +-  2.16% )
          13298      page-faults        #    0.005 M/sec              ( +-  0.38% )
     4435113150      cycles             #    1.742 GHz                ( +-  1.22% )
     2259717630      instructions       #    0.51  insn per cycle     ( +-  0.05% )
      385847392      branches           #  151.593 M/sec              ( +-  0.06% )
        6205369      branch-misses      #    1.61% of all branches    ( +-  0.56% )

        0.36688 +- 0.00595 seconds time elapsed  ( +-  1.62% )

After:

 Performance counter stats for './hackbench 50 process 4000' (32 runs):

        2277.61 msec task-clock         #    6.855 CPUs utilized      ( +-  0.98% )
           5653      context-switches   #    0.002 M/sec              ( +-  1.62% )
           1081      cpu-migrations     #    0.475 K/sec              ( +-  1.89% )
          13217      page-faults        #    0.006 M/sec              ( +-  0.48% )
     3751509945      cycles             #    1.647 GHz                ( +-  1.14% )
     2253177626      instructions       #    0.60  insn per cycle     ( +-  0.06% )
      384509166      branches           #  168.821 M/sec              ( +-  0.07% )
        6045031      branch-misses      #    1.57% of all branches    ( +-  0.58% )

        0.33225 +- 0.00321 seconds time elapsed  ( +-  0.97% )
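
To make the idea under test concrete, here is a minimal userspace sketch of a freelist pop followed by a prefetch of the new list head. It is not the kernel code: struct object, struct cpu_cache, prefetch_next(), alloc_object() and build_freelist() are hypothetical names invented for this illustration, and __builtin_prefetch() (a GCC/Clang builtin) only stands in for what prefetch_freepointer() is meant to achieve.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a free object stores the pointer to the next free
 * object inside itself, similar in spirit to a SLUB freelist. */
struct object {
	struct object *next_free;
	char payload[56];
};

/* Hypothetical stand-in for the per-CPU slab state. */
struct cpu_cache {
	struct object *freelist;	/* head of the list of free objects */
};

/* Hint the CPU to pull in the line holding obj's free pointer.
 * The NULL check keeps the address expression itself valid. */
static void prefetch_next(struct object *obj)
{
	if (obj)
		__builtin_prefetch(&obj->next_free);
}

/* Pop one object: advance the list head, then prefetch the new head so the
 * following allocation is less likely to stall on its free pointer. */
static struct object *alloc_object(struct cpu_cache *c)
{
	struct object *obj = c->freelist;

	if (!obj)
		return NULL;
	c->freelist = obj->next_free;
	prefetch_next(c->freelist);
	return obj;
}

/* Chain n objects of pool into a freelist in index order. */
static void build_freelist(struct cpu_cache *c, struct object *pool, size_t n)
{
	assert(pool != NULL);
	c->freelist = NULL;
	for (size_t i = n; i-- > 0; ) {
		pool[i].next_free = c->freelist;
		c->freelist = &pool[i];
	}
}
```

The prefetch is only a hint and does not change which object is returned, so the sketch behaves identically with or without it. Whether it pays off depends on how soon the next allocation arrives, which is the question raised in this thread.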

>
> Also I don't understand how adding prefetch in slowpath affects the performance
> because most allocs/frees should be done in the fastpath. Could you
> please explain?

I added some debug code to count slowpath allocations during hackbench:

'hackbench -g 100 -l 1000' slab alloc total: 80416886, of which slowpath:
7184236.

So about 9% of all allocations go through the slowpath. The perf stats on
arm64 are as follows:

Before:

 Performance counter stats for './hackbench -g 100 -l 1000' (32 runs):

    34766611220      branches                                                   ( +-  0.01% )
      382593804      branch-misses            #    1.10% of all branches        ( +-  0.14% )
     1120091414      cache-misses                                               ( +-  0.08% )
    76810485402      L1-dcache-loads                                            ( +-  0.03% )
     1120091414      L1-dcache-load-misses    #    1.46% of all L1-dcache hits  ( +-  0.08% )

        23.8854 +- 0.0804 seconds time elapsed  ( +-  0.34% )

After:

 Performance counter stats for './hackbench -g 100 -l 1000' (32 runs):

    34812735277      branches                                                   ( +-  0.01% )
      393449644      branch-misses            #    1.13% of all branches        ( +-  0.15% )
     1095185949      cache-misses                                               ( +-  0.15% )
    76995789602      L1-dcache-loads                                            ( +-  0.03% )
     1095185949      L1-dcache-load-misses    #    1.42% of all L1-dcache hits  ( +-  0.15% )

         23.341 +- 0.104 seconds time elapsed  ( +-  0.45% )

It seems there are fewer L1-dcache-load-misses with the patch.
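
As a rough cross-check of the figures quoted above, the following small C sketch recomputes the slowpath share and the relative changes from the raw numbers in this mail. It is plain arithmetic, not part of the patch, and the helper names percent_of() and percent_drop() are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Share of part in total, in percent. */
static double percent_of(double part, double total)
{
	assert(total != 0.0);
	return 100.0 * part / total;
}

/* Relative reduction going from before to after, in percent. */
static double percent_drop(double before, double after)
{
	assert(before != 0.0);
	return 100.0 * (before - after) / before;
}

/* Print the derived figures for 'hackbench -g 100 -l 1000' on arm64,
 * using only the raw counts reported in this mail. */
static void print_summary(void)
{
	printf("slowpath share of allocations: %.2f%%\n",
	       percent_of(7184236.0, 80416886.0));
	printf("L1-dcache-load-misses reduction: %.2f%%\n",
	       percent_drop(1120091414.0, 1095185949.0));
	printf("elapsed time reduction: %.2f%%\n",
	       percent_drop(23.8854, 23.341));
}
```

Calling print_summary() from a main() gives roughly 8.9% for the slowpath share, 2.2% fewer L1-dcache-load-misses and 2.3% less elapsed time, consistent with the "about 9%" above.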

>
>> Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
>> ---
>>   mm/slub.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index c9d8a2497fd6..f9daaff10c6a 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3630,6 +3630,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>          VM_BUG_ON(!c->slab->frozen);
>>          c->freelist = get_freepointer(s, freelist);
>>          c->tid = next_tid(c->tid);
>> +       prefetch_freepointer(s, c->freelist);
>>          local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>>          return freelist;
>>
>> --
>> 2.25.1
>>
