linux-mm.kvack.org archive mirror
From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Joonsoo Kim <js1304@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Christoph Lameter <cl@linux.com>,
	akpm@linuxfoundation.org, Steven Rostedt <rostedt@goodmis.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Linux Memory Management List <linux-mm@kvack.org>,
	Pekka Enberg <penberg@kernel.org>,
	brouer@redhat.com
Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1
Date: Wed, 17 Dec 2014 13:08:41 +0100	[thread overview]
Message-ID: <20141217130841.100dac71@redhat.com> (raw)
In-Reply-To: <CAAmzW4NCpx5aJyW36fgOfu3EaDj6=uv6MUiBC+a0ggePWPXndQ@mail.gmail.com>

On Wed, 17 Dec 2014 16:13:49 +0900 Joonsoo Kim <js1304@gmail.com> wrote:

> Ping... and I found another way to remove preempt_disable/enable
> without complex changes.
> 
> What we want to ensure is getting tid and kmem_cache_cpu
> on the same cpu. We can achieve that goal with below condition loop.
> 
> I ran Jesper's benchmark and saw 3~5% win in a fast-path loop over
> kmem_cache_alloc+free in CONFIG_PREEMPT.
> 
> 14.5 ns -> 13.8 ns

Hi Kim,

I've tested your patch.  Full report below the patch.

In summary, I'm seeing 18.599 ns -> 17.523 ns (a 1.076 ns improvement).

For network overload tests:

Dropping packets in iptables raw, which hits the slub fast-path.
Here I'm seeing an improvement of 3 ns.

For IP-forward, which also invokes the slub slower path, I'm seeing
an improvement of 6 ns (I was not expecting to see any improvement
here; the kmem_cache_alloc code is 24 bytes smaller, so perhaps it's
saving some icache).

Full report below the patch...
 
> See following patch.
> 
> Thanks.
> 
> ----------->8-------------
> diff --git a/mm/slub.c b/mm/slub.c
> index 95d2142..e537af5 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2399,8 +2399,10 @@ redo:
>          * on a different processor between the determination of the pointer
>          * and the retrieval of the tid.
>          */
> -       preempt_disable();
> -       c = this_cpu_ptr(s->cpu_slab);
> +       do {
> +               tid = this_cpu_read(s->cpu_slab->tid);
> +               c = this_cpu_ptr(s->cpu_slab);
> +       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> 
>         /*
>          * The transaction ids are globally unique per cpu and per operation on
> @@ -2408,8 +2410,6 @@ redo:
>          * occurs on the right processor and that there was no operation on the
>          * linked list in between.
>          */
> -       tid = c->tid;
> -       preempt_enable();
> 
>         object = c->freelist;
>         page = c->page;
> @@ -2655,11 +2655,10 @@ redo:
>          * data is retrieved via this pointer. If we are on the same cpu
>          * during the cmpxchg then the free will succeed.
>          */
> -       preempt_disable();
> -       c = this_cpu_ptr(s->cpu_slab);
> -
> -       tid = c->tid;
> -       preempt_enable();
> +       do {
> +               tid = this_cpu_read(s->cpu_slab->tid);
> +               c = this_cpu_ptr(s->cpu_slab);
> +       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> 
>         if (likely(page == c->page)) {
>                 set_freepointer(s, object, c->freelist);

SLUB evaluation 03
==================

Testing patch from Joonsoo Kim <iamjoonsoo.kim@lge.com> slub fast-path
preempt_{disable,enable} avoidance.

Kernel
======
Compiler: GCC 4.9.1

Kernel config ::

 $ grep PREEMPT .config
 CONFIG_PREEMPT_RCU=y
 CONFIG_PREEMPT_NOTIFIERS=y
 # CONFIG_PREEMPT_NONE is not set
 # CONFIG_PREEMPT_VOLUNTARY is not set
 CONFIG_PREEMPT=y
 CONFIG_PREEMPT_COUNT=y
 # CONFIG_DEBUG_PREEMPT is not set

 $ egrep -e "SLUB|SLAB" .config
 # CONFIG_SLUB_DEBUG is not set
 # CONFIG_SLAB is not set
 CONFIG_SLUB=y
 # CONFIG_SLUB_CPU_PARTIAL is not set
 # CONFIG_SLUB_STATS is not set

On top of::

 commit f96fe225677b3efb74346ebd56fafe3997b02afa
 Merge: 5543798 eea3e8f
 Author: Linus Torvalds <torvalds@linux-foundation.org>
 Date:   Fri Dec 12 16:11:12 2014 -0800

    Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net


Setup
=====

netfilter_unload_modules.sh
sudo rmmod nf_reject_ipv4 nf_reject_ipv6

base_device_setup.sh eth4  # 10G sink/receiving interface (ixgbe)
base_device_setup.sh eth5
sudo ethtool --coalesce eth4 rx-usecs 30
sudo ip neigh add 192.168.21.66 dev eth5 lladdr 00:00:ba:d0:ba:d0
sudo ip route add 198.18.0.0/15 via 192.168.21.66 dev eth5


# sudo tuned-adm active
Current active profile: latency-performance

Drop in raw
-----------
alias iptables='sudo iptables'
iptables -t raw -N simple || iptables -t raw -F simple
iptables -t raw -I simple -d 198.18.0.0/15 -j DROP
iptables -t raw -D PREROUTING -j simple
iptables -t raw -I PREROUTING -j simple

Generator
---------
./pktgen02_burst.sh -d 198.18.0.2 -i eth8 -m 90:E2:BA:0A:56:B4 -b 8 -t 3 -s 64


Patch by Joonsoo Kim to avoid preempt in slub
=============================================

baseline: without patch
-----------------------

baseline kernel v3.18-7016-gf96fe22 at commit f96fe22567

Type:kmem fastpath reuse Per elem: 46 cycles(tsc) 18.599 ns
 - (measurement period time:1.859917529 sec time_interval:1859917529)
 - (invoke count:100000000 tsc_interval:4649791431)

alloc N-pattern before free with 256 elements

Type:kmem alloc+free N-pattern Per elem: 100 cycles(tsc) 40.077 ns
 - (measurement period time:1.025993290 sec time_interval:1025993290)
 - (invoke count:25600000 tsc_interval:2564981743)

single flow/CPU
 * IP-forward
  - instant rx:0 tx:1165376 pps n:60 average: rx:0 tx:1165928 pps
    (instant variation TX -0.407 ns (min:-0.828 max:0.507) RX 0.000 ns)
 * Drop in RAW (slab fast-path test)
   - instant rx:3245248 tx:0 pps n:60 average: rx:3245325 tx:0 pps
     (instant variation TX 0.000 ns (min:0.000 max:0.000) RX -0.007 ns)

Christoph's slab_test, baseline kernel (at commit f96fe22567)::

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
 10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
 10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
 10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
 10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
 10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
 10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
 10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
 10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
 10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 68 cycles
 10000 times kmalloc(16)/kfree -> 68 cycles
 10000 times kmalloc(32)/kfree -> 69 cycles
 10000 times kmalloc(64)/kfree -> 68 cycles
 10000 times kmalloc(128)/kfree -> 68 cycles
 10000 times kmalloc(256)/kfree -> 68 cycles
 10000 times kmalloc(512)/kfree -> 74 cycles
 10000 times kmalloc(1024)/kfree -> 75 cycles
 10000 times kmalloc(2048)/kfree -> 74 cycles
 10000 times kmalloc(4096)/kfree -> 74 cycles
 10000 times kmalloc(8192)/kfree -> 75 cycles
 10000 times kmalloc(16384)/kfree -> 510 cycles

$ nm --print-size vmlinux | egrep -e 'kmem_cache_alloc|kmem_cache_free|is_pointer_to_page'
ffffffff81163bd0 00000000000000e1 T kmem_cache_alloc
ffffffff81163ac0 000000000000010c T kmem_cache_alloc_node
ffffffff81162cb0 000000000000013b T kmem_cache_free


with patch
----------

single flow/CPU
 * IP-forward
  - instant rx:0 tx:1174652 pps n:60 average: rx:0 tx:1174222 pps
    (instant variation TX 0.311 ns (min:-0.230 max:1.018) RX 0.000 ns)
 * compare against baseline:
  - 1174222-1165928 = +8294pps
  - (1/1174222*10^9)-(1/1165928*10^9) = -6.058ns

 * Drop in RAW (slab fast-path test)
  - instant rx:3277440 tx:0 pps n:74 average: rx:3277737 tx:0 pps
    (instant variation TX 0.000 ns (min:0.000 max:0.000) RX -0.028 ns)
 * compare against baseline:
  - 3277737-3245325 = +32412 pps
  - (1/3277737*10^9)-(1/3245325*10^9) = -3.047ns

SLUB fast-path test: time_bench_kmem_cache1
 * modprobe time_bench_kmem_cache1 ; rmmod time_bench_kmem_cache1; sudo dmesg -c

Type:kmem fastpath reuse Per elem: 43 cycles(tsc) 17.523 ns (step:0)
 - (measurement period time:1.752338378 sec time_interval:1752338378)
 - (invoke count:100000000 tsc_interval:4380843588)
  * difference: 17.523 - 18.599 = -1.076ns

alloc N-pattern before free with 256 elements

Type:kmem alloc+free N-pattern Per elem: 100 cycles(tsc) 40.369 ns (step:0)
 - (measurement period time:1.033447112 sec time_interval:1033447112)
 - (invoke count:25600000 tsc_interval:2583616203)
    * difference: 40.369 - 40.077 = +0.292ns


Christoph's slab_test::

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
 10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
 10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
 10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
 10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
 10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
 10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
 10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
 10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
 10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 65 cycles
 10000 times kmalloc(16)/kfree -> 66 cycles
 10000 times kmalloc(32)/kfree -> 65 cycles
 10000 times kmalloc(64)/kfree -> 66 cycles
 10000 times kmalloc(128)/kfree -> 66 cycles
 10000 times kmalloc(256)/kfree -> 71 cycles
 10000 times kmalloc(512)/kfree -> 72 cycles
 10000 times kmalloc(1024)/kfree -> 71 cycles
 10000 times kmalloc(2048)/kfree -> 71 cycles
 10000 times kmalloc(4096)/kfree -> 71 cycles
 10000 times kmalloc(8192)/kfree -> 65 cycles
 10000 times kmalloc(16384)/kfree -> 511 cycles

$ nm --print-size vmlinux | egrep -e 'kmem_cache_alloc|kmem_cache_free|is_pointer_to_page'
ffffffff81163ba0 00000000000000c9 T kmem_cache_alloc
ffffffff81163aa0 00000000000000f8 T kmem_cache_alloc_node
ffffffff81162cb0 0000000000000133 T kmem_cache_free



Kernel size change
------------------

 $ scripts/bloat-o-meter vmlinux vmlinux-kim-preempt-avoid
 add/remove: 0/0 grow/shrink: 0/8 up/down: 0/-248 (-248)
 function                                     old     new   delta
 kmem_cache_free                              315     307      -8
 kmem_cache_alloc_node                        268     248     -20
 kmem_cache_alloc                             225     201     -24
 kfree                                        274     250     -24
 __kmalloc_node_track_caller                  356     324     -32
 __kmalloc_node                               340     308     -32
 __kmalloc                                    324     273     -51
 __kmalloc_track_caller                       343     286     -57


Qmempool notes:
---------------

On baseline kernel:

Type:qmempool fastpath reuse SOFTIRQ Per elem: 33 cycles(tsc) 13.287 ns
 - (measurement period time:0.398628965 sec time_interval:398628965)
 - (invoke count:30000000 tsc_interval:996571541)

Type:qmempool fastpath reuse BH-disable Per elem: 47 cycles(tsc) 19.180 ns
 - (measurement period time:0.575425927 sec time_interval:575425927)
 - (invoke count:30000000 tsc_interval:1438563781)

qmempool_bench: N-pattern with 256 elements

Type:qmempool alloc+free N-pattern Per elem: 62 cycles(tsc) 24.955 ns (step:0)
 - (measurement period time:0.638871008 sec time_interval:638871008)
 - (invoke count:25600000 tsc_interval:1597176303)


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

