From: Eric Dumazet <eric.dumazet@gmail.com>
To: Christoph Lameter <cl@linux.com>, David Miller <davem@davemloft.net>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>,
David Rientjes <rientjes@google.com>,
Andi Kleen <andi@firstfloor.org>,
tj@kernel.org, Metathronius Galabant <m.galabant@googlemail.com>,
Matt Mackall <mpm@selenic.com>, Adrian Drzewiecki <z@drze.net>,
Shaohua Li <shaohua.li@intel.com>, Alex Shi <alex.shi@intel.com>,
linux-mm@kvack.org, netdev <netdev@vger.kernel.org>
Subject: Re: [rfc 00/18] slub: irqless/lockless slow allocation paths
Date: Wed, 16 Nov 2011 18:39:58 +0100 [thread overview]
Message-ID: <1321465198.4182.35.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC> (raw)
In-Reply-To: <20111111200711.156817886@linux.com>

On Friday, 11 November 2011 at 14:07 -0600, Christoph Lameter wrote:
> This is a patchset that makes the allocator slow path also lockless like
> the free paths. However, in the process it is making processing more
> complex so that this is not a performance improvement. I am going to
> drop this series unless someone comes up with a bright idea to fix the
> following performance issues:
>
> 1. Had to reduce the per cpu state kept to two words in order to
> be able to operate without preempt disable / interrupt disable only
> through cmpxchg_double(). This means that the node information and
> the page struct location have to be calculated from the free pointer.
> That is possible but relatively expensive and has to be done frequently
> in fast paths.
>
> 2. If the freepointer becomes NULL then the page struct location can
> no longer be determined. So per cpu slabs must be deactivated when
> the last object is retrieved from them causing more regressions.
>
> If these issues remain unresolved then I am fine with the way things are
> right now in slub. Currently interrupts are disabled in the slow paths and
> then multiple fields in the kmem_cache_cpu structure are modified without
> regard to instruction atomicity.
>
I believe this is the wrong idea.

You try to make the slow path lockless, while I believe you should not;
instead, batch things a bit like SLAB does, and be smart about false
sharing.

The lock cost is nothing compared to cache line ping-pongs.

Here is a real use case I am facing right now:

In the traditional NIC driver model, the rx path uses a ring buffer of
pre-allocated skbs (256 ... 4096 elems per ring) and feeds them to the
upper stack when an interrupt signals that frames are available.

If the skb is delivered to a socket and consumed/freed by another cpu,
there is no particular problem, because the skb is part of a page that
is completely used (no free objects in it), thanks to the RX ring
buffering (frame N is delivered to the stack only once allocations
N+1 ... N+1024 have already been done).

This model has a downside: we initialize the skb at allocation time,
then add it to the ring buffer. By the time we handle the frame, the
sk_buff content has been evicted from the cpu caches, so the cpu must
reload the sk_buff from memory before handing the skb to the stack.
This adds some latency to the receive path (about 5 cache line misses
per packet).

We now want to allocate/populate the sk_buff right before handing it to
the upper stack (see the build_skb() infrastructure in the net-next tree:
http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git;a=commit;h=b2b5ce9d1ccf1c45f8ac68e5d901112ab76ba199
http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git;a=commit;h=e52fcb2462ac484e6dd6e68869536609f0216938
).
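
To make the contrast concrete, here is a rough sketch of the two rx
models. This is hypothetical driver code, not the actual tg3 patches:
struct my_ring, RX_BUF_SIZE and the descriptor layout are made up for
the example, and build_skb() is shown with the two-argument form used
by later kernels.

/* Old model: the sk_buff was fully initialised long before the frame
 * arrived, so it is cold by the time the stack touches it. */
static void old_model_rx(struct my_ring *ring, struct napi_struct *napi, u32 idx)
{
	struct sk_buff *skb = ring->skb[idx];		/* allocated long ago */

	skb_put(skb, ring->desc[idx].len);
	skb->protocol = eth_type_trans(skb, ring->dev);
	napi_gro_receive(napi, skb);

	/* refill the slot for a future frame */
	ring->skb[idx] = netdev_alloc_skb(ring->dev, RX_BUF_SIZE);
}

/* New model: the ring only holds raw data buffers filled by DMA; the
 * sk_buff itself is allocated and populated right before delivery, so
 * its cache lines are hot when the stack uses them. */
static void new_model_rx(struct my_ring *ring, struct napi_struct *napi, u32 idx)
{
	struct sk_buff *skb = build_skb(ring->data[idx], RX_BUF_SIZE);

	if (unlikely(!skb))
		return;

	skb_put(skb, ring->desc[idx].len);
	skb->protocol = eth_type_trans(skb, ring->dev);
	napi_gro_receive(napi, skb);
}
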
But... we now ping-pong in slab_alloc() whenever the skb consumer runs
on a different cpu (typically the case when one cpu is fully busy in
softirq handling under stress, or when RPS/RFS techniques are used).
The softirq handler and the consumers then compete on a heavily
contended cache line for _every_ allocation and free.
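
For what it is worth, the effect is easy to reproduce with a trivial
userspace analogue (purely illustrative; the ring size and the 192-byte
object size are made up, and this is not kernel code): one thread
allocates objects and pushes them through a small ring, another thread
frees them, so the allocator state for those objects bounces between
the two cpus on every single object.

#include <pthread.h>
#include <stdlib.h>

#define NOBJ    (1UL << 20)
#define RING_SZ 1024

static void *ring[RING_SZ];
static volatile unsigned long head, tail;

static void *producer(void *arg)		/* think: softirq handler cpu */
{
	for (unsigned long i = 0; i < NOBJ; i++) {
		while (head - tail == RING_SZ)	/* ring full: wait for consumer */
			;
		ring[head % RING_SZ] = malloc(192);	/* ~ skbuff_head_cache object */
		__sync_synchronize();
		head++;
	}
	return arg;
}

static void *consumer(void *arg)		/* think: socket consumer cpu */
{
	for (unsigned long i = 0; i < NOBJ; i++) {
		while (head == tail)		/* ring empty: wait for producer */
			;
		free(ring[tail % RING_SZ]);	/* "remote" free */
		__sync_synchronize();
		tail++;
	}
	return arg;
}

int main(void)
{
	pthread_t p, c;

	/* pin the two threads to different cpus (taskset or
	 * pthread_setaffinity_np) to see the ping-pong in a profiler */
	pthread_create(&p, NULL, producer, NULL);
	pthread_create(&c, NULL, consumer, NULL);
	pthread_join(p, NULL);
	pthread_join(c, NULL);
	return 0;
}

SLAB amortizes this pattern because frees first land in the freeing
cpu's per-cpu array cache and are given back to the slab lists in
batches, so the contended lines are touched far less than once per
object.
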
Switching to SLAB solves the problem.

perf profile for SLAB (no packet drops, and 5% idle still available),
for CPU0 (the one handling softirqs): we see normal network functions
in a network workload :)

  9.45%  [kernel]  [k] ipt_do_table
  7.81%  [kernel]  [k] __udp4_lib_lookup.clone.46
  7.11%  [kernel]  [k] build_skb
  5.85%  [tg3]     [k] tg3_poll_work
  4.39%  [kernel]  [k] udp_queue_rcv_skb
  4.37%  [kernel]  [k] sock_def_readable
  4.21%  [kernel]  [k] __sk_mem_schedule
  3.72%  [kernel]  [k] __netif_receive_skb
  3.21%  [kernel]  [k] __udp4_lib_rcv
  2.98%  [kernel]  [k] nf_iterate
  2.85%  [kernel]  [k] _raw_spin_lock
  2.85%  [kernel]  [k] ip_route_input_common
  2.83%  [kernel]  [k] sock_queue_rcv_skb
  2.77%  [kernel]  [k] ip_rcv
  2.76%  [kernel]  [k] __kmalloc
  2.03%  [kernel]  [k] kmem_cache_alloc
  1.93%  [kernel]  [k] _raw_spin_lock_irqsave
  1.76%  [kernel]  [k] eth_type_trans
  1.49%  [kernel]  [k] nf_hook_slow
  1.46%  [kernel]  [k] inet_gro_receive
  1.27%  [tg3]     [k] tg3_alloc_rx_data

With SLUB: we see contention in __slab_alloc, and packet drops.

 13.13%  [kernel]  [k] __slab_alloc.clone.56
  8.81%  [kernel]  [k] ipt_do_table
  7.41%  [kernel]  [k] __udp4_lib_lookup.clone.46
  4.64%  [tg3]     [k] tg3_poll_work
  3.93%  [kernel]  [k] build_skb
  3.65%  [kernel]  [k] udp_queue_rcv_skb
  3.33%  [kernel]  [k] __netif_receive_skb
  3.26%  [kernel]  [k] kmem_cache_alloc
  3.16%  [kernel]  [k] sock_def_readable
  3.15%  [kernel]  [k] nf_iterate
  3.13%  [kernel]  [k] __sk_mem_schedule
  2.81%  [kernel]  [k] __udp4_lib_rcv
  2.58%  [kernel]  [k] setup_object.clone.50
  2.54%  [kernel]  [k] sock_queue_rcv_skb
  2.32%  [kernel]  [k] ip_route_input_common
  2.25%  [kernel]  [k] ip_rcv
  2.14%  [kernel]  [k] _raw_spin_lock
  1.95%  [kernel]  [k] eth_type_trans
  1.55%  [kernel]  [k] inet_gro_receive
  1.50%  [kernel]  [k] ksize
  1.42%  [kernel]  [k] __kmalloc
  1.29%  [kernel]  [k] _raw_spin_lock_irqsave

Notice new_slab() is not there at all.

Adding SLUB_STATS gives:
$ cd /sys/kernel/slab/skbuff_head_cache ; grep . *
aliases:6
align:8
grep: alloc_calls: Function not implemented
alloc_fastpath:89181782 C0=89173048 C1=1599 C2=1357 C3=2140 C4=802 C5=675 C6=638 C7=1523
alloc_from_partial:412658 C0=412658
alloc_node_mismatch:0
alloc_refill:593417 C0=593189 C1=19 C2=15 C3=24 C4=51 C5=18 C6=17 C7=84
alloc_slab:2831313 C0=2831285 C1=2 C2=2 C3=2 C4=2 C5=12 C6=4 C7=4
alloc_slowpath:4430371 C0=4430112 C1=20 C2=17 C3=25 C4=57 C5=31 C6=21 C7=88
cache_dma:0
cmpxchg_double_cpu_fail:0
cmpxchg_double_fail:1 C0=1
cpu_partial:30
cpu_partial_alloc:592991 C0=592981 C2=1 C4=5 C5=2 C6=1 C7=1
cpu_partial_free:4429836 C0=592981 C1=25 C2=19 C3=23 C4=3836767 C5=6 C6=8 C7=7
cpuslab_flush:0
cpu_slabs:107
deactivate_bypass:3836954 C0=3836923 C1=1 C2=2 C3=1 C4=6 C5=13 C6=4 C7=4
deactivate_empty:2831168 C4=2831168
deactivate_full:0
deactivate_remote_frees:0
deactivate_to_head:0
deactivate_to_tail:0
destroy_by_rcu:0
free_add_partial:0
grep: free_calls: Function not implemented
free_fastpath:21192924 C0=21186268 C1=1420 C2=1204 C3=1966 C4=572 C5=349 C6=380 C7=765
free_frozen:67988498 C0=516 C1=121 C2=85 C3=841 C4=67986468 C5=215 C6=76 C7=176
free_remove_partial:18 C4=18
free_slab:2831186 C4=2831186
free_slowpath:71825749 C0=609 C1=146 C2=104 C3=864 C4=71823538 C5=221 C6=84 C7=183
hwcache_align:0
min_partial:5
objects:2494
object_size:192
objects_partial:121
objs_per_slab:21
order:0
order_fallback:0
partial:14
poison:0
reclaim_account:0
red_zone:0
reserved:0
sanity_checks:0
slabs:127
slabs_cpu_partial:99(99) C1=25(25) C2=18(18) C3=23(23) C4=16(16) C5=4(4) C6=7(7) C7=6(6)
slab_size:192
store_user:0
total_objects:2667
trace:0