* Re: Initial thoughts on TXDP
[not found] ` <CALx6S36ywu3ruY7AFKYk=N4Ekr5zjY33ivx92EgNNT36XoXhFA@mail.gmail.com>
@ 2016-12-02 12:13 ` Jesper Dangaard Brouer
[not found] ` <859a0c99-f427-1db8-d260-1297777792fb@stressinduktion.org>
1 sibling, 0 replies; 2+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-02 12:13 UTC (permalink / raw)
To: Tom Herbert
Cc: brouer, Florian Westphal, Linux Kernel Network Developers, linux-mm
On Thu, 1 Dec 2016 11:51:42 -0800 Tom Herbert <tom@herbertland.com> wrote:
> On Wed, Nov 30, 2016 at 6:44 PM, Florian Westphal <fw@strlen.de> wrote:
> > Tom Herbert <tom@herbertland.com> wrote:
[...]
> >> - Call into TCP/IP stack with page data directly from driver-- no
> >> skbuff allocation or interface. This is essentially provided by the
> >> XDP API although we would need to generalize the interface to call
> >> stack functions (I previously posted patches for that). We will also
> >> need a new action, XDP_HELD?, that indicates the XDP function held the
> >> packet (put on a socket for instance).
> >
> > Seems this will not work at all with the planned page pool thing when
> > pages start to be held indefinitely.
It is quite the opposite: the page pool supports pages being held for
longer times than drivers do today. The current driver page-recycle
tricks cannot, as they depend on the page refcnt being decremented
quickly (while the pages are still mapped in their recycle queue).
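To make that concrete, here is a minimal sketch of the kind of recycle
check a driver does today; the function name is made up and this is an
illustration only, not code from any specific driver:

#include <linux/mm.h>

/* Reuse of an RX page is only safe when the driver holds the last
 * reference. A socket holding the page indefinitely keeps the refcnt
 * elevated, so this test fails and the driver falls back to the page
 * allocator.
 */
static bool rx_page_can_recycle(struct page *page)
{
	return page_ref_count(page) == 1;
}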
> > You can also never get even close to userspace offload stacks once you
> > need/do this; allocations in hotpath are too expensive.
Yes. It is important to understand that once the number of outstanding
pages gets large, the driver recycling stops working, and page
allocations start to go through the page allocator. I've documented[1]
that the bare alloc+free cost[2] (231 cycles for an order-0/4K page) is
higher than the 10G wirespeed budget (201 cycles).

Thus, the driver recycle tricks are nice for benchmarking, as they hide
the page allocator overhead, but this optimization might disappear for
Tom's and Eric's more real-world use-cases, e.g. 10,000 sockets. The
page pool doesn't have these issues.
[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
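For reference, a small user-space calculation of where the 201-cycle
budget comes from, assuming minimum-size frames and a 3 GHz CPU clock
(both assumptions for illustration):

#include <stdio.h>

int main(void)
{
	/* 64B frame + 7B preamble + 1B SFD + 12B inter-frame gap = 84B on the wire */
	double wire_bytes = 84.0;
	double link_bps   = 10e9;	/* 10 Gbit/s */
	double cpu_hz     = 3e9;	/* assumed 3 GHz CPU */

	double pps    = link_bps / (wire_bytes * 8);	/* ~14.88 Mpps  */
	double ns     = 1e9 / pps;			/* ~67.2 ns/pkt */
	double cycles = cpu_hz / pps;			/* ~201 cycles  */

	printf("%.2f Mpps, %.1f ns/packet, %.0f cycles/packet\n",
	       pps / 1e6, ns, cycles);
	return 0;
}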
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
* Re: Initial thoughts on TXDP
[not found] ` <859a0c99-f427-1db8-d260-1297777792fb@stressinduktion.org>
@ 2016-12-02 13:01 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 2+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-02 13:01 UTC (permalink / raw)
To: Hannes Frederic Sowa
Cc: brouer, Tom Herbert, Florian Westphal,
Linux Kernel Network Developers, Alexander Duyck, John Fastabend,
linux-mm
On Thu, 1 Dec 2016 23:47:44 +0100
Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> Side note:
>
> On 01.12.2016 20:51, Tom Herbert wrote:
> >> > E.g. "mini-skb": Even if we assume that this provides a speedup
> >> > (where does that come from? should make no difference if a 32 or
> >> > 320 byte buffer gets allocated).
Yes, the size of the allocation from the SLUB allocator does not change
the base performance/cost much (at least for small objects, below 1024
bytes).

Do notice that the base SLUB alloc+free cost is fairly high (compared to
a 201-cycle budget), especially for networking, as the free side is very
likely to hit a slow path: the SLUB fast path costs around 53 cycles, and
the slow path around 100 cycles (data from [1]). I've tried to address
this with the kmem_cache bulk APIs, which reduce the cost to approx 30
cycles. (Something we have not fully reaped the benefit of yet!)
[1] https://git.kernel.org/torvalds/c/ca257195511
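For context, a minimal sketch of how the kmem_cache bulk APIs are used;
the batch size and GFP flags here are illustrative assumptions, not
taken from any existing user:

#include <linux/slab.h>

#define BATCH 16

static void bulk_demo(struct kmem_cache *cache)
{
	void *objs[BATCH];
	int n;

	/* Allocate a batch in one call, amortizing the per-object cost. */
	n = kmem_cache_alloc_bulk(cache, GFP_ATOMIC, BATCH, objs);
	if (!n)
		return;

	/* ... use objs[0..n-1] ... */

	/* Return the whole batch in one call as well. */
	kmem_cache_free_bulk(cache, n, objs);
}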
> >> >
> > It's the zeroing of three cache lines. I believe we talked about that
> > at netdev.

Actually it is 4 cache lines, but with some cleanup I believe we can get
down to clearing 192 bytes, i.e. 3 cache lines.
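To make the "3 vs 4 cache lines" concrete, this is the style of clear
being discussed, mirroring what __alloc_skb() does (a sketch for
illustration, not the exact upstream code):

#include <linux/skbuff.h>
#include <linux/string.h>

/* Zero the sk_buff up to the 'tail' marker. Today that span is about
 * 4 cache lines; the cleanup mentioned above would shrink it to
 * 192 bytes, i.e. 3 cache lines.
 */
static void clear_skb_head(struct sk_buff *skb)
{
	memset(skb, 0, offsetof(struct sk_buff, tail));
}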
>
> Jesper and I played with that again very recently:
>
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590
>
> In micro-benchmarks we saw a pretty good speed-up from not using the
> rep stosb generated by the gcc builtin, but plain movq's instead.
> Probably the cost model for __builtin_memset in gcc is wrong?
Yes, I believe so.
> When Jesper is free we wanted to benchmark this and maybe come up with
> an arch-specific way of clearing, if it turns out to really improve
> throughput.
>
> SIMD instructions seem even faster but the kernel_fpu_begin/end() kill
> all the benefits.
One strange thing was that on my Skylake CPU (i7-6700K @ 4.00GHz),
Hannes's hand-optimized MOVQ ASM code didn't go past 8 bytes per cycle,
or 32 cycles for 256 bytes.

Talking to Alex and John during netdev, and reading up on the Intel
architecture, I thought this CPU should be able to perform 16 bytes per
cycle. The CPU can do it, as the rep-stos numbers show once the size
gets large enough.
On this CPU the memset rep stos starts to win around 512 bytes
(size in bytes / cycles = bytes per cycle):

  192/35  =  5.5 bytes/cycle
  256/36  =  7.1 bytes/cycle
  512/40  = 12.8 bytes/cycle
  768/46  = 16.7 bytes/cycle
 1024/52  = 19.7 bytes/cycle
 2048/84  = 24.4 bytes/cycle
 4096/148 = 27.7 bytes/cycle
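For reference, a user-space sketch of the "plain movq" style of clearing
that these numbers are compared against; the function name is made up,
and the real benchmark code is in the time_bench_memset.c link quoted
above:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Clear a buffer with 8-byte stores (movq on x86-64), as an alternative
 * to the compiler-generated rep stosb discussed above.
 */
static void clear_movq(void *dst, size_t len)
{
	uint64_t *p = dst;
	size_t i;

	for (i = 0; i < len / 8; i++)
		p[i] = 0;	/* one 64-bit store per iteration */

	/* handle any non-multiple-of-8 tail */
	memset((char *)dst + (len & ~7UL), 0, len & 7UL);
}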
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer