From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, Daniel Phillips <phillips@google.com>
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD
Date: Wed, 09 Aug 2006 15:32:33 +0200
Message-ID: <1155130353.12225.53.camel@twins>
In-Reply-To: <20060809130752.GA17953@2ka.mipt.ru>

On Wed, 2006-08-09 at 17:07 +0400, Evgeniy Polyakov wrote:
> On Wed, Aug 09, 2006 at 02:37:20PM +0200, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > On Wed, 2006-08-09 at 09:46 +0400, Evgeniy Polyakov wrote:
> > > On Tue, Aug 08, 2006 at 09:33:25PM +0200, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > > >    http://lwn.net/Articles/144273/
> > > >    "Kernel Summit 2005: Convergence of network and storage paths"
> > > > 
> > > > We believe that an approach very much like today's patch set is
> > > > necessary for NBD, iSCSI, AoE or the like ever to work reliably. 
> > > > We further believe that a properly working version of at least one of
> > > > these subsystems is critical to the viability of Linux as a modern
> > > > storage platform.
> > > 
> > > There is another approach for that - do not use the slab allocator
> > > for network dataflow at all. It automatically has all your pros, and
> > > if implemented correctly can have a lot of additional useful and
> > > high-performance features like full zero-copy and total fragmentation
> > > avoidance.
> > 
> > On your site where you explain the Network Tree Allocator:
> > 
> >  http://tservice.net.ru/~s0mbre/blog/devel/networking/nta/index.html
> > 
> > You only test the fragmentation scenario with the full scale of sizes.
> > Fragmentation will look different if you use a limited number of sizes
> > that share no factors (other than the block size); try 19, 37 and 79 
> > blocks with 1:1:1 ratio.
     ^^^^^^

> 19, 37 and 79 will be rounded by SLAB to 32, 64 and 128 bytes; with NTA it
> will be 32, 64 and 96 bytes. NTA wins in each allocation which is not a
> power of two (I use 32-byte alignment, as the smallest one which SLAB
> uses). And as you saw in the blog, the network tree allocator is faster
> than the SLAB one, although it can have different side effects which are
> not yet 100% discovered.

So that would end up being 19*32 = 608 bytes, etc.
As for speed, sure.
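
To make the arithmetic concrete, here is a minimal sketch of the two
rounding policies as the quoted text describes them: rounding up to a
power of two (the behaviour attributed to SLAB above) and rounding up to
a multiple of 32 bytes (the alignment quoted for NTA). The helper names
and the 32-byte block size are assumptions for illustration only; this
is neither SLAB nor NTA code.

```python
BLOCK = 32  # assumed block size in bytes, from the 32-byte alignment quoted above


def round_pow2(size, minimum=BLOCK):
    """Round size up to the next power of two, never below `minimum`."""
    return max(minimum, 1 << (size - 1).bit_length())


def round_align(size, align=BLOCK):
    """Round size up to the next multiple of `align`."""
    return -(-size // align) * align


sizes = [19, 37, 79]

# Reading the sizes as bytes, as in the quoted reply.
print([round_pow2(s) for s in sizes])   # [32, 64, 128]
print([round_align(s) for s in sizes])  # [32, 64, 96]

# Reading the sizes as blocks, as originally suggested.
print([s * BLOCK for s in sizes])       # [608, 1184, 2528]
```

Read as blocks, the three sizes are already multiples of 32 bytes, so
alignment rounding gains nothing there; what matters is how the free
space fragments when sizes with no common factor are mixed.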

> > Also, I have yet to see how you will do full zero-copy receives; full
> > zero-copy would mean getting the data from driver DMA to user-space
> > without a single copy. The to user-space part almost requires that
> > each packet live on its own page.
> 
> Each page can easily have several packets inside.

For sure, the problem is: do you know which user-space process a packet
is destined for before you receive it?

> > As for the VM deadlock avoidance; I see no zero overhead allocation path
> > - you do not want to deadlock your allocator. I see no critical resource 
> > isolation (our SOCK_MEMALLOC). Without these things your allocator might
> > improve the status quo but it will not aid in avoiding the deadlock we
> > try to tackle here.
> 
> Because such a reservation is not needed at all.
> SLAB OOM can be handled by reserving a pool using SOCK_MEMALLOC and
> similar hacks, and a different allocator, which obviously works with its
> own pool of pages, cannot suffer from SLAB problems.
> 
> You say "critical resource isolation", but it is not the case - consider
> NFS over UDP - the remote side will not stop sending just because the
> receiving socket code drops data due to OOM, or IPsec or compression,
> which can require reallocation. There is no "critical resource
> isolation", since the reserved pool _must_ be used by everyone in the
> kernel network stack.

The idea is to drop all !NFS packets (or, more specifically, to keep only
those NFS packets that belong to the critical mount). Anybody doing
critical IO over layered networks like IPsec or other tunnel constructs
is asking for trouble - just DON'T do that.

Dropping these non-essential packets makes sure the reserve memory
doesn't get stuck in some random blocked user-space process, hence you
can make progress.
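
As a rough illustration of that policy, and not of the patch set's
actual code, the sketch below classifies packets that have already been
received and, while the system is under memory pressure, keeps only
those destined for a socket marked critical. The names `Packet`,
`filter_under_pressure` and the socket labels are hypothetical.

```python
from collections import namedtuple

# Hypothetical packet record: destination socket label and payload size in bytes.
Packet = namedtuple("Packet", ["sock", "size"])


def filter_under_pressure(packets, critical_socks, under_pressure):
    """Split packets into (kept, dropped).

    Under memory pressure only packets for critical sockets are kept;
    everything else is dropped at once so its buffer returns to the
    reserve instead of waiting on a blocked user-space reader.
    """
    kept, dropped = [], []
    for pkt in packets:
        if under_pressure and pkt.sock not in critical_socks:
            dropped.append(pkt)
        else:
            kept.append(pkt)
    return kept, dropped


rx = [Packet("nfs-critical", 1500), Packet("http", 1500), Packet("nfs-other", 512)]
kept, dropped = filter_under_pressure(rx, {"nfs-critical"}, under_pressure=True)
print([p.sock for p in kept])     # ['nfs-critical']
print([p.sock for p in dropped])  # ['http', 'nfs-other']
```

Note that each packet still has to be received before it can be
classified, which is why the receive path itself needs memory that is
guaranteed to come back.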

> And as you saw, fragmentation issues are handled very well in NTA; just
> consider a usual data packet with a 1500 MTU - 500 bytes are wasted.
> If you use jumbo frames... it is possible to end up with a 32k allocation
> for a 9k jumbo frame with some hardware.

Sure, SLAB does suck at some things, and I don't argue that NTA will
not improve on it. It's just that 'total fragmentation avoidance' is too
strong a claim, and this deadlock avoidance needs more.
