Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Daniel Phillips <phillips@google.com>
To: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
	David Miller <davem@davemloft.net>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD
Date: Thu, 17 Aug 2006 16:24:26 -0700	[thread overview]
Message-ID: <44E4FAAA.2050104@google.com> (raw)
In-Reply-To: <20060817194850.GA19647@2ka.mipt.ru>

Evgeniy Polyakov wrote:
> On Thu, Aug 17, 2006 at 09:15:14PM +0200, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> 
>>>I got openssh as example of situation when system does not know in 
>>>advance, what sockets must be marked as critical.
>>>OpenSSH works with network and unix sockets in parallel, so you need to
>>>hack openssh code to be able to allow it to use reserve when there is 
>>>not enough memory.
>>
>>OpenSSH or any other user-space program will never ever have the problem
>>we are trying to solve. Nor could it be fixed the way we are solving it,
>>user-space programs can be stalled indefinite. We are concerned with
>>kernel services, and the continued well being of the kernel, not
>>user-space. (well therefore indirectly also user-space of course)
> 
> 
> You limit your system here - it is possible that userspace should send
> some command when kernel agent requires some negotiation.
> And even for them it is possible to require ARP request and/or ICMP
> processing.
> 
> 
>>> Actually all sockets must be able to get data, since
>>>kernel can not diffirentiate between telnetd and a process which will 
>>>receive an ack for swapped page or other significant information.
>>
>>Oh, but it does, the kernel itself controls those sockets: NBD / iSCSI
>>and AoE are all kernel services, not user-space. And it is the core of
>>our work to provide this information to the kernel; to distinguish these
>>few critical sockets.
> 
> 
> As I stated above it is not enough.
> And even if you will cover all kernel-only network allocations, which are
> used for your selected datapath, problem still there - admin is unable
> to connect although it can be critical connection too.
> 
> 
>>>So network must behave separately from main allocator in that period of 
>>>time, but since it is possible that reserve can be not filled or has not
>>>enough space or something other, it must be preallocated in far advance
>>>and should be quite big, but then why netwrok should use it at all, when
>>>being separated from main allocations solves the problem?
>>
>>You still need to guarantee data-paths to these services, and you need
>>to make absolutely sure that your last bit of memory is used to service
>>these critical sockets, not some random blocked user-space process.
>>
>>You cannot pre-allocate enough memory _ever_ to satisfy the total
>>capacity of the network stack. You _can_ allot a certain amount of
>>memory to the network stack (avoiding DoS), and drop packets once you
>>exceed that. But still, you need to make sure these few critical
>>_kernel_ services get their data.
> 
> Feel free to implement any receiving policy inside _separated_ allocator
> to meet your needs, but if allocator depends on main system's memory
> conditions it is always possible that it will fail to make forward
> progress.

Wrong.  Our main allocator has a special reserve that can be accessed
only by a task that has its PF_MEMALLOC flag set.  This reserve exists
in order to guarantee forward progress in just such situations as the
network runs into when it is trying to receive responses from a remote
disk.  Anything otherwise is a bug.

>>>I do not argue that your approach is bad or does not solve the problem,
>>>I'm just trying to show that further evolution of that idea eventually
>>>ends up in separated allocator (as long as all most robust systems
>>>separate operations), which can improve things in a lot of other sides
>>>too.
>>
>>Not a separate allocator per-se, separate socket group, they are
>>serviced by the kernel, they will never refuse to process data, and it
>>is critical for the continued well-being of your kernel that they get
>>their data.

The memalloc reserve is indeed a separate reserve, however it is a
reserve that already exists, and you are busy creating another separate
reserve to further partition memory.  Partitioning memory is not the
direction we should be going, we should be trying to unify our reserves
wherever possible, and find ways such as Andrew and others propose to
implement "reserve on demand", or in other words, take advantage of
"easily freeable" pages.

If your allocation code is so much more efficient than slab then why
don't you fix slab instead of replicating functionality that already
exists elsewhere in the system, and has since day one?

> You do not know in advance which sockets must be separated (since only
> in the simplest situation it is the same as in NBD and is
> kernelspace-only),

Yes we do, they are exactly those sockets that lie in the block IO path.
The VM cannot deadlock on any others.

> you can not solve problem with ARP/ICMP/route changes and other control
> messages, netfilter, IPsec and compression which still can happen in your 
> setup, 

If you bothered to read the patches you would know that ICMP is indeed
handled.  ARP I don't think so, we may need to do that since ARP can
believably be required for remote disk interface failover.  Anybody
who designs ssh into remote disk failover is an idiot.  Ssh for
configuration, monitoring and administration is fine.  Ssh for fencing
or whatever is just plain stupid, unless the nodes running both server
and client are not allowed to mount the remote disk.

> if something goes wrong and receiving will require additional
> allocation from network datapath, system is dead,
> this strict conditions does not allow flexible control over possible
> connections and does not allow to create additional connections.

You know that how?

>>Also, I do not think people would like it if say 100M of their 1G system
>>just disappears, never to used again for eg. page-cache in periods of
>>low network traffic.
> 
> Just for clarification: network tree allocator gets 512kb and then
> increases cache size when it is required. Default value can be changed
> of course.

Great.  Now why does the network layer need its own, invented-in-netland
allocator?  Why can't everybody use your allocator if it is better?

Also, please don't get the idea that your allocator by itself solves the
block IO receive starvation problem.  At the very least you need to do
something about network traffic that is unrelated to forward progress of
memory writeout, yet can starve the memory writeout.  Oh wait, our patch
set already does that.

Regards,

Daniel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2006-08-17 23:24 UTC|newest]

Thread overview: 140+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-08-08 19:33 Peter Zijlstra
2006-08-08 19:33 ` [RFC][PATCH 1/9] pfn_to_kaddr() for UML Peter Zijlstra
2006-08-08 19:33 ` [RFC][PATCH 2/9] deadlock prevention core Peter Zijlstra
2006-08-08 20:57   ` Stephen Hemminger
2006-08-08 21:05     ` Peter Zijlstra
2006-08-09  1:33     ` Daniel Phillips
2006-08-09  1:38       ` David Miller, Daniel Phillips
2006-08-08 21:17   ` Thomas Graf
2006-08-09  1:34     ` Daniel Phillips
2006-08-09  1:39       ` David Miller, Daniel Phillips
2006-08-09  5:47         ` Daniel Phillips
2006-08-09 13:19           ` Thomas Graf
2006-08-09 14:07             ` Peter Zijlstra
2006-08-09 16:18               ` Thomas Graf
2006-08-09 16:19                 ` Peter Zijlstra
2006-08-10  0:01                   ` David Miller, Peter Zijlstra
2006-08-09 23:58               ` David Miller, Peter Zijlstra
2006-08-10  6:25                 ` Peter Zijlstra
2006-08-11  4:24                 ` Stephen Hemminger
2006-08-13 21:22                 ` Daniel Phillips
2006-08-13 23:49                   ` David Miller, Daniel Phillips
2006-08-14  1:15                     ` Daniel Phillips
2006-08-11  2:37     ` Rik van Riel
2006-08-13 22:05       ` Daniel Phillips
2006-08-13 23:55         ` David Miller, Daniel Phillips
2006-08-14  1:31           ` Daniel Phillips
2006-08-14  1:53             ` Andrew Morton
2006-08-14  4:40               ` Peter Zijlstra
2006-08-14  4:58                 ` Andrew Morton
2006-08-14  5:03                   ` Peter Zijlstra
2006-08-14  5:22                     ` Andrew Morton
2006-08-14  6:45                       ` Peter Zijlstra
2006-08-14  7:07                         ` Andrew Morton
2006-08-14  8:15                           ` Peter Zijlstra
2006-08-14  8:25                             ` Evgeniy Polyakov
2006-08-14  8:35                               ` Peter Zijlstra
2006-08-14  8:33                           ` David Miller, Andrew Morton
2006-08-17  4:27                           ` Daniel Phillips
2006-08-14  7:17                         ` Neil Brown
2006-08-14  7:31                           ` Evgeniy Polyakov
2006-08-17  3:58                   ` Daniel Phillips
2006-08-17  5:57                     ` Andrew Morton
2006-08-17 23:53                       ` Daniel Phillips
2006-08-18  0:24                         ` Rik van Riel
2006-08-18  0:35                         ` Daniel Phillips
2006-08-18  1:14                         ` Neil Brown
2006-08-18  6:05                         ` Andrew Morton
2006-08-18 21:22                           ` Daniel Phillips
2006-08-18 22:34                             ` Andrew Morton
2006-08-18 23:44                               ` Daniel Phillips
2006-08-19  2:44                                 ` Andrew Morton
2006-08-19  4:14                                   ` Network receive stall avoidance (was [PATCH 2/9] deadlock prevention core) Daniel Phillips
2006-08-19  7:28                                     ` Andrew Morton
2006-08-19 15:06                                   ` [RFC][PATCH 2/9] deadlock prevention core Rik van Riel
2006-08-20  1:33                                     ` Andre Tomt
2006-08-19 16:53                                   ` Ray Lee
2006-08-21 13:27                                   ` Philip R. Auld
2006-08-25 10:47                                     ` Pavel Machek
2006-08-21 13:38                                 ` Jens Axboe
2006-08-08 22:10   ` David Miller
2006-08-09  1:35     ` Daniel Phillips
2006-08-09  1:41       ` David Miller, Daniel Phillips
2006-08-09  5:44         ` Daniel Phillips
2006-08-09  7:00           ` Peter Zijlstra
     [not found]   ` <42414.81.207.0.53.1155080443.squirrel@81.207.0.53>
2006-08-09  0:25     ` Daniel Phillips
2006-08-09 12:02       ` Indan Zupancic
2006-08-09 12:54         ` Peter Zijlstra
2006-08-09 13:48           ` Indan Zupancic
2006-08-09 14:00             ` Peter Zijlstra
2006-08-09 18:34               ` Indan Zupancic
2006-08-09 19:45                 ` Peter Zijlstra
2006-08-09 20:19                   ` Peter Zijlstra
2006-08-10  1:21                   ` Indan Zupancic
2006-08-09 16:05   ` -v2 " Peter Zijlstra
2006-08-08 19:33 ` [RFC][PATCH 3/9] e1000 driver conversion Peter Zijlstra
2006-08-08 20:50   ` Auke Kok
2006-08-08 20:59     ` Peter Zijlstra
2006-08-08 22:32     ` David Miller, Auke Kok
2006-08-08 22:42       ` Auke Kok
2006-08-08 19:34 ` [RFC][PATCH 4/9] e100 " Peter Zijlstra
2006-08-08 20:13   ` Auke Kok
2006-08-08 20:18     ` Peter Zijlstra
2006-08-08 19:34 ` [RFC][PATCH 5/9] r8169 " Peter Zijlstra
2006-08-08 19:34 ` [RFC][PATCH 6/9] tg3 " Peter Zijlstra
2006-08-08 19:34 ` [RFC][PATCH 7/9] UML eth " Peter Zijlstra
2006-08-08 19:34 ` [RFC][PATCH 8/9] 3c59x " Peter Zijlstra
2006-08-08 23:07   ` Jeff Garzik
2006-08-09  5:51     ` Daniel Phillips
2006-08-09  5:55       ` David Miller, Daniel Phillips
2006-08-09  6:30         ` Jeff Garzik
2006-08-09  7:03           ` Peter Zijlstra
2006-08-09  7:20             ` Jeff Garzik
2006-08-13 19:38         ` Daniel Phillips
2006-08-13 19:53           ` Jeff Garzik
2006-08-08 19:34 ` [RFC][PATCH 9/9] deadlock prevention for NBD Peter Zijlstra
2006-08-09  5:46 ` [RFC][PATCH 0/9] Network receive " Evgeniy Polyakov
2006-08-09  5:52   ` Daniel Phillips
2006-08-09  5:56     ` David Miller, Daniel Phillips
2006-08-09  5:53   ` David Miller, Evgeniy Polyakov
2006-08-09  5:55     ` Evgeniy Polyakov
2006-08-09 12:37   ` Peter Zijlstra
2006-08-09 13:07     ` Evgeniy Polyakov
2006-08-09 13:32       ` Peter Zijlstra
2006-08-09 19:29         ` Evgeniy Polyakov
2006-08-09 23:54         ` David Miller, Peter Zijlstra
2006-08-10  6:06           ` Peter Zijlstra
2006-08-13 20:16             ` Daniel Phillips
2006-08-14  5:13               ` Evgeniy Polyakov
2006-08-14  6:45                 ` Peter Zijlstra
2006-08-14  6:54                   ` Evgeniy Polyakov
2006-08-17  4:49                     ` Daniel Phillips
2006-08-17  4:48                 ` Daniel Phillips
2006-08-17  5:36                   ` Evgeniy Polyakov
2006-08-17 18:01                     ` Daniel Phillips
2006-08-17 18:42                       ` Evgeniy Polyakov
2006-08-17 19:15                         ` Peter Zijlstra
2006-08-17 19:48                           ` Evgeniy Polyakov
2006-08-17 23:24                             ` Daniel Phillips [this message]
2006-08-18  7:16                               ` Evgeniy Polyakov
2006-08-12  3:42         ` Rik van Riel
2006-08-12  8:47           ` Evgeniy Polyakov
2006-08-12  9:19             ` Peter Zijlstra
2006-08-12  9:37               ` Evgeniy Polyakov
2006-08-12 10:18                 ` Peter Zijlstra
2006-08-12 10:42                   ` Evgeniy Polyakov
2006-08-12 10:51                     ` Evgeniy Polyakov
2006-08-12 11:40                     ` Peter Zijlstra
2006-08-12 11:53                       ` Evgeniy Polyakov
2006-08-13  0:46                   ` David Miller, Peter Zijlstra
2006-08-13  1:11                     ` Rik van Riel
2006-08-12 14:40                 ` Rik van Riel
2006-08-12 14:49                   ` Evgeniy Polyakov
2006-08-12 14:56                     ` Rik van Riel
2006-08-12 15:08                       ` Evgeniy Polyakov
2006-08-12 15:22                         ` Peter Zijlstra
2006-08-14  0:56                         ` Daniel Phillips
2006-08-13  0:46                 ` David Miller, Evgeniy Polyakov
2006-08-13  9:06                   ` Evgeniy Polyakov
2006-08-13  9:52                     ` Evgeniy Polyakov
2006-08-15 19:17 ` Pavel Machek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=44E4FAAA.2050104@google.com \
    --to=phillips@google.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=davem@davemloft.net \
    --cc=johnpol@2ka.mipt.ru \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox