From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Andrew Morton <akpm@osdl.org>
Cc: Daniel Phillips <phillips@google.com>,
David Miller <davem@davemloft.net>,
riel@redhat.com, tgraf@suug.ch, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
Mike Christie <michaelc@cs.wisc.edu>
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core
Date: Mon, 14 Aug 2006 10:15:52 +0200
Message-ID: <1155543352.5696.137.camel@twins>
In-Reply-To: <20060814000736.80e652bb.akpm@osdl.org>

On Mon, 2006-08-14 at 00:07 -0700, Andrew Morton wrote:
> On Mon, 14 Aug 2006 08:45:40 +0200
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> > On Sun, 2006-08-13 at 22:22 -0700, Andrew Morton wrote:
> > > On Mon, 14 Aug 2006 07:03:55 +0200
> > > Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > >
> > > > On Sun, 2006-08-13 at 21:58 -0700, Andrew Morton wrote:
> > > > > On Mon, 14 Aug 2006 06:40:53 +0200
> > > > > Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > > >
> > > > > > Testcase:
> > > > > >
> > > > > > Mount an NBD device as the sole swap device, mmap more memory than
> > > > > > physical RAM, then loop through, touching each page only once.
> > > > >
> > > > > Fix: don't try to swap over the network. Yes, there may be some scenarios
> > > > > where people have no local storage, but it's reasonable to expect anyone
> > > > > who is using Linux as an "enterprise storage platform" to stick a local
> > > > > disk on the thing for swap.
> > > >
> > > > I wish you were right; however, there seems to be a large demand to go
> > > > diskless and swap over iSCSI, because disks seem to be the no. 1 failing
> > > > piece of hardware in systems these days.
> > >
> > > We could track dirty anonymous memory and throttle.
> > >
> > > Also, there must be some value of /proc/sys/vm/min_free_kbytes at which a
> > > machine is no longer deadlockable with any of these tricks. Do we know
> > > what level that is?
> >
> > Not sure; the theoretical maximum amount of memory one can 'lose' in
> > socket wait queues is well over the amount of physical memory we have
> > in machines today (even for SGI). This, combined with the fact that we
> > limit that memory in some way to avoid DoS attacks, could make for all
> > memory to be stuck in wait queues. Of course this becomes rather more
> > unlikely for ever larger amounts of memory, but unlikely is never a
> > guarantee.
>
> What is a "socket wait queue" and how/why can it consume so much memory?
>
> Can it be prevented from doing that?
>
> If this refers to the socket buffers, they're mostly allocated with
> at least __GFP_WAIT, aren't they?

Wherever it is that packets go when the local end is tied up and cannot
accept them instantly. The simple, but probably wrong, calculation I made
for Evgeniy is: suppose we have 64k sockets, each socket can buffer up to
128 packets, and each packet can be up to 16k large (rounding up for
jumbo frames); that makes for 128G of memory. This calculation is wrong
on several points (we can have more than 64k sockets, and I have no idea
about the 128), but the order of magnitude doesn't get any better.
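
Spelling the arithmetic out (all three constants are the guesses above,
not hard limits):

	/* Back-of-the-envelope worst case; every constant is an assumption. */
	#define NR_SOCKETS	(64ULL * 1024)	/* assumed number of sockets      */
	#define PKTS_PER_SOCK	128ULL		/* assumed per-socket backlog     */
	#define PKT_SIZE	(16ULL * 1024)	/* jumbo frame, rounded up, bytes */

	/* 2^16 * 2^7 * 2^14 = 2^37 bytes = 128 GiB */
	unsigned long long worst_case = NR_SOCKETS * PKTS_PER_SOCK * PKT_SIZE;
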
> > > > > That leaves MAP_SHARED, but mm-tracking-shared-dirty-pages.patch will fix
> > > > > that, will it not?
> > > >
> > > > It will make it less likely. One can still have memory pressure, and
> > > > the remaining bits of memory can still get stuck in socket queues for
> > > > blocked processes.
> > >
> > > But there's lots of reclaimable pagecache around and kswapd will free it
> > > up?
> >
> > Yes, however it is possible for kswapd and direct reclaim to block on
> > get_request_wait() for the nbd/iscsi request queue by sheer misfortune.
>
> Possibly there are some situations where kswapd will get stuck on request
> queues. But as long as the block layer is correctly calling
> set_queue_congested(), these are easily avoidable via
> bdi_write_congested().

Right, and this might, regardless of what we're going to end up doing,
be a good thing to do.
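
Something along these lines in the reclaim path would do it; this is only
a sketch (the wrapper function is made up, bdi_write_congested() is the
existing helper):

	/*
	 * Skip writing back a page whose backing device is congested
	 * instead of blocking in get_request_wait(); reclaim can then
	 * move on and try another page.
	 */
	static int may_writeout(struct page *page)	/* made-up name */
	{
		struct backing_dev_info *bdi =
			page->mapping->backing_dev_info;

		if (bdi_write_congested(bdi))
			return 0;
		return 1;
	}
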
> > In that case there will be no more reclaim; of course, the more active
> > processes we have, the unlikelier this becomes. Still, with the sheer
> > amount of CPU time invested in Linux, it's not a gamble we can expect
> > never to lose.
>
> I suspect that with mm-tracking-shared-dirty-pages.patch, a bit of tuning
> and perhaps some bugfixing we can make this problem go away for all
> practical purposes. Particularly if we're prepared to require local
> storage for swap (the paranoid can use RAID, no?).
>
> Seems to me that more investigation of these options is needed before we
> can justify adding lots of hard-to-test complexity to networking?

Well, my aim here, as disgusting as you might think it is, is to get
swap over network working. I sympathise with your stance of "don't do
that", but I have been set this task and shall try to come up with
something that does not offend people.

As for it being hard to test: I can supply some patches that would make
SROG (I still find the name horrid) the default network allocator, so the
code paths could be exercised more easily. As for the dropping of packets,
I could supply a debug control to switch it on and off regardless of
memory pressure.
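
For that debug control, something as small as a module parameter would do
(the name is made up):

	/*
	 * Debug knob: when set, always take the emergency drop path for
	 * non-critical sockets, even without memory pressure, so that
	 * path gets exercised during normal testing.
	 */
	static int debug_force_drop;
	module_param(debug_force_drop, int, 0644);
	MODULE_PARM_DESC(debug_force_drop, "Always take the emergency drop path");
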
As for overall complexity: a simple fallback allocator that kicks in when
the normal allocation path fails, plus some simple checks to drop packets
allocated in this fashion when they are not bound for critical sockets,
doesn't seem like a lot of complexity to me.
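
In rough pseudo-C, with all names made up for illustration, the idea is:

	/* Allocation side: fall back to the reserve when the normal path fails. */
	struct sk_buff *rx_alloc_skb(unsigned int size, gfp_t gfp)
	{
		struct sk_buff *skb = alloc_skb(size, gfp);

		if (!skb) {
			skb = emergency_alloc_skb(size);	/* hypothetical */
			if (skb)
				skb->emergency = 1;		/* hypothetical flag */
		}
		return skb;
	}

	/* Receive side: reserve memory only feeds sockets that service the VM
	 * (e.g. the NBD swap connection); everybody else drops the packet so
	 * the reserve cannot be exhausted. */
	static int sock_may_queue(struct sock *sk, struct sk_buff *skb)
	{
		if (skb->emergency && !sk_is_critical(sk))	/* hypothetical */
			return 0;
		return 1;
	}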