* [RFC] Net vm deadlock fix (preliminary)
From: Daniel Phillips @ 2005-08-03 6:57 UTC
To: netdev, linux-mm
[-- Attachment #1: Type: text/plain, Size: 2439 bytes --]
Hi,
Here is a preliminary patch, not tested at all, just to give everybody a
target to aim bricks at.
* A new __GFP_MEMALLOC flag gives access to the memalloc reserve.
* In dev_alloc_skb, if GFP_ATOMIC fails then try again with __GFP_MEMALLOC.
* We know an skb was allocated from reserve if we see __GFP_MEMALLOC in the
(misnamed) priority field.
* When a driver uses netif_rx to deliver the packet to the protocol layer,
if the packet was allocated from the reserve it is delivered directly to
the protocol layer; otherwise it is queued via softnet.
* When the protocol handler (tcp/ipv4 in this case) looks up the socket,
if the packet was allocated from reserve but the socket is not serving
vm traffic, the packet is discarded.
There are some users of __dev_alloc_skb that inherit the new memalloc behavior
for free. This is probably not a good thing. There are a dozen or so users
to check... later.
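(For the record, something like `grep -rn __dev_alloc_skb drivers/ net/' over
the 2.6.12 tree turns up that list.)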
I claimed earlier that an advantage of using the memalloc reserve over a
mempool is that the pool becomes available to the whole call chain of the
user. This isn't true in a softirq, sorry. Maybe we could make it true,
that's another question. Anyway, memalloc reserve vs mempool is a detail at
this point.
There is a big hole here through which precious reserve memory can escape: if
the network driver is allocating packets from reserve but a protocol handler
does not test such packets, things will deteriorate quickly. The easiest
thing to do is make sure all protocols know about this logic. They probably
all need to anyway.
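For illustration, the check each protocol needs is the same two lines the
tcp_ipv4.c hunk below adds after the socket lookup; a hypothetical placement
in udp_rcv would look like:

if (unlikely(is_memalloc_skb(skb)) && !is_memalloc_sock(sk))
	goto drop;	/* reserve packet, socket not serving vm IO */

(The label name is made up; each protocol has its own discard path.)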
A memalloc task (one handling IO on behalf of the vm) will set the SO_MEMALLOC
flag after creating the socket. The memalloc task will throttle the amount
of traffic in flight to keep the maximum reserve usage to some reasonable
amount. (It will be necessary to get more precise about this at some point.)
The memalloc task itself will be in PF_MEMALLOC mode when it uses this
socket.
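To sketch what the setup reduces to (the SO_MEMALLOC setsockopt plumbing is
assumed here, it is not part of this patch):

/* in the SO_MEMALLOC case of sock_setsockopt, say (assumed): */
sock_set_flag(sk, SOCK_MEMALLOC);

/* and the memalloc task brackets its block IO work with: */
current->flags |= PF_MEMALLOC;
/* ...submit and complete network block IO... */
current->flags &= ~PF_MEMALLOC;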
This patch only covers socket input, not output. As you can see, the fast
path is not compromised at all, and even when the low memory path triggers,
efficiency only falls off a little (allocations may take longer and we bypass
the softnet optimization). But the thing is, we don't fall back to
single-request-at-a-time handling, which is exactly what you don't want to do
when the vm is desperately trying to clean memory.
Regards,
Daniel
[-- Attachment #2: net.memalloc-2.6.12.3 --]
[-- Type: text/x-diff, Size: 4402 bytes --]
--- 2.6.12.3.clean/include/linux/gfp.h 2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/linux/gfp.h 2005-08-03 01:12:33.000000000 -0400
@@ -39,6 +39,7 @@
#define __GFP_COMP 0x4000u /* Add compound page metadata */
#define __GFP_ZERO 0x8000u /* Return zeroed page on success */
#define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
+#define __GFP_MEMALLOC 0x20000u /* Use emergency reserves */
#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
--- 2.6.12.3.clean/include/linux/skbuff.h 2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/linux/skbuff.h 2005-08-03 01:53:43.000000000 -0400
@@ -969,7 +969,6 @@
kfree_skb(skb);
}
-#ifndef CONFIG_HAVE_ARCH_DEV_ALLOC_SKB
/**
* __dev_alloc_skb - allocate an skbuff for sending
* @length: length to allocate
@@ -985,14 +984,14 @@
static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
int gfp_mask)
{
- struct sk_buff *skb = alloc_skb(length + 16, gfp_mask);
+ struct sk_buff *skb = alloc_skb(length += 16, gfp_mask);
+
+ if (unlikely(!skb))
+ skb = alloc_skb(length, gfp_mask|__GFP_MEMALLOC);
if (likely(skb))
skb_reserve(skb, 16);
return skb;
}
-#else
-extern struct sk_buff *__dev_alloc_skb(unsigned int length, int gfp_mask);
-#endif
/**
* dev_alloc_skb - allocate an skbuff for sending
@@ -1011,6 +1010,11 @@
return __dev_alloc_skb(length, GFP_ATOMIC);
}
+static inline int is_memalloc_skb(struct sk_buff *skb)
+{
+ return !!(skb->priority & __GFP_MEMALLOC);
+}
+
/**
* skb_cow - copy header of skb when it is required
* @skb: buffer to cow
--- 2.6.12.3.clean/include/net/sock.h 2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/net/sock.h 2005-08-03 01:20:56.000000000 -0400
@@ -382,6 +382,7 @@
SOCK_NO_LARGESEND, /* whether to sent large segments or not */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_MEMALLOC, /* protocol can use memalloc reserve */
};
static inline void sock_set_flag(struct sock *sk, enum sock_flags flag)
@@ -399,6 +400,11 @@
return test_bit(flag, &sk->sk_flags);
}
+static inline int is_memalloc_sock(struct sock *sk)
+{
+ return sock_flag(sk, SOCK_MEMALLOC);
+}
+
static inline void sk_acceptq_removed(struct sock *sk)
{
sk->sk_ack_backlog--;
--- 2.6.12.3.clean/mm/page_alloc.c 2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/mm/page_alloc.c 2005-08-03 01:46:10.000000000 -0400
@@ -802,8 +802,8 @@
/* This allocation should allow future memory freeing. */
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
+ if ((((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+ && !in_interrupt()) || (gfp_mask & __GFP_MEMALLOC)) {
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
/* go through the zonelist yet again, ignoring mins */
for (i = 0; (z = zones[i]) != NULL; i++) {
--- 2.6.12.3.clean/net/core/dev.c 2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/core/dev.c 2005-08-03 01:42:46.000000000 -0400
@@ -1452,6 +1452,11 @@
struct softnet_data *queue;
unsigned long flags;
+ if (unlikely(is_memalloc_skb(skb))) {
+ netif_receive_skb(skb);
+ return NET_RX_CN_HIGH;
+ }
+
/* if netpoll wants it, pretend we never saw it */
if (netpoll_rx(skb))
return NET_RX_DROP;
--- 2.6.12.3.clean/net/core/skbuff.c 2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/core/skbuff.c 2005-08-03 01:36:50.000000000 -0400
@@ -355,7 +355,7 @@
n->nohdr = 0;
C(pkt_type);
C(ip_summed);
- C(priority);
+ n->priority = skb->priority & ~__GFP_MEMALLOC;
C(protocol);
C(security);
n->destructor = NULL;
@@ -411,7 +411,7 @@
new->sk = NULL;
new->dev = old->dev;
new->real_dev = old->real_dev;
- new->priority = old->priority;
+ new->priority = old->priority & ~__GFP_MEMALLOC;
new->protocol = old->protocol;
new->dst = dst_clone(old->dst);
#ifdef CONFIG_INET
--- 2.6.12.3.clean/net/ipv4/tcp_ipv4.c 2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/ipv4/tcp_ipv4.c 2005-08-02 21:35:54.000000000 -0400
@@ -1766,6 +1766,9 @@
if (!sk)
goto no_tcp_socket;
+ if (unlikely(is_memalloc_skb(skb)) && !is_memalloc_sock(sk))
+ goto discard_and_relse;
+
process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
* Re: [RFC] Net vm deadlock fix (preliminary)
From: Martin Josefsson @ 2005-08-03 6:59 UTC
To: Daniel Phillips; +Cc: netdev, linux-mm
On Wed, 3 Aug 2005, Daniel Phillips wrote:
> Hi,
>
> Here is a preliminary patch, not tested at all, just to give everybody a
> target to aim bricks at.
>
> * A new __GFP_MEMALLOC flag gives access to the memalloc reserve.
>
> * In dev_alloc_skb, if GFP_ATOMIC fails then try again with __GFP_MEMALLOC.
This might give some unwanted side effects... Normally a NIC driver
has an rx-ring with skbs that are allocated using dev_alloc_skb().
Before memory got low, the skbs in the ring were allocated with
GFP_ATOMIC; when packets are received, each skb is passed up the stack,
and if memory got low in the meantime (since the skb was allocated), the
checks for __GFP_MEMALLOC won't trigger as the flag isn't set in the skb.
And what's worse is that when memory is low, new skbs allocated with
dev_alloc_skb() will come from __GFP_MEMALLOC, and then they are placed
in the rx-ring just waiting to be used, which might not happen for a
while. One positive thing might be that at least some GFP_ATOMIC memory
is released in the beginning, when the skbs in the rx-ring are being
reallocated.
So when memory gets tight, you won't notice __GFP_MEMALLOC for as many
packets as the rx-ring contains skbs. But you do allocate skbs with
__GFP_MEMALLOC; they just get stuffed into the rx-ring and stay there
until used by incoming packets.
iirc, the e1000 driver defaults to an rx-ring size of 256 skbs.
One thing could be to utilize the skb copybreak that certain NIC drivers
implement. If the packet size is rather small, a new skb is allocated and
the data is copied from the skb in the rx-ring to the new skb, which then
carries the __GFP_MEMALLOC flag, while the GFP_ATOMIC skb stays in the
rx-ring. This could be used to avoid the problems above, but... it adds
overhead to all packets when memory isn't low, since all packets need to
be copied, and the copy is more expensive for larger packets, which slows
things down compared to not copying. If there's another way of detecting
the low memory situation other than a failing allocation, the copybreak
abuse could be enabled only when memory is low; that would probably get
the best of both worlds.
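Roughly, the copybreak path looks like this (a sketch, not actual tg3 code;
the COPYBREAK threshold and names are made up):

if (pkt_len < COPYBREAK) {
	struct sk_buff *nskb = dev_alloc_skb(pkt_len + 2);
	if (nskb) {
		skb_reserve(nskb, 2);	/* align the IP header */
		memcpy(skb_put(nskb, pkt_len), skb->data, pkt_len);
		netif_rx(nskb);	/* nskb carries the current alloc class */
		/* the GFP_ATOMIC skb stays in the rx-ring */
	}
}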
/Martin
* Re: [RFC] Net vm deadlock fix (preliminary)
From: Daniel Phillips @ 2005-08-03 17:36 UTC
To: Martin Josefsson; +Cc: netdev, linux-mm
On Wednesday 03 August 2005 16:59, Martin Josefsson wrote:
> On Wed, 3 Aug 2005, Daniel Phillips wrote:
> > Hi,
> >
> > Here is a preliminary patch, not tested at all, just to give everybody a
> > target to aim bricks at.
> >
> > * A new __GFP_MEMALLOC flag gives access to the memalloc reserve.
> >
> > * In dev_alloc_skb, if GFP_ATOMIC fails then try again with
> > __GFP_MEMALLOC.
>
> This might give some unwanted side effects...
You are right, and it was a bad idea to try to do this fix generically for all
drivers at once, see below.
> Normally a NIC driver
> has an rx-ring with skbs that are allocated using dev_alloc_skb().
> Before memory got low, the skbs in the ring were allocated with
> GFP_ATOMIC; when packets are received, each skb is passed up the stack,
> and if memory got low in the meantime (since the skb was allocated), the
> checks for __GFP_MEMALLOC won't trigger as the flag isn't set in the skb.
>
> And what's worse is that when memory is low, new skbs allocated with
> dev_alloc_skb() will come from __GFP_MEMALLOC, and then they are placed
> in the rx-ring just waiting to be used, which might not happen for a
> while. One positive thing might be that at least some GFP_ATOMIC memory
> is released in the beginning, when the skbs in the rx-ring are being
> reallocated.
>
> So when memory gets tight, you won't notice __GFP_MEMALLOC for as many
> packets as the rx-ring contains skbs. But you do allocate skbs with
> __GFP_MEMALLOC; they just get stuffed into the rx-ring and stay there
> until used by incoming packets.
> iirc, the e1000 driver defaults to an rx-ring size of 256 skbs.
Yep, yep and yep.
I can think of two ways to deal with this:
1) Mindlessly include the entire maximum memory usage of the rx-ring in
the reserve requirement (e.g., (maxskbs * (MTU + k)) / PAGE_SIZE).
2) Never refill the rx-ring from the reserve. Instead, if the skbs ever
run out (because e1000_alloc_rx_buffers had too many GFP_ATOMIC alloc
failures) then use __GFP_MEMALLOC instead of just giving up at that
point.
(1) is the quick fix, (2) is the proper fix. (2) also implies auditing every
net driver to ensure its usage is "block IO safe". I made a valiant attempt
to handle this in just __dev_alloc_skb, but really it just drives the point
home that there is no substitute for laying out the memory allocation
rules for drivers and making sure they are followed.
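To make (2) concrete, the refill loop would do something like this (a sketch;
rx_ring_fill() and MIN_RX_FILL are made-up names, and it assumes plain
dev_alloc_skb no longer falls back to the reserve by itself):

struct sk_buff *skb = dev_alloc_skb(adapter->rx_buffer_len);

if (!skb && rx_ring_fill(rx_ring) < MIN_RX_FILL)
	skb = __dev_alloc_skb(adapter->rx_buffer_len,
			      GFP_ATOMIC | __GFP_MEMALLOC);
if (!skb)
	break;	/* retry on the next refill pass */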
We can soften the pain of this by classifying net drivers as block-io-safe or
not with a capability flag, and fix them up incrementally. The resulting
memory usage audit can be nothing but good, I bet the code gets smaller too.
It would be easy to log a warning any time network block IO is attempted over
an unfixed driver. This will be just as unsafe as it ever was and exactly as
efficient.
If the driver is block-io-safe, it will use __GFP_MEMALLOC (if we ultimately
decide to use that instead of mempools) and the fallback allocation+delivery
path. The fast path will be the same as it ever was, because packets taking
the fallback path will be ones that would have been dropped before. Packets
allocated from reserve that pass the check in the protocol driver are ones
that would have been dropped _in the block io path_, so rescuing these not
only fixes the deadlock, but is a huge efficiency win.
> One thing could be to utilize the skb copybreak that certain NIC drivers
> implement. If the packet size is rather small, a new skb is allocated and
> the data is copied from the skb in the rx-ring to the new skb, which then
> carries the __GFP_MEMALLOC flag, while the GFP_ATOMIC skb stays in the
> rx-ring. This could be used to avoid the problems above, but... it adds
> overhead to all packets when memory isn't low, since all packets need to
> be copied, and the copy is more expensive for larger packets, which slows
> things down compared to not copying.
I have not looked at the copybreak feature (its memory requirements
certainly need to be analyzed), but my suggestion above does not hurt the
fast path and does divide the allocation class properly between
__GFP_MEMALLOC and GFP_ATOMIC, as you noticed is necessary.
(Note that, in all of this, there is the temptation to substitute mempools for
__GFP_MEMALLOC. This doesn't change the analysis at all, it just changes the
details of the API. We still have to fix the drivers and protocols in all
the same places.)
> If there's another way of detecting the low
> memory situation other than a failing allocation, the copybreak abuse
> could be enabled only when memory is low; that would probably get the
> best of both worlds.
Checking whether memory is low is a somewhat expensive operation because we
have to walk through all the allocation zones according to the allocation
class requested and check the bounds. So at the moment, non-vm code detects
low memory only through failed allocations. We should have a separate,
efficient api for detecting "out of normal memory", but the immediate issue is
killing the network block io deadlock.
Regards,
Daniel
* Re: [RFC] Net vm deadlock fix (preliminary)
From: Martin Josefsson @ 2005-08-03 18:21 UTC
To: Daniel Phillips; +Cc: netdev, linux-mm
[-- Attachment #1: Type: text/plain, Size: 1930 bytes --]
On Thu, 2005-08-04 at 03:36 +1000, Daniel Phillips wrote:
> I can think of two ways to deal with this:
>
> 1) Mindlessly include the entire maximum memory usage of the rx-ring in
> the reserve requirement (e.g., (maxskbs * (MTU + k)) / PAGE_SIZE).
Would be dependent on the number of interfaces as well.
> 2) Never refill the rx-ring from the reserve. Instead, if the skbs ever
> run out (because e1000_alloc_rx_buffers had too many GFP_ATOMIC alloc
> failures) then use __GFP_MEMALLOC instead of just giving up at that
> point.
This is how e1000 currently works (suggestions have been made to change
this to work like the tg3 driver, which has copybreak support, etc.):
1. Allocate skbs filling the rx-ring as much as possible
2. tell hardware there's new skbs to DMA packets into
3. note that an skb has been filled with data (interrupt or polling)
4. remove that skb from the rx-ring
5. pass the skb up the stack
6. goto 3 if quota hasn't been filled
7. goto 1 if quota has been filled
The skbs allocated to fill the rx-ring are the _same_ skbs that are
passed up the stack. So you won't see __GFP_MEMALLOC allocated skbs
until RX_RINGSIZE packets after we got low on memory (fifo ring). I
can't really say I see how #2 above solves that since we _have_ to
allocate skbs to fill the rx-ring, otherwise the NIC won't have anywhere
to put the received packets and will thus drop them in hardware.
Or are you suggesting to let the rx-ring deplete until completely empty
(or nearly empty) if we are low on memory, and only then start falling
back to allocating with __GFP_MEMALLOC if GFP_ATOMIC fails?
That could and probably would cause hardware to drop packets because it
can run out of fresh rx-descriptors before we manage to start allocating
with __GFP_MEMALLOC if the packet rate is high; at least it makes it much
more likely to happen.
--
/Martin
* Re: [RFC] Net vm deadlock fix (preliminary)
From: Daniel Phillips @ 2005-08-03 20:06 UTC
To: Martin Josefsson; +Cc: netdev, linux-mm
On Thursday 04 August 2005 04:21, Martin Josefsson wrote:
> On Thu, 2005-08-04 at 03:36 +1000, Daniel Phillips wrote:
> > I can think of two ways to deal with this:
> >
> > 1) Mindlessly include the entire maximum memory usage of the rx-ring in
> > the reserve requirement (e.g., (maxskbs * (MTU + k)) / PAGE_SIZE).
>
> Would be dependent on the numberof interfaces as well.
Actually, just the number of interfaces being used by network block IO.
Theoretically unbounded, but not an immediate problem. This goes on the
"must fix in order to be perfect" list.
> > 2) Never refill the rx-ring from the reserve. Instead, if the skbs
> > ever run out (because e1000_alloc_rx_buffers had too many GFP_ATOMIC
> > alloc failures) then use __GFP_MEMALLOC instead of just giving up at that
> > point.
>
> This is how e1000 currently works (suggestions have been made to change
> this to work like the tg3 driver does which has copybreak support etc)
>
> 1. Allocate skbs filling the rx-ring as much as possible
> 2. tell hardware there's new skbs to DMA packets into
> 3. note that an skb has been filled with data (interrupt or polling)
> 4. remove that skb from the rx-ring
> 5. pass the skb up the stack
> 6. goto 3 if quota hasn't been filled
> 7. goto 1 if quota has been filled
Thanks, I originally missed the part about the hardware requiring at least one
skb in the ring, or else it will drop a packet.
> The skbs allocated to fill the rx-ring are the _same_ skbs that are
> passed up the stack. So you won't see __GFP_MEMALLOC allocated skbs
> until RX_RINGSIZE packets after we got low on memory (fifo ring). I
> can't really say I see how #2 above solves that since we _have_ to
> allocate skbs to fill the rx-ring, otherwise the NIC won't have anywhere
> to put the received packets and will thus drop them in hardware.
>
> Or are you suggesting to let the rx-ring deplete until completely empty
> (or nearly empty) if we are low on memory, and only then start falling
> back to allocating with __GFP_MEMALLOC if GFP_ATOMIC fails?
Yes, exactly. Except as you point out, "completely empty" won't do. We will
use __GFP_MEMALLOC (or alternatively, a mempool) to ensure that the rx-ring
always has at least some minimum number N of packets in it, and reserve
N*MTU bytes' worth of pages for the interface. Actually, we will reserve
considerably more than that, because we want to be able to have a fairly
large number of packets in flight on the block IO path, particularly under
memory pressure.
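(For scale, purely as an assumed example: N = 64 with a 1500-byte MTU comes
to about 96KB, roughly 24 pages, per interface.)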
It doesn't actually matter which packet we return to the reserve, so we might
want to just count the number of __GFP_MEMALLOC packets in the ring and
deliver on the direct (non-softnet) path until the count drops to zero.
This implies a slightly different approach to the is_memalloc_skb flagging.
It is just a refinement though; it does not really matter when we reclaim a
reserve buffer as long as we are certain to reclaim it.
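A minimal sketch of that counting refinement (memalloc_count would be a new
net_device field, which is an assumption, not part of the patch):

static inline void memalloc_get(struct net_device *dev)
{
	atomic_inc(&dev->memalloc_count);	/* reserve skb enters the ring */
}

static inline void memalloc_put(struct net_device *dev)
{
	atomic_dec(&dev->memalloc_count);	/* any freed skb repays it */
}

static inline int memalloc_outstanding(struct net_device *dev)
{
	return atomic_read(&dev->memalloc_count) != 0;
}

netif_rx would then test memalloc_outstanding(skb->dev) instead of the
per-skb flag.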
> That could and probably would cause hardware to drop packets because it
> can run out of fresh rx-descriptors before we manage to start allocating
> with __GFP_MEMALLOC if the packetrate is high, at least it makes it much
> more likely to happen.
It is a matter of setting N, the minimum number of packets in the rx-ring,
high enough, no?
OK, the next step is to reroll the patch and make it specific to e1000, which
I happen to have here and can test. Also, I will use a mempool this time, so
we will see how that code looks.
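For reference, the mempool variant would look something like this (a sketch
under assumptions, not the promised patch; N and the element size are
placeholders):

static void *rx_skb_alloc(unsigned int gfp_mask, void *data)
{
	return alloc_skb((unsigned long)data, gfp_mask);
}

static void rx_skb_free(void *skb, void *data)
{
	kfree_skb(skb);
}

/* at device open: reserve N rx buffers of MTU plus headroom */
mempool_t *pool = mempool_create(N, rx_skb_alloc, rx_skb_free,
				 (void *)(unsigned long)(1500 + 16));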
Regards,
Daniel
* Re: [RFC] Net vm deadlock fix (preliminary)
From: Daniel Phillips @ 2005-08-04 21:51 UTC
To: Martin Josefsson; +Cc: netdev, linux-mm
[-- Attachment #1: Type: text/plain, Size: 5373 bytes --]
Hi,
I spent the last day mulling things over and doing research. It seems to me
that the patch as first posted is correct and solves the deadlock, except
that some uses of __GFP_MEMALLOC in __dev_alloc_skb may escape into contexts
where the reserve is not guaranteed to be reclaimed. It may be that this
does not actually happen, but there is enough different usage that I would
rather err on the side of caution just now, and offer a variant called, e.g.,
dev_alloc_skb_reserve (as in the attached patch) so that drivers will
explicitly have to choose to use it (or supply the flag to __dev_alloc_skb).
This is just a stack of inline functions so there should be no extra object
code. The dev_alloc_skb_reserve variant can go away in time, but for now it
does no harm.
A minor cleanup: somebody (Rik) complained about his bleeding eyes after
reading my side-effecty alloc_skb expression, so that was rearranged in a way
that should optimize to the same thing.
On the "first write the patch, then do the research" principle, the entire
thread on this topic from the ksummit-2005-discuss mailing list is a good
read:
http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/thread.html#186
Matt, you are close to the truth here:
http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000242.html
but this part isn't right: "it's important to note here that "making progress"
may require M acknowledgements to N packets representing a single IO. So we
need separate send and acknowledge pools for each SO_MEMALLOC socket so that
we don't find ourselves wedged with M-1 available mempool slots when we're
waiting on ACKs". This erroneously assumes that mempools throttle the block
IO traffic. In fact, the throttling _must_ take place higher up, in the
block IO stack. The block driver that submits the network IO must
pre-account any per-request resources and block until sufficient resources
become available. So the accounting would include both space for transmit
and acknowledge, and the network block IO protocol must be designed to obey
that accounting. (I will wave my hands at the question of how we arrange for
low-level components to communicate their resource needs to high-level
throttlers, just for now.)
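To show what I am waving my hands at, the submission-side throttle reduces
to something like this (all names hypothetical; units cover both transmit
and acknowledge resources for one request):

static atomic_t reserve_units = ATOMIC_INIT(MAX_UNITS);
static DECLARE_WAIT_QUEUE_HEAD(reserve_wait);

static int try_take_units(int units)
{
	if (atomic_sub_return(units, &reserve_units) >= 0)
		return 1;
	atomic_add(units, &reserve_units);	/* undo and wait */
	return 0;
}

static void take_units(int units)	/* before submitting a request */
{
	wait_event(reserve_wait, try_take_units(units));
}

static void put_units(int units)	/* when the ack arrives */
{
	atomic_add(units, &reserve_units);
	wake_up(&reserve_wait);
}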
Andrea, also getting close:
http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000200.html
But there is no need to be random. Short of actually overflowing the input
ring buffer, we can be precise about accepting all block IO packets and
dropping non-blockio traffic as necessary.
Rik, not bad:
http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000218.html
particularly for deducing it from first principles without actually looking at
the network code ;-) It is even the same socket flag name as I settled on
(SO_MEMALLOC). But step 4 veers off course: out of order does not matter.
And the conclusion that we can throttle here by dropping non-blockio packets
is not right: the packets that we got from reserve can still live an
arbitrary time in the protocol stack, so we could still exhaust the reserve
and be back to the same bad old deadlock conditions.
Everybody noticed that dropping non-blockio packets is key, and everybody
missed the fact that softnet introduces additional queues that need
throttling (which can't be done sanely) or bypassing. Almost everybody
noticed that throttling in the block IO submission path is non-optional.
Everybody thought that mempool is the one true way of reserving memory. I am
not so sure, though I still intend to produce a mempool variant of the patch.
One problem I see with mempool is that it not only reserves resources, but
pins them. If the kernel is full of mempools pinning memory pages here and
there, physical defragmentation gets that much harder and the buddy tree will
fragment that much sooner. The __GFP_MEMALLOC interface does not have this
problem because pages stay in the global pool. So the jury is still out on
which method is better.
Obviously, to do the job properly, __GFP_MEMALLOC would need a way of resizing
the memalloc reserve as users are loaded and unloaded. Such an interface can
be very lightweight. I will cook one up just to demonstrate this.
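Roughly like this (a sketch; it assumes setup_per_zone_pages_min, which is
static to mm/page_alloc.c in 2.6.12, were made callable from here):

void adjust_memalloc_reserve(int pages)
{
	min_free_kbytes += pages << (PAGE_SHIFT - 10);
	setup_per_zone_pages_min();
}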
Now, the scheme in my patch does the job and I think it does it in a way that
works for all drivers, even e1000 (by method 1. in the thread above). But we
could tighten this up a little by noticing that it doesn't actually matter
which socket buffer we return to the pool as long as we are sure to return
the same amount of memory as we withdrew. Therefore, we could just account
the number of pages alloc_skb withdraws, and the number that freeing a packet
returns. The e1000 driver would look at this number to determine whether to
mark a packet as from_reserve or not. That way, the e1000 driver could set
things in motion to release reserve resources sooner, rather than waiting for
certain specially flagged skbs to work their way around the rx-ring. This
also makes it easier for related systems (such as delivery hooks) to draw
from a single pool with accurate accounting.
I will follow the current simple per-skb approach all the way to the end on
the "perfect is the enemy of good enough" principle. In the long run, page
accounting is the way to go.
Next, on to actually trying this.
Regards,
Daniel
[-- Attachment #2: net.memalloc-2.6.12.3-2 --]
[-- Type: text/x-diff, Size: 4778 bytes --]
--- 2.6.12.3.clean/include/linux/gfp.h 2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/linux/gfp.h 2005-08-04 00:12:48.000000000 -0400
@@ -39,6 +39,7 @@
#define __GFP_COMP 0x4000u /* Add compound page metadata */
#define __GFP_ZERO 0x8000u /* Return zeroed page on success */
#define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
+#define __GFP_MEMALLOC 0x20000u /* Use emergency reserves */
#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
--- 2.6.12.3.clean/include/linux/skbuff.h 2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/linux/skbuff.h 2005-08-04 13:46:35.000000000 -0400
@@ -969,7 +969,6 @@
kfree_skb(skb);
}
-#ifndef CONFIG_HAVE_ARCH_DEV_ALLOC_SKB
/**
* __dev_alloc_skb - allocate an skbuff for sending
* @length: length to allocate
@@ -982,17 +981,17 @@
*
* %NULL is returned in there is no free memory.
*/
-static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
- int gfp_mask)
+static inline struct sk_buff *__dev_alloc_skb(unsigned length, int gfp_mask)
{
- struct sk_buff *skb = alloc_skb(length + 16, gfp_mask);
+ int opaque = 16, size = length + opaque;
+ struct sk_buff *skb = alloc_skb(size, gfp_mask & ~__GFP_MEMALLOC);
+
+ if (unlikely(!skb) && (gfp_mask & __GFP_MEMALLOC))
+ skb = alloc_skb(size, gfp_mask);
if (likely(skb))
- skb_reserve(skb, 16);
+ skb_reserve(skb, opaque);
return skb;
}
-#else
-extern struct sk_buff *__dev_alloc_skb(unsigned int length, int gfp_mask);
-#endif
/**
* dev_alloc_skb - allocate an skbuff for sending
@@ -1011,6 +1010,16 @@
return __dev_alloc_skb(length, GFP_ATOMIC);
}
+static inline struct sk_buff *dev_alloc_skb_reserve(unsigned int length)
+{
+ return __dev_alloc_skb(length, gfp_mask | __GFP_MEMALLOC);
+}
+
+static inline int is_memalloc_skb(struct sk_buff *skb)
+{
+ return !!(skb->priority & __GFP_MEMALLOC);
+}
+
/**
* skb_cow - copy header of skb when it is required
* @skb: buffer to cow
--- 2.6.12.3.clean/include/net/sock.h 2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/net/sock.h 2005-08-04 00:12:49.000000000 -0400
@@ -382,6 +382,7 @@
SOCK_NO_LARGESEND, /* whether to sent large segments or not */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_MEMALLOC, /* protocol can use memalloc reserve */
};
static inline void sock_set_flag(struct sock *sk, enum sock_flags flag)
@@ -399,6 +400,11 @@
return test_bit(flag, &sk->sk_flags);
}
+static inline int is_memalloc_sock(struct sock *sk)
+{
+ return sock_flag(sk, SOCK_MEMALLOC);
+}
+
static inline void sk_acceptq_removed(struct sock *sk)
{
sk->sk_ack_backlog--;
--- 2.6.12.3.clean/mm/page_alloc.c 2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/mm/page_alloc.c 2005-08-04 00:12:49.000000000 -0400
@@ -802,8 +802,8 @@
/* This allocation should allow future memory freeing. */
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
+ if ((((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+ && !in_interrupt()) || (gfp_mask & __GFP_MEMALLOC)) {
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
/* go through the zonelist yet again, ignoring mins */
for (i = 0; (z = zones[i]) != NULL; i++) {
--- 2.6.12.3.clean/net/core/dev.c 2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/core/dev.c 2005-08-04 00:12:49.000000000 -0400
@@ -1452,6 +1452,11 @@
struct softnet_data *queue;
unsigned long flags;
+ if (unlikely(is_memalloc_skb(skb))) {
+ netif_receive_skb(skb);
+ return NET_RX_CN_HIGH;
+ }
+
/* if netpoll wants it, pretend we never saw it */
if (netpoll_rx(skb))
return NET_RX_DROP;
--- 2.6.12.3.clean/net/core/skbuff.c 2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/core/skbuff.c 2005-08-04 00:12:49.000000000 -0400
@@ -355,7 +355,7 @@
n->nohdr = 0;
C(pkt_type);
C(ip_summed);
- C(priority);
+ n->priority = skb->priority & ~__GFP_MEMALLOC;
C(protocol);
C(security);
n->destructor = NULL;
@@ -411,7 +411,7 @@
new->sk = NULL;
new->dev = old->dev;
new->real_dev = old->real_dev;
- new->priority = old->priority;
+ new->priority = old->priority & ~__GFP_MEMALLOC;
new->protocol = old->protocol;
new->dst = dst_clone(old->dst);
#ifdef CONFIG_INET
--- 2.6.12.3.clean/net/ipv4/tcp_ipv4.c 2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/ipv4/tcp_ipv4.c 2005-08-04 00:12:49.000000000 -0400
@@ -1766,6 +1766,9 @@
if (!sk)
goto no_tcp_socket;
+ if (unlikely(is_memalloc_skb(skb)) && !is_memalloc_sock(sk))
+ goto discard_and_relse;
+
process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
* Re: [RFC] Net vm deadlock fix (preliminary)
From: Daniel Phillips @ 2005-08-04 22:09 UTC
To: Martin Josefsson; +Cc: netdev, linux-mm
- return __dev_alloc_skb(length, gfp_mask | __GFP_MEMALLOC);
+ return __dev_alloc_skb(length, GFP_ATOMIC|__GFP_MEMALLOC);
Regards,
Daniel
* Re: [RFC] Net vm deadlock fix (preliminary)
From: Daniel Phillips @ 2005-08-04 22:45 UTC
To: linux-mm
Hmm, the linux-mm mailer stripped off the first line of my last mail, which
was "Whoops:". Some fancy pseudo-header parsing, perhaps?
Regards,
Daniel