linux-mm.kvack.org archive mirror
* [RFC] Net vm deadlock fix (preliminary)
@ 2005-08-03  6:57 Daniel Phillips
  2005-08-03  6:59 ` Martin Josefsson
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Phillips @ 2005-08-03  6:57 UTC (permalink / raw)
  To: netdev, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2439 bytes --]

Hi,

Here is a preliminary patch, not tested at all, just to give everybody a 
target to aim bricks at.

  * A new __GFP_MEMALLOC flag gives access to the memalloc reserve.

  * In dev_alloc_skb, if GFP_ATOMIC fails then try again with __GFP_MEMALLOC.

  * We know an skb was allocated from reserve if we see __GFP_MEMALLOC in the
    (misnamed) priority field.

  * When a driver uses netif_rx to deliver the packet to the protocol layer,
    if the packet was allocated from the reserve it is delivered directly to
    the protocol layer, otherwise queue the packet via softnet.

  * When the protocol handler (tcp/ipv4 in this case) looks up the socket,
    if the packet was allocated from reserve but the socket is not serving
    vm traffic, the packet is discarded.

There are some users of __dev_alloc_skb that inherit the new memalloc behavior 
for free.  This is probably not a good thing.  There are a dozen or so users 
to check... later.

I claimed earlier that an advantage of using the memalloc reserve over a 
mempool is that the pool becomes available to the whole call chain of the 
user.  This isn't true in a softirq, sorry.  Maybe we could make it true, 
that's another question.  Anyway, memalloc reserve vs mempool is a detail at 
this point.

There is a big hole here through which precious reserve memory can escape: if 
the network driver is allocating packets from reserve but a protocol handler 
does not test such packets, things will deteriorate quickly.  The easiest 
thing to do is make sure all protocols know about this logic.  They probably 
all need to anyway.

A memalloc task (one handling IO on behalf of the vm) will set the SO_MEMALLOC 
flag after creating the socket.  The memalloc task will throttle the amount 
of traffic in flight to keep the maximum reserve usage to some reasonable 
amount.  (It will be necessary to get more precise about this at some point.) 
The memalloc task itself will be in PF_MEMALLOC mode when it uses this 
socket.

This patch only covers socket input, not output.  As you can see, the fast 
path is not compromised at all, and even when the low memory path triggers, 
efficiency only falls off a little (allocations may take longer and we bypass 
the softnet optimization).  But the thing is, we don't fall back to 
single-request-at-a-time handling, which is exactly what you don't want to do 
when the vm is desperately trying to clean memory.

Regards,

Daniel

[-- Attachment #2: net.memalloc-2.6.12.3 --]
[-- Type: text/x-diff, Size: 4402 bytes --]

--- 2.6.12.3.clean/include/linux/gfp.h	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/linux/gfp.h	2005-08-03 01:12:33.000000000 -0400
@@ -39,6 +39,7 @@
 #define __GFP_COMP	0x4000u	/* Add compound page metadata */
 #define __GFP_ZERO	0x8000u	/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
+#define __GFP_MEMALLOC  0x20000u /* Use emergency reserves */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
--- 2.6.12.3.clean/include/linux/skbuff.h	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/linux/skbuff.h	2005-08-03 01:53:43.000000000 -0400
@@ -969,7 +969,6 @@
 		kfree_skb(skb);
 }
 
-#ifndef CONFIG_HAVE_ARCH_DEV_ALLOC_SKB
 /**
  *	__dev_alloc_skb - allocate an skbuff for sending
  *	@length: length to allocate
@@ -985,14 +984,14 @@
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
 					      int gfp_mask)
 {
-	struct sk_buff *skb = alloc_skb(length + 16, gfp_mask);
+	struct sk_buff *skb = alloc_skb(length += 16, gfp_mask);
+
+	if (unlikely(!skb))
+		skb = alloc_skb(length, gfp_mask|__GFP_MEMALLOC);
 	if (likely(skb))
 		skb_reserve(skb, 16);
 	return skb;
 }
-#else
-extern struct sk_buff *__dev_alloc_skb(unsigned int length, int gfp_mask);
-#endif
 
 /**
  *	dev_alloc_skb - allocate an skbuff for sending
@@ -1011,6 +1010,11 @@
 	return __dev_alloc_skb(length, GFP_ATOMIC);
 }
 
+static inline int is_memalloc_skb(struct sk_buff *skb)
+{
+	return !!(skb->priority & __GFP_MEMALLOC);
+}
+
 /**
  *	skb_cow - copy header of skb when it is required
  *	@skb: buffer to cow
--- 2.6.12.3.clean/include/net/sock.h	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/net/sock.h	2005-08-03 01:20:56.000000000 -0400
@@ -382,6 +382,7 @@
 	SOCK_NO_LARGESEND, /* whether to sent large segments or not */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_MEMALLOC, /* protocol can use memalloc reserve */
 };
 
 static inline void sock_set_flag(struct sock *sk, enum sock_flags flag)
@@ -399,6 +400,11 @@
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int is_memalloc_sock(struct sock *sk)
+{
+	return sock_flag(sk, SOCK_MEMALLOC);
+}
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	sk->sk_ack_backlog--;
--- 2.6.12.3.clean/mm/page_alloc.c	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/mm/page_alloc.c	2005-08-03 01:46:10.000000000 -0400
@@ -802,8 +802,8 @@
 
 	/* This allocation should allow future memory freeing. */
 
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
+	if ((((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+			&& !in_interrupt()) || (gfp_mask & __GFP_MEMALLOC)) {
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 			/* go through the zonelist yet again, ignoring mins */
 			for (i = 0; (z = zones[i]) != NULL; i++) {
--- 2.6.12.3.clean/net/core/dev.c	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/core/dev.c	2005-08-03 01:42:46.000000000 -0400
@@ -1452,6 +1452,11 @@
 	struct softnet_data *queue;
 	unsigned long flags;
 
+        if (unlikely(is_memalloc_skb(skb))) {
+                netif_receive_skb(skb);
+                return NET_RX_CN_HIGH;
+        }
+
 	/* if netpoll wants it, pretend we never saw it */
 	if (netpoll_rx(skb))
 		return NET_RX_DROP;
--- 2.6.12.3.clean/net/core/skbuff.c	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/core/skbuff.c	2005-08-03 01:36:50.000000000 -0400
@@ -355,7 +355,7 @@
 	n->nohdr = 0;
 	C(pkt_type);
 	C(ip_summed);
-	C(priority);
+	n->priority = skb->priority & ~__GFP_MEMALLOC;
 	C(protocol);
 	C(security);
 	n->destructor = NULL;
@@ -411,7 +411,7 @@
 	new->sk		= NULL;
 	new->dev	= old->dev;
 	new->real_dev	= old->real_dev;
-	new->priority	= old->priority;
+	new->priority	= old->priority & ~__GFP_MEMALLOC;
 	new->protocol	= old->protocol;
 	new->dst	= dst_clone(old->dst);
 #ifdef CONFIG_INET
--- 2.6.12.3.clean/net/ipv4/tcp_ipv4.c	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/ipv4/tcp_ipv4.c	2005-08-02 21:35:54.000000000 -0400
@@ -1766,6 +1766,9 @@
 	if (!sk)
 		goto no_tcp_socket;
 
+	if (unlikely(is_memalloc_skb(skb)) && !is_memalloc_sock(sk))
+		goto discard_and_relse;
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Net vm deadlock fix (preliminary)
  2005-08-03  6:57 [RFC] Net vm deadlock fix (preliminary) Daniel Phillips
@ 2005-08-03  6:59 ` Martin Josefsson
  2005-08-03 17:36   ` Daniel Phillips
  0 siblings, 1 reply; 8+ messages in thread
From: Martin Josefsson @ 2005-08-03  6:59 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: netdev, linux-mm

On Wed, 3 Aug 2005, Daniel Phillips wrote:

> Hi,
>
> Here is a preliminary patch, not tested at all, just to give everybody a
> target to aim bricks at.
>
>   * A new __GFP_MEMALLOC flag gives access to the memalloc reserve.
>
>   * In dev_alloc_skb, if GFP_ATOMIC fails then try again with __GFP_MEMALLOC.

This might give some unwanted side effects... Normally a NIC driver
has an rx-ring of skbs that are allocated using dev_alloc_skb().
Before memory got low, the skbs in the ring were allocated with
GFP_ATOMIC; when packets are received, the skb is passed up the stack,
and if memory got low in the meantime (since the skb was allocated), the
checks for __GFP_MEMALLOC won't trigger as the flag isn't set in the skb.

And what's worse, when memory is low, new packets allocated with
dev_alloc_skb() will be allocated with __GFP_MEMALLOC, and then they are
placed in the rx-ring just waiting to be used, which might not happen for a
while.  One positive thing might be that at least some GFP_ATOMIC memory is
released in the beginning, when the skbs in the rx-ring are being
reallocated.

So when memory gets tight, you won't see __GFP_MEMALLOC on as many
packets as the rx-ring contains skbs.  But you do allocate skbs with
__GFP_MEMALLOC; they just get stuffed into the rx-ring and will stay
there until used by incoming packets.
IIRC, the e1000 driver defaults to an rx-ring size of 256 skbs.

One option could be to utilize the skb copybreak that certain NIC drivers
implement: if the packet is rather small, a new skb is allocated and
the data is copied from the skb in the rx-ring to the new skb, which then
carries __GFP_MEMALLOC while the GFP_ATOMIC skb stays in the rx-ring.
This could avoid the problems above, but... it adds overhead to
all packets when memory isn't low, since every packet needs to be copied,
and the copy is more expensive for larger packets, which slows things down
compared to not copying.  If there were another way of detecting the
low-memory situation other than a failing allocation, the copybreak abuse
could be enabled only when memory is low; that would probably get the best
of both worlds.

/Martin
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* Re: [RFC] Net vm deadlock fix (preliminary)
  2005-08-03  6:59 ` Martin Josefsson
@ 2005-08-03 17:36   ` Daniel Phillips
  2005-08-03 18:21     ` Martin Josefsson
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Phillips @ 2005-08-03 17:36 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: netdev, linux-mm

On Wednesday 03 August 2005 16:59, Martin Josefsson wrote:
> On Wed, 3 Aug 2005, Daniel Phillips wrote:
> > Hi,
> >
> > Here is a preliminary patch, not tested at all, just to give everybody a
> > target to aim bricks at.
> >
> >   * A new __GFP_MEMALLOC flag gives access to the memalloc reserve.
> >
> >   * In dev_alloc_skb, if GFP_ATOMIC fails then try again with
> > __GFP_MEMALLOC.
>
> This might give some unwanted side effects... 

You are right, and it was a bad idea to try to do this fix generically for all 
drivers at once, see below.

> Normally a NIC driver
> has an rx-ring of skbs that are allocated using dev_alloc_skb().
> Before memory got low, the skbs in the ring were allocated with
> GFP_ATOMIC; when packets are received, the skb is passed up the stack,
> and if memory got low in the meantime (since the skb was allocated), the
> checks for __GFP_MEMALLOC won't trigger as the flag isn't set in the skb.
>
> And what's worse, when memory is low, new packets allocated with
> dev_alloc_skb() will be allocated with __GFP_MEMALLOC, and then they are
> placed in the rx-ring just waiting to be used, which might not happen for a
> while.  One positive thing might be that at least some GFP_ATOMIC memory is
> released in the beginning, when the skbs in the rx-ring are being
> reallocated.
>
> So when memory gets tight, you won't see __GFP_MEMALLOC on as many
> packets as the rx-ring contains skbs.  But you do allocate skbs with
> __GFP_MEMALLOC; they just get stuffed into the rx-ring and will stay
> there until used by incoming packets.
> IIRC, the e1000 driver defaults to an rx-ring size of 256 skbs.

Yep, yep and yep.

I can think of two ways to deal with this:

  1) Mindlessly include the entire maximum memory usage of the rx-ring in
     the reserve requirement (e.g., (maxskbs * (MTU + k)) / PAGE_SIZE).

  2) Never refill the rx-ring from the reserve.  Instead, if the skbs ever
     run out (because e1000_alloc_rx_buffers had too many GFP_ATOMIC alloc
     failures) then use __GFP_MEMALLOC instead of just giving up at that
     point.
     
(1) is the quick fix, (2) is the proper fix.  (2) also implies auditing every 
net driver to ensure its usage is "block IO safe".  I made a valiant attempt 
to handle this in just __dev_alloc_skb, but really it just drives the point 
home that there is no substitute for laying out the memory allocation 
rules for drivers and making sure they are followed.

We can soften the pain of this by classifying net drivers as block-io-safe or 
not with a capability flag, and fix them up incrementally.  The resulting 
memory usage audit can be nothing but good, I bet the code gets smaller too.  
It would be easy to log a warning any time network block IO is attempted over 
an unfixed driver.  This will be just as unsafe as it ever was and exactly as 
efficient.

If the driver is block-io-safe, it will use __GFP_MEMALLOC (if we ultimately 
decide to use that instead of mempools) and the fallback allocation+delivery 
path.  The fast path will be the same as it ever was, because packets taking 
the fallback path will be ones that would have been dropped before.  Packets 
allocated from reserve that pass the check in the protocol driver are ones 
that would have been dropped _in the block io path_, so rescuing these not 
only fixes the deadlock, but is a huge efficiency win.

> One option could be to utilize the skb copybreak that certain NIC drivers
> implement: if the packet is rather small, a new skb is allocated and
> the data is copied from the skb in the rx-ring to the new skb, which then
> carries __GFP_MEMALLOC while the GFP_ATOMIC skb stays in the rx-ring.
> This could avoid the problems above, but... it adds overhead to
> all packets when memory isn't low, since every packet needs to be copied,
> and the copy is more expensive for larger packets, which slows things down
> compared to not copying.

I have not looked at the copybreak feature (its memory requirements 
certainly need to be analyzed) but my suggestion above does not hurt the 
fast path and does divide the allocation class properly between 
__GFP_MEMALLOC and GFP_ATOMIC, as you noticed is necessary.

(Note that, in all of this, there is the temptation to substitute mempools for 
__GFP_MEMALLOC.  This doesn't change the analysis at all, it just changes the 
details of the API.  We still have to fix the drivers and protocols in all 
the same places.)

> If there were another way of detecting the low-memory situation other
> than a failing allocation, the copybreak abuse could be enabled only when
> memory is low; that would probably get the best of both worlds.

Checking whether memory is low is a somewhat expensive operation because we 
have to walk through all the allocation zones according to the allocation 
class requested and check the bounds.  So at the moment, non-vm code detects 
low memory only through failed allocations.  We should have a separate, 
efficient api for detecting "out of normal memory", but the immediate issue is 
killing the network block io deadlock.

Regards,

Daniel


* Re: [RFC] Net vm deadlock fix (preliminary)
  2005-08-03 17:36   ` Daniel Phillips
@ 2005-08-03 18:21     ` Martin Josefsson
  2005-08-03 20:06       ` Daniel Phillips
  0 siblings, 1 reply; 8+ messages in thread
From: Martin Josefsson @ 2005-08-03 18:21 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: netdev, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1930 bytes --]

On Thu, 2005-08-04 at 03:36 +1000, Daniel Phillips wrote:

> I can think of two ways to deal with this:
> 
>   1) Mindlessly include the entire maximum memory usage of the rx-ring in
>      the reserve requirement (e.g., (maxskbs * (MTU + k)) / PAGE_SIZE).

Would be dependent on the number of interfaces as well.

>   2) Never refill the rx-ring from the reserve.  Instead, if the skbs ever
>      run out (because e1000_alloc_rx_buffers had too many GFP_ATOMIC alloc
>      failures) then use __GFP_MEMALLOC instead of just giving up at that
>      point.

This is how e1000 currently works (suggestions have been made to change
this to work like the tg3 driver, which has copybreak support etc.)

1. Allocate skbs filling the rx-ring as much as possible
2. tell hardware there's new skbs to DMA packets into
3. note that an skb has been filled with data (interrupt or polling)
4. remove that skb from the rx-ring
5. pass the skb up the stack
6. goto 3 if quota hasn't been filled
7. goto 1 if quota has been filled

The skbs allocated to fill the rx-ring are the _same_ skbs that are
passed up the stack. So you won't see __GFP_MEMALLOC allocated skbs
until RX_RINGSIZE packets after we got low on memory (fifo ring). I
can't really say I see how #2 above solves that since we _have_ to
allocate skbs to fill the rx-ring, otherwise the NIC won't have anywhere
to put the received packets and will thus drop them in hardware.

Or are you suggesting to let the rx-ring deplete until completely empty
(or nearly empty) if we are low on memory, and only then start falling
back to allocating with __GFP_MEMALLOC if GFP_ATOMIC fails?
That could and probably would cause hardware to drop packets, because it
can run out of fresh rx-descriptors before we manage to start allocating
with __GFP_MEMALLOC if the packet rate is high; at least it makes it much
more likely to happen.

-- 
/Martin

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


* Re: [RFC] Net vm deadlock fix (preliminary)
  2005-08-03 18:21     ` Martin Josefsson
@ 2005-08-03 20:06       ` Daniel Phillips
  2005-08-04 21:51         ` Daniel Phillips
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Phillips @ 2005-08-03 20:06 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: netdev, linux-mm

On Thursday 04 August 2005 04:21, Martin Josefsson wrote:
> On Thu, 2005-08-04 at 03:36 +1000, Daniel Phillips wrote:
> > I can think of two ways to deal with this:
> >
> >   1) Mindlessly include the entire maximum memory usage of the rx-ring in
> >      the reserve requirement (e.g., (maxskbs * (MTU + k)) / PAGE_SIZE).
>
> Would be dependent on the number of interfaces as well.

Actually, just the number of interfaces being used by network block IO.  
Theoretically unbounded, but not an immediate problem.  This goes on the 
"must fix in order to be perfect" list.

> >   2) Never refill the rx-ring from the reserve.  Instead, if the skbs
> > ever run out (because e1000_alloc_rx_buffers had too many GFP_ATOMIC
> > alloc failures) then use __GFP_MEMALLOC instead of just giving up at that
> > point.
>
> This is how e1000 currently works (suggestions have been made to change
> this to work like the tg3 driver does which has copybreak support etc)
>
> 1. Allocate skbs filling the rx-ring as much as possible
> 2. tell hardware there's new skbs to DMA packets into
> 3. note that an skb has been filled with data (interrupt or polling)
> 4. remove that skb from the rx-ring
> 5. pass the skb up the stack
> 6. goto 3 if quota hasn't been filled
> 7. goto 1 if quota has been filled

Thanks, I originally missed the part about the hardware requiring at least one 
skb in the ring, or else it will drop a packet.

> The skbs allocated to fill the rx-ring are the _same_ skbs that are
> passed up the stack. So you won't see __GFP_MEMALLOC allocated skbs
> until RX_RINGSIZE packets after we got low on memory (fifo ring). I
> can't really say I see how #2 above solves that since we _have_ to
> allocate skbs to fill the rx-ring, otherwise the NIC won't have anywhere
> to put the received packets and will thus drop them in hardware.
>
> Or are you suggesting to let the rx-ring deplete until completely empty
> (or nearly empty) if we are low on memory, and only then start falling
> back to allocating with __GFP_MEMALLOC if GFP_ATOMIC fails?

Yes, exactly.  Except, as you point out, "completely empty" won't do.  We will 
use __GFP_MEMALLOC (or alternatively, a mempool) to ensure that the rx_ring 
always has at least some minimum number N of packets in it, and reserve N*MTU 
bytes (rounded up to whole pages) for the interface.  Actually, we will 
reserve considerably more than that, because we want to be able to have a 
fairly large number of packets in flight on the block IO path, particularly 
under memory pressure.

It doesn't actually matter which packet we return to the reserve, so we might 
want to just count the number of __GFP_MEMALLOC packets in the ring and 
deliver on the direct (non-softnet) path until the count drops to zero.  
Which implies a slightly different approach to the is_memalloc_skb flagging.  
This is just a refinement though, it does not really matter when we reclaim a 
reserve buffer as long as we are certain to reclaim it.

> That could and probably would cause hardware to drop packets, because it
> can run out of fresh rx-descriptors before we manage to start allocating
> with __GFP_MEMALLOC if the packet rate is high; at least it makes it much
> more likely to happen.

It is a matter of setting N, the minimum number of packets in the rx ring high 
enough, no?

OK, the next step is to reroll the patch and make it specific to e1000, which 
I happen to have here and can test.  Also, I will use a mempool this time, so 
we will see how that code looks.

Regards,

Daniel


* Re: [RFC] Net vm deadlock fix (preliminary)
  2005-08-03 20:06       ` Daniel Phillips
@ 2005-08-04 21:51         ` Daniel Phillips
  2005-08-04 22:09           ` Daniel Phillips
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Phillips @ 2005-08-04 21:51 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: netdev, linux-mm

[-- Attachment #1: Type: text/plain, Size: 5373 bytes --]

Hi,

I spent the last day mulling things over and doing research.  It seems to me 
that the patch as first posted is correct and solves the deadlock, except 
that some uses of __GFP_MEMALLOC in __dev_alloc_skb may escape into contexts 
where the reserve is not guaranteed to be reclaimed.  It may be that this 
does not actually happen, but there is enough different usage that I would 
rather err on the side of caution just now, and offer a variant called, e.g., 
dev_memalloc_skb so that drivers will explicitly have to choose to use it (or 
supply the flag to __dev_alloc_skb).  This is just a stack of inline 
functions so there should be no extra object code.  The dev_memalloc_skb 
variant can go away in time, but for now it does no harm.

A minor cleanup: somebody (Rik) complained about his bleeding eyes after 
reading my side-effecty alloc_skb expression, so that was rearranged in a way 
that should optimize to the same thing.

On the "first write the patch, then do the research" principle, the entire 
thread on this topic from the ksummit-2005-discuss mailing list is a good 
read:

http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/thread.html#186

Matt, you are close to the truth here:

http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000242.html

but this part isn't right: "it's important to note here that "making progress" 
may require M acknowledgements to N packets representing a single IO. So we 
need separate send and acknowledge pools for each SO_MEMALLOC socket so that 
we don't find ourselves wedged with M-1 available mempool slots when we're 
waiting on ACKs".  This erroneously assumes that mempools throttle the block 
IO traffic.  In fact, the throttling _must_ take place higher up, in the 
block IO stack.  The block driver that submits the network IO must 
pre-account any per-request resources and block until sufficient resources 
become available.  So the accounting would include both space for transmit 
and acknowledge, and the network block IO protocol must be designed to obey 
that accounting.  (I will wave my hands at the question of how we arrange for 
low-level components to communicate their resource needs to high-level 
throttlers, just for now.)

Andrea, also getting close:

http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000200.html

But there is no need to be random.  Short of actually overflowing the input 
ring buffer, we can be precise about accepting all block IO packets and 
dropping non-blockio traffic as necessary.

Rik, not bad:

http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000218.html

particularly for deducing it from first principles without actually looking at 
the network code ;-)  It is even the same socket flag name as I settled on 
(SO_MEMALLOC).  But step 4 veers off course: out of order does not matter.  
And the conclusion that we can throttle here by dropping non-blockio packets 
is not right: the packets that we got from reserve still can live an 
arbitrary time in the protocol stack, so we could still exhaust the reserve 
and be back to the same bad old deadlock conditions.

Everybody noticed that dropping non-blockio packets is key, and everybody 
missed the fact that softnet introduces additional queues that need 
throttling (which can't be done sanely) or bypassing.  Almost everybody 
noticed that throttling in the block IO submission path is non-optional.

Everybody thought that mempool is the one true way of reserving memory.  I am 
not so sure, though I still intend to produce a mempool variant of the patch.  
One problem I see with mempool is that it not only reserves resources, but 
pins them.  If the kernel is full of mempools pinning memory pages here and 
there, physical defragmentation gets that much harder and the buddy tree will 
fragment that much sooner.  The __GFP_MEMALLOC interface does not have this 
problem because pages stay in the global pool.  So the jury is still out on 
which method is better.

Obviously, to do the job properly, __GFP_MEMALLOC would need a way of resizing 
the memalloc reserve as users are loaded and unloaded.  Such an interface can 
be very lightweight.  I will cook one up just to demonstrate this.

Now, the scheme in my patch does the job and I think it does it in a way that 
works for all drivers, even e1000 (by method 1. in the thread above).  But we 
could tighten this up a little by noticing that it doesn't actually matter 
which socket buffer we return to the pool as long as we are sure to return 
the same amount of memory as we withdrew.  Therefore, we could just account 
the number of pages alloc_skb withdraws, and the number that freeing a packet 
returns.  The e1000 driver would look at this number to determine whether to 
mark a packet as from_reserve or not.  That way, the e1000 driver could set 
things in motion to release reserve resources sooner, rather than waiting for 
certain specially flagged skbs to work their way around the rx-ring.  This 
also makes it easier for related systems (such as delivery hooks) to draw 
from a single pool with accurate accounting.

I will follow the current simple per-skb approach all the way to the end on 
the "perfect is the enemy of good enough" principle.  In the long run, page 
accounting is the way to go.

Next, on to actually trying this.

Regards,

Daniel

[-- Attachment #2: net.memalloc-2.6.12.3-2 --]
[-- Type: text/x-diff, Size: 4778 bytes --]

--- 2.6.12.3.clean/include/linux/gfp.h	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/linux/gfp.h	2005-08-04 00:12:48.000000000 -0400
@@ -39,6 +39,7 @@
 #define __GFP_COMP	0x4000u	/* Add compound page metadata */
 #define __GFP_ZERO	0x8000u	/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
+#define __GFP_MEMALLOC  0x20000u /* Use emergency reserves */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
--- 2.6.12.3.clean/include/linux/skbuff.h	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/linux/skbuff.h	2005-08-04 13:46:35.000000000 -0400
@@ -969,7 +969,6 @@
 		kfree_skb(skb);
 }
 
-#ifndef CONFIG_HAVE_ARCH_DEV_ALLOC_SKB
 /**
  *	__dev_alloc_skb - allocate an skbuff for sending
  *	@length: length to allocate
@@ -982,17 +981,17 @@
  *
  *	%NULL is returned in there is no free memory.
  */
-static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
-					      int gfp_mask)
+static inline struct sk_buff *__dev_alloc_skb(unsigned length, int gfp_mask)
 {
-	struct sk_buff *skb = alloc_skb(length + 16, gfp_mask);
+	int opaque = 16, size = length + opaque;
+	struct sk_buff *skb = alloc_skb(size, gfp_mask & ~__GFP_MEMALLOC);
+
+	if (unlikely(!skb) && (gfp_mask & __GFP_MEMALLOC))
+		skb = alloc_skb(size, gfp_mask);
 	if (likely(skb))
-		skb_reserve(skb, 16);
+		skb_reserve(skb, opaque);
 	return skb;
 }
-#else
-extern struct sk_buff *__dev_alloc_skb(unsigned int length, int gfp_mask);
-#endif
 
 /**
  *	dev_alloc_skb - allocate an skbuff for sending
@@ -1011,6 +1010,16 @@
 	return __dev_alloc_skb(length, GFP_ATOMIC);
 }
 
+static inline struct sk_buff *dev_alloc_skb_reserve(unsigned int length)
+{
+	return __dev_alloc_skb(length, gfp_mask | __GFP_MEMALLOC);
+}
+
+static inline int is_memalloc_skb(struct sk_buff *skb)
+{
+	return !!(skb->priority & __GFP_MEMALLOC);
+}
+
 /**
  *	skb_cow - copy header of skb when it is required
  *	@skb: buffer to cow
--- 2.6.12.3.clean/include/net/sock.h	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/net/sock.h	2005-08-04 00:12:49.000000000 -0400
@@ -382,6 +382,7 @@
 	SOCK_NO_LARGESEND, /* whether to sent large segments or not */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_MEMALLOC, /* protocol can use memalloc reserve */
 };
 
 static inline void sock_set_flag(struct sock *sk, enum sock_flags flag)
@@ -399,6 +400,11 @@
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int is_memalloc_sock(struct sock *sk)
+{
+	return sock_flag(sk, SOCK_MEMALLOC);
+}
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	sk->sk_ack_backlog--;
--- 2.6.12.3.clean/mm/page_alloc.c	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/mm/page_alloc.c	2005-08-04 00:12:49.000000000 -0400
@@ -802,8 +802,8 @@
 
 	/* This allocation should allow future memory freeing. */
 
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
+	if ((((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+			&& !in_interrupt()) || (gfp_mask & __GFP_MEMALLOC)) {
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 			/* go through the zonelist yet again, ignoring mins */
 			for (i = 0; (z = zones[i]) != NULL; i++) {
--- 2.6.12.3.clean/net/core/dev.c	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/core/dev.c	2005-08-04 00:12:49.000000000 -0400
@@ -1452,6 +1452,11 @@
 	struct softnet_data *queue;
 	unsigned long flags;
 
+        if (unlikely(is_memalloc_skb(skb))) {
+                netif_receive_skb(skb);
+                return NET_RX_CN_HIGH;
+        }
+
 	/* if netpoll wants it, pretend we never saw it */
 	if (netpoll_rx(skb))
 		return NET_RX_DROP;
--- 2.6.12.3.clean/net/core/skbuff.c	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/core/skbuff.c	2005-08-04 00:12:49.000000000 -0400
@@ -355,7 +355,7 @@
 	n->nohdr = 0;
 	C(pkt_type);
 	C(ip_summed);
-	C(priority);
+	n->priority = skb->priority & ~__GFP_MEMALLOC;
 	C(protocol);
 	C(security);
 	n->destructor = NULL;
@@ -411,7 +411,7 @@
 	new->sk		= NULL;
 	new->dev	= old->dev;
 	new->real_dev	= old->real_dev;
-	new->priority	= old->priority;
+	new->priority	= old->priority & ~__GFP_MEMALLOC;
 	new->protocol	= old->protocol;
 	new->dst	= dst_clone(old->dst);
 #ifdef CONFIG_INET
--- 2.6.12.3.clean/net/ipv4/tcp_ipv4.c	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/ipv4/tcp_ipv4.c	2005-08-04 00:12:49.000000000 -0400
@@ -1766,6 +1766,9 @@
 	if (!sk)
 		goto no_tcp_socket;
 
+	if (unlikely(is_memalloc_skb(skb)) && !is_memalloc_sock(sk))
+		goto discard_and_relse;
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;


* Re: [RFC] Net vm deadlock fix (preliminary)
  2005-08-04 21:51         ` Daniel Phillips
@ 2005-08-04 22:09           ` Daniel Phillips
  2005-08-04 22:45             ` Daniel Phillips
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Phillips @ 2005-08-04 22:09 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: netdev, linux-mm

-       return __dev_alloc_skb(length, gfp_mask | __GFP_MEMALLOC);
+       return __dev_alloc_skb(length, GFP_ATOMIC|__GFP_MEMALLOC);

Regards,

Daniel


* Re: [RFC] Net vm deadlock fix (preliminary)
  2005-08-04 22:09           ` Daniel Phillips
@ 2005-08-04 22:45             ` Daniel Phillips
  0 siblings, 0 replies; 8+ messages in thread
From: Daniel Phillips @ 2005-08-04 22:45 UTC (permalink / raw)
  To: linux-mm

Hmm, the linux-mm mailer stripped off the first line of my last mail which was 
"Whoops:".  Some fancy pseudo-header parsing perhaps?

Regards,

Daniel

