linux-mm.kvack.org archive mirror
From: Daniel Phillips <phillips@istop.com>
To: netdev@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC] Net vm deadlock fix (preliminary)
Date: Wed, 3 Aug 2005 16:57:34 +1000
Message-ID: <200508031657.34948.phillips@istop.com>

[-- Attachment #1: Type: text/plain, Size: 2439 bytes --]

Hi,

Here is a preliminary patch, not tested at all, just to give everybody a 
target to aim bricks at.

  * A new __GFP_MEMALLOC flag gives access to the memalloc reserve.

  * In __dev_alloc_skb (and so dev_alloc_skb), if the allocation with the
    caller's mask (GFP_ATOMIC for dev_alloc_skb) fails, then try again with
    __GFP_MEMALLOC added.

  * We know an skb was allocated from reserve if we see __GFP_MEMALLOC in the
    (misnamed) priority field.

  * When a driver uses netif_rx to deliver the packet to the protocol layer,
    a packet that was allocated from the reserve is delivered directly to
    the protocol layer; otherwise the packet is queued via softnet as usual.

  * When the protocol handler (tcp/ipv4 in this case) looks up the socket,
    if the packet was allocated from reserve but the socket is not serving
    vm traffic, the packet is discarded.
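
To make the flow in the list above concrete, here is a small userspace model
of it.  It is only a sketch, not kernel code: every model_* name, the
MODEL_GFP_ATOMIC value and the "pool empty" switch are invented for
illustration, and the allocator tagging the buffer explicitly is an
assumption of the model.  Only the 0x20000 bit value is taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model of the receive path; not kernel code. */
#define MODEL_GFP_ATOMIC   0x20u     /* stand-in value, assumption */
#define MODEL_GFP_MEMALLOC 0x20000u  /* same bit as __GFP_MEMALLOC in the patch */

struct model_skb  { unsigned int priority; };
struct model_sock { bool memalloc; };  /* models the SOCK_MEMALLOC flag */

enum model_rx_path { MODEL_QUEUE_VIA_SOFTNET, MODEL_DELIVER_DIRECT };

/* When true, only allocations carrying MODEL_GFP_MEMALLOC succeed. */
static bool model_normal_pool_empty;

static struct model_skb *model_alloc(unsigned int gfp)
{
	struct model_skb *skb;

	if (model_normal_pool_empty && !(gfp & MODEL_GFP_MEMALLOC))
		return NULL;
	skb = malloc(sizeof(*skb));
	if (skb)
		/* Assumption: the allocator records where the memory came from. */
		skb->priority = gfp & MODEL_GFP_MEMALLOC;
	return skb;
}

/* Try the normal allocation first, then retry against the reserve. */
static struct model_skb *model_dev_alloc(void)
{
	struct model_skb *skb = model_alloc(MODEL_GFP_ATOMIC);

	if (!skb)
		skb = model_alloc(MODEL_GFP_ATOMIC | MODEL_GFP_MEMALLOC);
	return skb;
}

static bool model_is_memalloc_skb(const struct model_skb *skb)
{
	return (skb->priority & MODEL_GFP_MEMALLOC) != 0;
}

/* Reserve packets bypass the softnet queue. */
static enum model_rx_path model_netif_rx(const struct model_skb *skb)
{
	return model_is_memalloc_skb(skb) ? MODEL_DELIVER_DIRECT
					  : MODEL_QUEUE_VIA_SOFTNET;
}

/* A reserve packet is only accepted by a socket serving vm traffic. */
static bool model_sock_accepts(const struct model_skb *skb,
			       const struct model_sock *sk)
{
	return !model_is_memalloc_skb(skb) || sk->memalloc;
}
```

With the pool marked empty, model_dev_alloc() hands back a tagged buffer,
model_netif_rx() picks the direct path for it, and model_sock_accepts()
turns it away for any socket whose memalloc flag is clear.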

There are some users of __dev_alloc_skb that inherit the new memalloc behavior 
for free.  This is probably not a good thing.  There are a dozen or so users 
to check... later.

I claimed earlier that an advantage of using the memalloc reserve over a 
mempool is that the pool becomes available to the whole call chain of the 
user.  This isn't true in a softirq, sorry.  Maybe we could make it true, 
that's another question.  Anyway, memalloc reserve vs mempool is a detail at 
this point.
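
For reference, the reserve-access test as changed by the page_alloc.c hunk
of the attached patch can be restated as a standalone predicate.  This is a
sketch only: model_may_use_reserve and the MODEL_* constants are names made
up here, the PF_MEMALLOC value is a stand-in, and the booleans replace the
kernel's test_thread_flag(TIF_MEMDIE) and in_interrupt() calls.

```c
#include <stdbool.h>

/* Hypothetical restatement of the patched condition; not kernel code. */
#define MODEL_PF_MEMALLOC    0x800u    /* stand-in for PF_MEMALLOC, assumption */
#define MODEL_GFP_NOMEMALLOC 0x10000u  /* same bit as __GFP_NOMEMALLOC */
#define MODEL_GFP_MEMALLOC   0x20000u  /* same bit as __GFP_MEMALLOC */

static bool model_may_use_reserve(unsigned int task_flags, bool memdie,
				  bool in_irq, unsigned int gfp_mask)
{
	/* The task-based grant is void in interrupt (softirq) context. */
	bool task_entitled =
		((task_flags & MODEL_PF_MEMALLOC) || memdie) && !in_irq;

	/* The new flag grants access regardless of context. */
	if (!task_entitled && !(gfp_mask & MODEL_GFP_MEMALLOC))
		return false;

	/* __GFP_NOMEMALLOC still vetoes either grant. */
	return !(gfp_mask & MODEL_GFP_NOMEMALLOC);
}
```

This shows why a PF_MEMALLOC task's privilege does not follow the packet
into softirq context, while an explicit flag on the allocation does.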

There is a big hole here through which precious reserve memory can escape: if 
the network driver is allocating packets from reserve but a protocol handler 
does not test such packets, things will deteriorate quickly.  The easiest 
thing to do is make sure all protocols know about this logic.  They probably 
all need to anyway.

A memalloc task (one handling IO on behalf of the vm) will set the SO_MEMALLOC 
flag after creating the socket.  The memalloc task will throttle the amount 
of traffic in flight to keep the maximum reserve usage to some reasonable 
amount.  (It will be necessary to get more precise about this at some point.) 
The memalloc task itself will be in PF_MEMALLOC mode when it uses this 
socket.
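
One way the throttling could look, purely as an illustration: the memalloc
task keeps a byte budget for traffic in flight and refuses to start a
request that would exceed it.  Nothing like this is in the patch; the
struct, the function names and the idea of counting bytes are all
assumptions, and the real accounting still has to be worked out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-flight budget for a memalloc task; not part of the patch. */
struct model_throttle {
	size_t in_flight;  /* bytes currently outstanding */
	size_t limit;      /* upper bound on reserve usage we are willing to cause */
};

/* Returns true and charges the budget if the request fits, else false. */
static bool model_throttle_start(struct model_throttle *t, size_t bytes)
{
	/* in_flight never exceeds limit, so the subtraction cannot wrap. */
	if (bytes > t->limit - t->in_flight)
		return false;
	t->in_flight += bytes;
	return true;
}

/* Releases budget when a request completes; clamps at zero. */
static void model_throttle_done(struct model_throttle *t, size_t bytes)
{
	t->in_flight -= bytes > t->in_flight ? t->in_flight : bytes;
}
```

A caller that gets false from model_throttle_start() would wait for an
outstanding request to finish before submitting more.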

This patch only covers socket input, not output.  As you can see, the fast 
path is not compromised at all, and even when the low memory path triggers, 
efficiency only falls off a little (allocations may take longer and we bypass 
the softnet optimization).  More importantly, we do not fall back to 
single-request-at-a-time handling, which is exactly what you don't want to do 
when the vm is desperately trying to clean memory.

Regards,

Daniel

[-- Attachment #2: net.memalloc-2.6.12.3 --]
[-- Type: text/x-diff, Size: 4402 bytes --]

--- 2.6.12.3.clean/include/linux/gfp.h	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/linux/gfp.h	2005-08-03 01:12:33.000000000 -0400
@@ -39,6 +39,7 @@
 #define __GFP_COMP	0x4000u	/* Add compound page metadata */
 #define __GFP_ZERO	0x8000u	/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
+#define __GFP_MEMALLOC  0x20000u /* Use emergency reserves */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
--- 2.6.12.3.clean/include/linux/skbuff.h	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/linux/skbuff.h	2005-08-03 01:53:43.000000000 -0400
@@ -969,7 +969,6 @@
 		kfree_skb(skb);
 }
 
-#ifndef CONFIG_HAVE_ARCH_DEV_ALLOC_SKB
 /**
  *	__dev_alloc_skb - allocate an skbuff for sending
  *	@length: length to allocate
@@ -985,14 +984,14 @@
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
 					      int gfp_mask)
 {
-	struct sk_buff *skb = alloc_skb(length + 16, gfp_mask);
+	struct sk_buff *skb = alloc_skb(length += 16, gfp_mask);
+
+	if (unlikely(!skb))
+		skb = alloc_skb(length, gfp_mask|__GFP_MEMALLOC);
 	if (likely(skb))
 		skb_reserve(skb, 16);
 	return skb;
 }
-#else
-extern struct sk_buff *__dev_alloc_skb(unsigned int length, int gfp_mask);
-#endif
 
 /**
  *	dev_alloc_skb - allocate an skbuff for sending
@@ -1011,6 +1010,11 @@
 	return __dev_alloc_skb(length, GFP_ATOMIC);
 }
 
+static inline int is_memalloc_skb(struct sk_buff *skb)
+{
+	return !!(skb->priority & __GFP_MEMALLOC);
+}
+
 /**
  *	skb_cow - copy header of skb when it is required
  *	@skb: buffer to cow
--- 2.6.12.3.clean/include/net/sock.h	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/net/sock.h	2005-08-03 01:20:56.000000000 -0400
@@ -382,6 +382,7 @@
 	SOCK_NO_LARGESEND, /* whether to sent large segments or not */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_MEMALLOC, /* protocol can use memalloc reserve */
 };
 
 static inline void sock_set_flag(struct sock *sk, enum sock_flags flag)
@@ -399,6 +400,11 @@
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int is_memalloc_sock(struct sock *sk)
+{
+	return sock_flag(sk, SOCK_MEMALLOC);
+}
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	sk->sk_ack_backlog--;
--- 2.6.12.3.clean/mm/page_alloc.c	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/mm/page_alloc.c	2005-08-03 01:46:10.000000000 -0400
@@ -802,8 +802,8 @@
 
 	/* This allocation should allow future memory freeing. */
 
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
+	if ((((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+			&& !in_interrupt()) || (gfp_mask & __GFP_MEMALLOC)) {
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 			/* go through the zonelist yet again, ignoring mins */
 			for (i = 0; (z = zones[i]) != NULL; i++) {
--- 2.6.12.3.clean/net/core/dev.c	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/core/dev.c	2005-08-03 01:42:46.000000000 -0400
@@ -1452,6 +1452,11 @@
 	struct softnet_data *queue;
 	unsigned long flags;
 
+	if (unlikely(is_memalloc_skb(skb))) {
+		netif_receive_skb(skb);
+		return NET_RX_CN_HIGH;
+	}
+
 	/* if netpoll wants it, pretend we never saw it */
 	if (netpoll_rx(skb))
 		return NET_RX_DROP;
--- 2.6.12.3.clean/net/core/skbuff.c	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/core/skbuff.c	2005-08-03 01:36:50.000000000 -0400
@@ -355,7 +355,7 @@
 	n->nohdr = 0;
 	C(pkt_type);
 	C(ip_summed);
-	C(priority);
+	n->priority = skb->priority & ~__GFP_MEMALLOC;
 	C(protocol);
 	C(security);
 	n->destructor = NULL;
@@ -411,7 +411,7 @@
 	new->sk		= NULL;
 	new->dev	= old->dev;
 	new->real_dev	= old->real_dev;
-	new->priority	= old->priority;
+	new->priority	= old->priority & ~__GFP_MEMALLOC;
 	new->protocol	= old->protocol;
 	new->dst	= dst_clone(old->dst);
 #ifdef CONFIG_INET
--- 2.6.12.3.clean/net/ipv4/tcp_ipv4.c	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/ipv4/tcp_ipv4.c	2005-08-02 21:35:54.000000000 -0400
@@ -1766,6 +1766,9 @@
 	if (!sk)
 		goto no_tcp_socket;
 
+	if (unlikely(is_memalloc_skb(skb)) && !is_memalloc_sock(sk))
+		goto discard_and_relse;
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;
