* [RFC][PATCH 0/6] VM deadlock avoidance -v9
@ 2006-11-30 10:14 Peter Zijlstra
  2006-11-30 10:14 ` [RFC][PATCH 1/6] mm: slab allocation fairness Peter Zijlstra
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Peter Zijlstra @ 2006-11-30 10:14 UTC (permalink / raw)
  To: netdev, linux-mm; +Cc: David Miller, Peter Zijlstra

Hi,

I have a new version of these patches; I'm still using SOCK_VMIO socket
tagging and skb->emergency marks, since I have not come up with another
approach that might work and my RFC to netdev has so far been ignored.

Other than that, though, it has changed quite a bit:

 - I now use the regular allocation paths and cover all allocations
needed to process an skb (although the RX pool sizing might need more
variables).

 - The emergency RX pool size is based on ip[46]frag_high_thresh and
ip[46]_rt_max_size so that fragment assembly and dst route cache
allocations cannot exhaust the memory. (More paths may need analysis:
xfrm, conntrack?)

 - skb->emergency packets skip taps

 - skb->emergency packets warn about and ignore NF_QUEUE targets

The patches definitely need more work, but would you agree with the
general direction I'm taking, or would you suggest yet another
direction?

Kind regards,

Peter




* [RFC][PATCH 1/6] mm: slab allocation fairness
  2006-11-30 10:14 [RFC][PATCH 0/6] VM deadlock avoidance -v9 Peter Zijlstra
@ 2006-11-30 10:14 ` Peter Zijlstra
  2006-11-30 18:52   ` Christoph Lameter
  2006-11-30 10:14 ` [RFC][PATCH 2/6] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2006-11-30 10:14 UTC (permalink / raw)
  To: netdev, linux-mm; +Cc: David Miller, Peter Zijlstra

[-- Attachment #1: slab-ranking.patch --]
[-- Type: text/plain, Size: 15071 bytes --]

The slab allocator has some unfairness with respect to gfp flags: when the
slab is grown, the gfp flags are used to allocate more memory; however, when
there is slab space available, the gfp flags are ignored. Thus it is possible
for less critical slab allocations to succeed and gobble up precious memory.

This patch avoids that by recording how hard the page allocator had to work
when growing the slab, and comparing that rank against the gfp flags of the
current slab allocation.
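
To make the ordering concrete, here is a minimal sketch (illustration only,
not part of the patch; the helper is an assumption distilled from the code
below) of how the rank comparison gates the per-cpu object cache:

	/*
	 * 0 is the hardest rank (may ignore watermarks entirely),
	 * MAX_ALLOC_RANK (16) the easiest.  cache_rank records how deep
	 * the page allocator had to reach the last time the cache was
	 * grown (cachep->rank below).
	 */
	static inline int may_take_cached_object(int request_rank, int cache_rank)
	{
		/*
		 * A request that is easier than the last grow must not
		 * consume objects that were obtained from the reserves;
		 * it is sent down the cache_grow() path with its own gfp
		 * flags instead, where it may fail.
		 */
		return request_rank <= cache_rank;
	}

E.g. a plain GFP_KERNEL allocation (rank > 0) against a cache last grown with
ALLOC_NO_WATERMARKS (rank 0) must grow the cache itself, while a PF_MEMALLOC
allocation (rank 0) may still be served from the cached objects.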

[AIM9 results go here]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/internal.h   |   89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c |   58 ++++++++++--------------------------
 mm/slab.c       |   57 ++++++++++++++++++++++-------------
 3 files changed, 142 insertions(+), 62 deletions(-)

Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h	2006-11-29 16:15:54.000000000 +0100
+++ linux-2.6-git/mm/internal.h	2006-11-29 16:23:02.000000000 +0100
@@ -12,6 +12,7 @@
 #define __MM_INTERNAL_H
 
 #include <linux/mm.h>
+#include <linux/hardirq.h>
 
 static inline void set_page_count(struct page *page, int v)
 {
@@ -37,4 +38,92 @@ static inline void __put_page(struct pag
 extern void fastcall __init __free_pages_bootmem(struct page *page,
 						unsigned int order);
 
+#define ALLOC_HARDER		0x01 /* try to alloc harder */
+#define ALLOC_HIGH		0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN		0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW		0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH	0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS	0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+
+/*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+static int inline gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+	struct task_struct *p = current;
+	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+	/*
+	 * The caller may dip into page reserves a bit more if the caller
+	 * cannot run direct reclaim, or if the caller has realtime scheduling
+	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
+	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+	 */
+	if (gfp_mask & __GFP_HIGH)
+		alloc_flags |= ALLOC_HIGH;
+
+	if (!wait) {
+		alloc_flags |= ALLOC_HARDER;
+		/*
+		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+		 */
+		alloc_flags &= ~ALLOC_CPUSET;
+	} else if (unlikely(rt_task(p)) && !in_interrupt())
+		alloc_flags |= ALLOC_HARDER;
+
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (!in_interrupt() &&
+		    ((p->flags & PF_MEMALLOC) ||
+		     unlikely(test_thread_flag(TIF_MEMDIE))))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+	}
+
+	return alloc_flags;
+}
+
+#define MAX_ALLOC_RANK	16
+
+/*
+ * classify the allocation: 0 is hardest, 16 is easiest.
+ */
+static inline int alloc_flags_to_rank(int alloc_flags)
+{
+	int rank;
+
+	if (alloc_flags & ALLOC_NO_WATERMARKS)
+		return 0;
+
+	rank = alloc_flags & (ALLOC_WMARK_MIN|ALLOC_WMARK_LOW|ALLOC_WMARK_HIGH);
+	rank -= alloc_flags & (ALLOC_HARDER|ALLOC_HIGH);
+
+	return rank;
+}
+
+static inline int gfp_to_rank(gfp_t gfp_mask)
+{
+	/*
+	 * Although correct this full version takes a ~3% performance
+	 * hit on the network tests in aim9.
+	 *
+
+	return alloc_flags_to_rank(gfp_to_alloc_flags(gfp_mask));
+
+	 *
+	 * Just check the bare essential ALLOC_NO_WATERMARKS case this keeps
+	 * the aim9 results within the error margin.
+	 */
+
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (!in_interrupt() &&
+		    ((current->flags & PF_MEMALLOC) ||
+		     unlikely(test_thread_flag(TIF_MEMDIE))))
+			return 0;
+	}
+
+	return 1;
+}
+
 #endif
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2006-11-29 16:21:27.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2006-11-29 16:21:55.000000000 +0100
@@ -885,14 +885,6 @@ failed:
 	return NULL;
 }
 
-#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
-#define ALLOC_HARDER		0x10 /* try to alloc harder */
-#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-
 /*
  * Return 1 if free pages are above 'mark'. This takes into account the order
  * of the allocation.
@@ -968,6 +960,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
 
 		page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
 		if (page) {
+			page->index = alloc_flags_to_rank(alloc_flags);
 			break;
 		}
 	} while (*(++z) != NULL);
@@ -1013,47 +1006,26 @@ restart:
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
-	 *
-	 * The caller may dip into page reserves a bit more if the caller
-	 * cannot run direct reclaim, or if the caller has realtime scheduling
-	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 	 */
-	alloc_flags = ALLOC_WMARK_MIN;
-	if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
-		alloc_flags |= ALLOC_HARDER;
-	if (gfp_mask & __GFP_HIGH)
-		alloc_flags |= ALLOC_HIGH;
-	if (wait)
-		alloc_flags |= ALLOC_CPUSET;
+	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-	/*
-	 * Go through the zonelist again. Let __GFP_HIGH and allocations
-	 * coming from realtime tasks go deeper into reserves.
-	 *
-	 * This is the last chance, in general, before the goto nopage.
-	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
-	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+	/* This is the last chance, in general, before the goto nopage. */
+	page = get_page_from_freelist(gfp_mask, order, zonelist,
+			alloc_flags & ~ALLOC_NO_WATERMARKS);
 	if (page)
 		goto got_pg;
 
 	/* This allocation should allow future memory freeing. */
-
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
-		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
+	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
-			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
+		/* go through the zonelist yet again, ignoring mins */
+		page = get_page_from_freelist(gfp_mask, order,
 				zonelist, ALLOC_NO_WATERMARKS);
-			if (page)
-				goto got_pg;
-			if (gfp_mask & __GFP_NOFAIL) {
-				congestion_wait(WRITE, HZ/50);
-				goto nofail_alloc;
-			}
+		if (page)
+			goto got_pg;
+		if (wait && (gfp_mask & __GFP_NOFAIL)) {
+			congestion_wait(WRITE, HZ/50);
+			goto nofail_alloc;
 		}
 		goto nopage;
 	}
@@ -1062,6 +1034,10 @@ nofail_alloc:
 	if (!wait)
 		goto nopage;
 
+	/* Avoid recursion of direct reclaim */
+	if (p->flags & PF_MEMALLOC)
+		goto nopage;
+
 rebalance:
 	cond_resched();
 
Index: linux-2.6-git/mm/slab.c
===================================================================
--- linux-2.6-git.orig/mm/slab.c	2006-11-29 16:15:55.000000000 +0100
+++ linux-2.6-git/mm/slab.c	2006-11-29 16:21:55.000000000 +0100
@@ -112,6 +112,7 @@
 #include	<asm/cacheflush.h>
 #include	<asm/tlbflush.h>
 #include	<asm/page.h>
+#include	"internal.h"
 
 /*
  * DEBUG	- 1 for kmem_cache_create() to honour; SLAB_DEBUG_INITIAL,
@@ -378,6 +379,7 @@ static void kmem_list3_init(struct kmem_
 
 struct kmem_cache {
 /* 1) per-cpu data, touched during every alloc/free */
+	int rank;
 	struct array_cache *array[NR_CPUS];
 /* 2) Cache tunables. Protected by cache_chain_mutex */
 	unsigned int batchcount;
@@ -991,21 +993,21 @@ static inline int cache_free_alien(struc
 }
 
 static inline void *alternate_node_alloc(struct kmem_cache *cachep,
-		gfp_t flags)
+		gfp_t flags, int rank)
 {
 	return NULL;
 }
 
 static inline void *__cache_alloc_node(struct kmem_cache *cachep,
-		 gfp_t flags, int nodeid)
+		 gfp_t flags, int nodeid, int rank)
 {
 	return NULL;
 }
 
 #else	/* CONFIG_NUMA */
 
-static void *__cache_alloc_node(struct kmem_cache *, gfp_t, int);
-static void *alternate_node_alloc(struct kmem_cache *, gfp_t);
+static void *__cache_alloc_node(struct kmem_cache *, gfp_t, int, int);
+static void *alternate_node_alloc(struct kmem_cache *, gfp_t, int);
 
 static struct array_cache **alloc_alien_cache(int node, int limit)
 {
@@ -1591,6 +1593,7 @@ static void *kmem_getpages(struct kmem_c
 	if (!page)
 		return NULL;
 
+	cachep->rank = page->index;
 	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		add_zone_page_state(page_zone(page),
@@ -2245,6 +2248,7 @@ kmem_cache_create (const char *name, siz
 	}
 #endif
 #endif
+	cachep->rank = MAX_ALLOC_RANK;
 
 	/*
 	 * Determine if the slab management is 'on' or 'off' slab.
@@ -2917,7 +2921,7 @@ bad:
 #define check_slabp(x,y) do { } while(0)
 #endif
 
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags, int rank)
 {
 	int batchcount;
 	struct kmem_list3 *l3;
@@ -2929,6 +2933,8 @@ static void *cache_alloc_refill(struct k
 	check_irq_off();
 	ac = cpu_cache_get(cachep);
 retry:
+	if (unlikely(rank > cachep->rank))
+		goto force_grow;
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
 		/*
@@ -2984,14 +2990,16 @@ must_grow:
 	l3->free_objects -= ac->avail;
 alloc_done:
 	spin_unlock(&l3->list_lock);
-
 	if (unlikely(!ac->avail)) {
 		int x;
+force_grow:
 		x = cache_grow(cachep, flags, node);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
-		if (!x && ac->avail == 0)	/* no objects in sight? abort */
+
+		/* no objects in sight? abort */
+		if (!x && (ac->avail == 0 || rank > cachep->rank))
 			return NULL;
 
 		if (!ac->avail)		/* objects refilled by interrupt? */
@@ -3069,20 +3077,21 @@ static void *cache_alloc_debugcheck_afte
 #define cache_alloc_debugcheck_after(a,b,objp,d) (objp)
 #endif
 
-static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
+static inline void *____cache_alloc(struct kmem_cache *cachep,
+		gfp_t flags, int rank)
 {
 	void *objp;
 	struct array_cache *ac;
 
 	check_irq_off();
 	ac = cpu_cache_get(cachep);
-	if (likely(ac->avail)) {
+	if (likely(ac->avail && rank <= cachep->rank)) {
 		STATS_INC_ALLOCHIT(cachep);
 		ac->touched = 1;
 		objp = ac->entry[--ac->avail];
 	} else {
 		STATS_INC_ALLOCMISS(cachep);
-		objp = cache_alloc_refill(cachep, flags);
+		objp = cache_alloc_refill(cachep, flags, rank);
 	}
 	return objp;
 }
@@ -3092,6 +3101,7 @@ static __always_inline void *__cache_all
 {
 	unsigned long save_flags;
 	void *objp = NULL;
+	int rank = gfp_to_rank(flags);
 
 	cache_alloc_debugcheck_before(cachep, flags);
 
@@ -3099,16 +3109,16 @@ static __always_inline void *__cache_all
 
 	if (unlikely(NUMA_BUILD &&
 			current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
-		objp = alternate_node_alloc(cachep, flags);
+		objp = alternate_node_alloc(cachep, flags, rank);
 
 	if (!objp)
-		objp = ____cache_alloc(cachep, flags);
+		objp = ____cache_alloc(cachep, flags, rank);
 	/*
 	 * We may just have run out of memory on the local node.
 	 * __cache_alloc_node() knows how to locate memory on other nodes
 	 */
  	if (NUMA_BUILD && !objp)
- 		objp = __cache_alloc_node(cachep, flags, numa_node_id());
+ 		objp = __cache_alloc_node(cachep, flags, numa_node_id(), rank);
 	local_irq_restore(save_flags);
 	objp = cache_alloc_debugcheck_after(cachep, flags, objp,
 					    caller);
@@ -3123,7 +3133,8 @@ static __always_inline void *__cache_all
  * If we are in_interrupt, then process context, including cpusets and
  * mempolicy, may not apply and should not be used for allocation policy.
  */
-static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
+static void *alternate_node_alloc(struct kmem_cache *cachep,
+		gfp_t flags, int rank)
 {
 	int nid_alloc, nid_here;
 
@@ -3135,7 +3146,7 @@ static void *alternate_node_alloc(struct
 	else if (current->mempolicy)
 		nid_alloc = slab_node(current->mempolicy);
 	if (nid_alloc != nid_here)
-		return __cache_alloc_node(cachep, flags, nid_alloc);
+		return __cache_alloc_node(cachep, flags, nid_alloc, rank);
 	return NULL;
 }
 
@@ -3145,7 +3156,7 @@ static void *alternate_node_alloc(struct
  * the page allocator. We fall back according to a zonelist determined by
  * the policy layer while obeying cpuset constraints.
  */
-void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
+void *fallback_alloc(struct kmem_cache *cache, gfp_t flags, int rank)
 {
 	struct zonelist *zonelist = &NODE_DATA(slab_node(current->mempolicy))
 					->node_zonelists[gfp_zone(flags)];
@@ -3159,7 +3170,7 @@ void *fallback_alloc(struct kmem_cache *
 				cpuset_zone_allowed(*z, flags) &&
 				cache->nodelists[nid])
 			obj = __cache_alloc_node(cache,
-					flags | __GFP_THISNODE, nid);
+					flags | __GFP_THISNODE, nid, rank);
 	}
 	return obj;
 }
@@ -3168,7 +3179,7 @@ void *fallback_alloc(struct kmem_cache *
  * A interface to enable slab creation on nodeid
  */
 static void *__cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
-				int nodeid)
+				int nodeid, int rank)
 {
 	struct list_head *entry;
 	struct slab *slabp;
@@ -3181,6 +3192,8 @@ static void *__cache_alloc_node(struct k
 
 retry:
 	check_irq_off();
+	if (unlikely(rank > cachep->rank))
+		goto force_grow;
 	spin_lock(&l3->list_lock);
 	entry = l3->slabs_partial.next;
 	if (entry == &l3->slabs_partial) {
@@ -3216,13 +3229,14 @@ retry:
 
 must_grow:
 	spin_unlock(&l3->list_lock);
+force_grow:
 	x = cache_grow(cachep, flags, nodeid);
 	if (x)
 		goto retry;
 
 	if (!(flags & __GFP_THISNODE))
 		/* Unable to grow the cache. Fall back to other nodes. */
-		return fallback_alloc(cachep, flags);
+		return fallback_alloc(cachep, flags, rank);
 
 	return NULL;
 
@@ -3444,15 +3458,16 @@ void *kmem_cache_alloc_node(struct kmem_
 {
 	unsigned long save_flags;
 	void *ptr;
+	int rank = gfp_to_rank(flags);
 
 	cache_alloc_debugcheck_before(cachep, flags);
 	local_irq_save(save_flags);
 
 	if (nodeid == -1 || nodeid == numa_node_id() ||
 			!cachep->nodelists[nodeid])
-		ptr = ____cache_alloc(cachep, flags);
+		ptr = ____cache_alloc(cachep, flags, rank);
 	else
-		ptr = __cache_alloc_node(cachep, flags, nodeid);
+		ptr = __cache_alloc_node(cachep, flags, nodeid, rank);
 	local_irq_restore(save_flags);
 
 	ptr = cache_alloc_debugcheck_after(cachep, flags, ptr,

--



* [RFC][PATCH 2/6] mm: allow PF_MEMALLOC from softirq context
  2006-11-30 10:14 [RFC][PATCH 0/6] VM deadlock avoidance -v9 Peter Zijlstra
  2006-11-30 10:14 ` [RFC][PATCH 1/6] mm: slab allocation fairness Peter Zijlstra
@ 2006-11-30 10:14 ` Peter Zijlstra
  2006-11-30 10:14 ` [RFC][PATCH 3/6] mm: serialize access to min_free_kbytes Peter Zijlstra
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2006-11-30 10:14 UTC (permalink / raw)
  To: netdev, linux-mm; +Cc: David Miller, Peter Zijlstra

[-- Attachment #1: page-alloc-softirq.patch --]
[-- Type: text/plain, Size: 2343 bytes --]

Allow PF_MEMALLOC to be set from softirq context. When running softirqs from
a borrowed context, save and restore current->flags so the interrupted task's
flags are not clobbered; ksoftirqd has its own task_struct and needs no such
care.
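
As a sketch of the intended use (the real user is netif_receive_skb() in
patch 6/6; the function name here is made up for illustration), a softirq
path that must not fail can now temporarily grant itself reserve access,
mirroring the save/restore done in __do_softirq() below:

	#include <linux/sched.h>
	#include <linux/skbuff.h>

	static void softirq_process_with_reserves(struct sk_buff *skb)
	{
		unsigned long pflags = current->flags;

		current->flags |= PF_MEMALLOC;
		/*
		 * GFP_ATOMIC allocations made while handling skb may now
		 * dip into the reserves: gfp_to_alloc_flags() only ignores
		 * PF_MEMALLOC in hardirq context (in_irq()), no longer in
		 * softirq context.
		 */
		kfree_skb(skb);			/* stand-in for real work */
		current->flags = pflags;
	}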

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/softirq.c |    3 +++
 mm/internal.h    |   14 ++++++++------
 2 files changed, 11 insertions(+), 6 deletions(-)

Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h	2006-11-29 16:23:02.000000000 +0100
+++ linux-2.6-git/mm/internal.h	2006-11-29 16:34:17.000000000 +0100
@@ -75,9 +75,10 @@ static int inline gfp_to_alloc_flags(gfp
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((p->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+		if (!in_irq() && (p->flags & PF_MEMALLOC))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_interrupt() &&
+				unlikely(test_thread_flag(TIF_MEMDIE)))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
@@ -117,9 +118,10 @@ static inline int gfp_to_rank(gfp_t gfp_
 	 */
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((current->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+		if (!in_irq() && (current->flags & PF_MEMALLOC))
+			return 0;
+		else if (!in_interrupt() &&
+				unlikely(test_thread_flag(TIF_MEMDIE)))
 			return 0;
 	}
 
Index: linux-2.6-git/kernel/softirq.c
===================================================================
--- linux-2.6-git.orig/kernel/softirq.c	2006-11-29 16:15:54.000000000 +0100
+++ linux-2.6-git/kernel/softirq.c	2006-11-29 16:32:47.000000000 +0100
@@ -209,6 +209,8 @@ asmlinkage void __do_softirq(void)
 	__u32 pending;
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu;
+	unsigned long pflags = current->flags;
+	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
 	account_system_vtime(current);
@@ -247,6 +249,7 @@ restart:
 
 	account_system_vtime(current);
 	_local_bh_enable();
+	current->flags = pflags;
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ

--



* [RFC][PATCH 3/6] mm: serialize access to min_free_kbytes
  2006-11-30 10:14 [RFC][PATCH 0/6] VM deadlock avoidance -v9 Peter Zijlstra
  2006-11-30 10:14 ` [RFC][PATCH 1/6] mm: slab allocation fairness Peter Zijlstra
  2006-11-30 10:14 ` [RFC][PATCH 2/6] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
@ 2006-11-30 10:14 ` Peter Zijlstra
  2006-11-30 10:14 ` [RFC][PATCH 4/6] mm: emergency pool and __GFP_EMERGENCY Peter Zijlstra
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2006-11-30 10:14 UTC (permalink / raw)
  To: netdev, linux-mm; +Cc: David Miller, Peter Zijlstra

[-- Attachment #1: setup_per_zone_pages_min.patch --]
[-- Type: text/plain, Size: 2139 bytes --]

There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page_alloc.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2006-11-20 13:45:27.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2006-11-20 13:47:39.000000000 +0100
@@ -101,6 +101,7 @@ static char *zone_names[MAX_NR_ZONES] = 
 #endif
 };
 
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
@@ -2806,12 +2807,12 @@ static void setup_per_zone_lowmem_reserv
 }
 
 /**
- * setup_per_zone_pages_min - called when min_free_kbytes changes.
+ * __setup_per_zone_pages_min - called when min_free_kbytes changes.
  *
  * Ensures that the pages_{min,low,high} values for each zone are set correctly
  * with respect to min_free_kbytes.
  */
-void setup_per_zone_pages_min(void)
+static void __setup_per_zone_pages_min(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
@@ -2865,6 +2866,15 @@ void setup_per_zone_pages_min(void)
 	calculate_totalreserve_pages();
 }
 
+void setup_per_zone_pages_min(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&min_free_lock, flags);
+	__setup_per_zone_pages_min();
+	spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -2900,7 +2910,7 @@ static int __init init_per_zone_pages_mi
 		min_free_kbytes = 128;
 	if (min_free_kbytes > 65536)
 		min_free_kbytes = 65536;
-	setup_per_zone_pages_min();
+	__setup_per_zone_pages_min();
 	setup_per_zone_lowmem_reserve();
 	return 0;
 }

--



* [RFC][PATCH 4/6] mm: emergency pool and __GFP_EMERGENCY
  2006-11-30 10:14 [RFC][PATCH 0/6] VM deadlock avoidance -v9 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2006-11-30 10:14 ` [RFC][PATCH 3/6] mm: serialize access to min_free_kbytes Peter Zijlstra
@ 2006-11-30 10:14 ` Peter Zijlstra
  2006-11-30 10:14 ` [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages() Peter Zijlstra
  2006-11-30 10:14 ` [RFC][PATCH 6/6] net: vm deadlock avoidance core Peter Zijlstra
  5 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2006-11-30 10:14 UTC (permalink / raw)
  To: netdev, linux-mm; +Cc: David Miller, Peter Zijlstra

[-- Attachment #1: page_alloc-GFP_EMERGENCY.patch --]
[-- Type: text/plain, Size: 9646 bytes --]

Introduce __GFP_EMERGENCY and an emergency pool.

__GFP_EMERGENCY will allow the allocation to disregard the watermarks, 
much like PF_MEMALLOC. The emergency pool is separated from the min watermark
because ALLOC_HARDER and ALLOC_HIGH modify the watermark in a relative way
and thus do not ensure a strict minimum.
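
For illustration, a minimal sketch of the intended usage (the module shape
and the figure of 64 pages are assumptions; the real users are the networking
patches later in this series):

	#include <linux/init.h>
	#include <linux/gfp.h>
	#include <linux/mm.h>
	#include <linux/mmzone.h>

	static int __init example_reserve_init(void)
	{
		adjust_memalloc_reserve(64);	/* add 64 pages to pages_emerg */
		return 0;
	}

	static struct page *example_emergency_alloc(void)
	{
		/* may go below the watermarks, into the emergency pages */
		return alloc_page(GFP_ATOMIC | __GFP_EMERGENCY);
	}

	static void __exit example_reserve_exit(void)
	{
		adjust_memalloc_reserve(-64);	/* give the reserve back */
	}

(Note that adjust_memalloc_reserve() itself is not serialized; the real
caller, sk_adjust_memalloc() in 6/6, takes care of that.)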

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/gfp.h    |    7 ++++++-
 include/linux/mmzone.h |    3 ++-
 mm/internal.h          |   10 +++++++---
 mm/page_alloc.c        |   48 +++++++++++++++++++++++++++++++++++++++++-------
 mm/vmstat.c            |    6 +++---
 5 files changed, 59 insertions(+), 15 deletions(-)

Index: linux-2.6-git/include/linux/gfp.h
===================================================================
--- linux-2.6-git.orig/include/linux/gfp.h	2006-11-30 10:56:35.000000000 +0100
+++ linux-2.6-git/include/linux/gfp.h	2006-11-30 10:56:43.000000000 +0100
@@ -35,17 +35,21 @@ struct vm_area_struct;
 #define __GFP_HIGH	((__force gfp_t)0x20u)	/* Should access emergency pools? */
 #define __GFP_IO	((__force gfp_t)0x40u)	/* Can start physical IO? */
 #define __GFP_FS	((__force gfp_t)0x80u)	/* Can call down to low-level FS? */
+
 #define __GFP_COLD	((__force gfp_t)0x100u)	/* Cache-cold page required */
 #define __GFP_NOWARN	((__force gfp_t)0x200u)	/* Suppress page allocation failure warning */
 #define __GFP_REPEAT	((__force gfp_t)0x400u)	/* Retry the allocation.  Might fail */
 #define __GFP_NOFAIL	((__force gfp_t)0x800u)	/* Retry for ever.  Cannot fail */
+
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* Do not retry.  Might fail */
 #define __GFP_NO_GROW	((__force gfp_t)0x2000u)/* Slab internal usage */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
+
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_EMERGENCY  ((__force gfp_t)0x80000u) /* Use emergency reserves */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +58,8 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE)
+			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE| \
+			__GFP_EMERGENCY)
 
 /* This equals 0, but use constants in case they ever change */
 #define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
Index: linux-2.6-git/include/linux/mmzone.h
===================================================================
--- linux-2.6-git.orig/include/linux/mmzone.h	2006-11-30 10:56:35.000000000 +0100
+++ linux-2.6-git/include/linux/mmzone.h	2006-11-30 10:56:43.000000000 +0100
@@ -156,7 +156,7 @@ enum zone_type {
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 	unsigned long		free_pages;
-	unsigned long		pages_min, pages_low, pages_high;
+	unsigned long		pages_emerg, pages_min, pages_low, pages_high;
 	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
@@ -461,6 +461,7 @@ int sysctl_min_unmapped_ratio_sysctl_han
 			struct file *, void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
+void adjust_memalloc_reserve(int pages);
 
 #include <linux/topology.h>
 /* Returns the number of the current Node. */
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2006-11-30 10:56:43.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2006-11-30 11:00:02.000000000 +0100
@@ -103,6 +103,7 @@ static char *zone_names[MAX_NR_ZONES] = 
 
 static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
+int var_free_kbytes;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
@@ -903,7 +904,8 @@ int zone_watermark_ok(struct zone *z, in
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
 
-	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
+	if (free_pages <= min + z->lowmem_reserve[classzone_idx] +
+			z->pages_emerg)
 		return 0;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
@@ -1344,9 +1346,9 @@ void show_free_areas(void)
 			"\n",
 			zone->name,
 			K(zone->free_pages),
-			K(zone->pages_min),
-			K(zone->pages_low),
-			K(zone->pages_high),
+			K(zone->pages_emerg + zone->pages_min),
+			K(zone->pages_emerg + zone->pages_low),
+			K(zone->pages_emerg + zone->pages_high),
 			K(zone->nr_active),
 			K(zone->nr_inactive),
 			K(zone->present_pages),
@@ -2757,7 +2759,7 @@ static void calculate_totalreserve_pages
 			}
 
 			/* we treat pages_high as reserved pages. */
-			max += zone->pages_high;
+			max += zone->pages_high + zone->pages_emerg;
 
 			if (max > zone->present_pages)
 				max = zone->present_pages;
@@ -2814,7 +2816,8 @@ static void setup_per_zone_lowmem_reserv
  */
 static void __setup_per_zone_pages_min(void)
 {
-	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_emerg = var_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
 	unsigned long flags;
@@ -2826,11 +2829,13 @@ static void __setup_per_zone_pages_min(v
 	}
 
 	for_each_zone(zone) {
-		u64 tmp;
+		u64 tmp, tmp_emerg;
 
 		spin_lock_irqsave(&zone->lru_lock, flags);
 		tmp = (u64)pages_min * zone->present_pages;
 		do_div(tmp, lowmem_pages);
+		tmp_emerg = (u64)pages_emerg * zone->present_pages;
+		do_div(tmp_emerg, lowmem_pages);
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ -2849,12 +2854,14 @@ static void __setup_per_zone_pages_min(v
 			if (min_pages > 128)
 				min_pages = 128;
 			zone->pages_min = min_pages;
+			zone->pages_emerg = min_pages;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
 			zone->pages_min = tmp;
+			zone->pages_emerg = tmp_emerg;
 		}
 
 		zone->pages_low   = zone->pages_min + (tmp >> 2);
@@ -2875,6 +2882,33 @@ void setup_per_zone_pages_min(void)
 	spin_unlock_irqrestore(&min_free_lock, flags);
 }
 
+/**
+ *	adjust_memalloc_reserve - adjust the memalloc reserve
+ *	@pages: number of pages to add
+ *
+ *	It adds a number of pages to the memalloc reserve; if
+ *	the number was positive it kicks kswapd into action to
+ *	satisfy the higher watermarks.
+ *
+ *	NOTE: there is only a single caller, hence no locking.
+ */
+void adjust_memalloc_reserve(int pages)
+{
+	var_free_kbytes += pages << (PAGE_SHIFT - 10);
+	BUG_ON(var_free_kbytes < 0);
+	setup_per_zone_pages_min();
+	if (pages > 0) {
+		struct zone *zone;
+		for_each_zone(zone)
+			wakeup_kswapd(zone, 0);
+	}
+	if (pages)
+		printk(KERN_DEBUG "Emergency reserve: %d\n",
+				var_free_kbytes);
+}
+
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
 /*
  * Initialise min_free_kbytes.
  *
Index: linux-2.6-git/mm/vmstat.c
===================================================================
--- linux-2.6-git.orig/mm/vmstat.c	2006-11-30 10:56:35.000000000 +0100
+++ linux-2.6-git/mm/vmstat.c	2006-11-30 11:00:42.000000000 +0100
@@ -535,9 +535,9 @@ static int zoneinfo_show(struct seq_file
 			   "\n        spanned  %lu"
 			   "\n        present  %lu",
 			   zone->free_pages,
-			   zone->pages_min,
-			   zone->pages_low,
-			   zone->pages_high,
+			   zone->pages_emerg + zone->pages_min,
+			   zone->pages_emerg + zone->pages_low,
+			   zone->pages_emerg + zone->pages_high,
 			   zone->nr_active,
 			   zone->nr_inactive,
 			   zone->pages_scanned,
Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h	2006-11-30 10:56:43.000000000 +0100
+++ linux-2.6-git/mm/internal.h	2006-11-30 10:56:43.000000000 +0100
@@ -75,7 +75,9 @@ static int inline gfp_to_alloc_flags(gfp
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_irq() && (p->flags & PF_MEMALLOC))
+		if (gfp_mask & __GFP_EMERGENCY)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_irq() && (p->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (!in_interrupt() &&
 				unlikely(test_thread_flag(TIF_MEMDIE)))
@@ -103,7 +105,7 @@ static inline int alloc_flags_to_rank(in
 	return rank;
 }
 
-static inline int gfp_to_rank(gfp_t gfp_mask)
+static __always_inline int gfp_to_rank(gfp_t gfp_mask)
 {
 	/*
 	 * Although correct this full version takes a ~3% performance
@@ -118,7 +120,9 @@ static inline int gfp_to_rank(gfp_t gfp_
 	 */
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_irq() && (current->flags & PF_MEMALLOC))
+		if (gfp_mask & __GFP_EMERGENCY)
+			return 0;
+		else if (!in_irq() && (current->flags & PF_MEMALLOC))
 			return 0;
 		else if (!in_interrupt() &&
 				unlikely(test_thread_flag(TIF_MEMDIE)))

--



* [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages()
  2006-11-30 10:14 [RFC][PATCH 0/6] VM deadlock avoidance -v9 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2006-11-30 10:14 ` [RFC][PATCH 4/6] mm: emergency pool and __GFP_EMERGENCY Peter Zijlstra
@ 2006-11-30 10:14 ` Peter Zijlstra
  2006-11-30 18:55   ` Christoph Lameter
  2006-11-30 10:14 ` [RFC][PATCH 6/6] net: vm deadlock avoidance core Peter Zijlstra
  5 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2006-11-30 10:14 UTC (permalink / raw)
  To: netdev, linux-mm; +Cc: David Miller, Peter Zijlstra

[-- Attachment #1: kmem_cache_objs_to_pages.patch --]
[-- Type: text/plain, Size: 1625 bytes --]

Provide a method to calculate the number of pages used by a given number of
slab objects.
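
A short usage sketch (not part of the patch; the wrapper and the cache
figures are examples only). Patch 6/6 uses this helper to translate
ip_rt_max_size into a page reserve for the route cache:

	#include <linux/slab.h>

	static unsigned int route_reserve_pages(struct kmem_cache *cachep,
						int max_objects)
	{
		/*
		 * Rounds up to whole slabs: a cache packing 8 objects into
		 * an order-1 (2 page) slab needs ceil(20 / 8) = 3 slabs,
		 * i.e. 6 pages, for 20 objects.
		 */
		return kmem_cache_objs_to_pages(cachep, max_objects);
	}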

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/slab.h |    1 +
 mm/slab.c            |    6 ++++++
 2 files changed, 7 insertions(+)

Index: linux-2.6-git/include/linux/slab.h
===================================================================
--- linux-2.6-git.orig/include/linux/slab.h	2006-11-27 10:46:44.000000000 +0100
+++ linux-2.6-git/include/linux/slab.h	2006-11-27 10:47:33.000000000 +0100
@@ -209,6 +209,7 @@ static inline void *kcalloc(size_t n, si
 extern void kfree(const void *);
 extern unsigned int ksize(const void *);
 extern int slab_is_available(void);
+extern unsigned int kmem_cache_objs_to_pages(struct kmem_cache *, int);
 
 #ifdef CONFIG_NUMA
 extern void *kmem_cache_alloc_node(kmem_cache_t *, gfp_t flags, int node);
Index: linux-2.6-git/mm/slab.c
===================================================================
--- linux-2.6-git.orig/mm/slab.c	2006-11-27 10:47:26.000000000 +0100
+++ linux-2.6-git/mm/slab.c	2006-11-27 10:47:54.000000000 +0100
@@ -4279,3 +4279,9 @@ unsigned int ksize(const void *objp)
 
 	return obj_size(virt_to_cache(objp));
 }
+
+unsigned int kmem_cache_objs_to_pages(struct kmem_cache *cachep, int nr)
+{
+	return ((nr + cachep->num - 1) / cachep->num) << cachep->gfporder;
+}
+EXPORT_SYMBOL_GPL(kmem_cache_objs_to_pages);

--



* [RFC][PATCH 6/6] net: vm deadlock avoidance core
  2006-11-30 10:14 [RFC][PATCH 0/6] VM deadlock avoidance -v9 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2006-11-30 10:14 ` [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages() Peter Zijlstra
@ 2006-11-30 10:14 ` Peter Zijlstra
  2006-11-30 12:04   ` Peter Zijlstra
  5 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2006-11-30 10:14 UTC (permalink / raw)
  To: netdev, linux-mm; +Cc: David Miller, Peter Zijlstra

[-- Attachment #1: vm_deadlock_core.patch --]
[-- Type: text/plain, Size: 25037 bytes --]

In order to provide robust networked block devices, there must be a guarantee
of progress. That is, the block device must never stall because of (physical)
OOM, since the device itself might be needed to get out of it (reclaim).

This means that the device queue must always be unpluggable; this in turn means
that it must always find enough memory to build/send packets over the network
_and_ to receive (level 7) ACKs for those packets.

The network stack has a huge capacity for buffering packets, waiting for
user-space to read them. A practical limit is imposed to avoid DoS
scenarios. These two things make for a deadlock: what if the receive limit is
reached and all packets are buffered in non-critical sockets (those not serving
the network block device that is waiting for an ACK to free a page)?

Memory pressure adds to that: what if there is simply no memory left to
receive packets in?

This patch provides a service to register sockets as critical; SOCK_VMIO
is a promise that the socket will never block on receive. Together with a
memory reserve that can service a limited number of packets, this guarantees
a limited service to these critical sockets.

When we make sure that packets allocated from the reserve only service
critical sockets, we do not lose the reserved memory and can guarantee
progress.

(A note on the name SOCK_VMIO: the basic problem is a circular dependency
between the network and virtual memory subsystems which needs to be broken.
This does make VM network IO - and only VM network IO - special; it does not
generalize.)
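
As a sketch of how a user of this interface might look (the nbd-style names
and the decision to reserve exactly TX_RESERVE_PAGES are assumptions for
illustration; the helpers are the ones added below):

	#include <net/sock.h>

	static int nbd_mark_socket_critical(struct sock *sk)
	{
		/* reserve a bounded amount of TX memory for this device */
		sk_adjust_memalloc(0, TX_RESERVE_PAGES);

		/*
		 * Flag the socket so emergency RX skbs are delivered to it;
		 * this also sets __GFP_EMERGENCY in sk->sk_allocation and,
		 * for the first such socket, installs the global RX reserve.
		 */
		return sk_set_vmio(sk);
	}

	static void nbd_unmark_socket_critical(struct sock *sk)
	{
		sk_clear_vmio(sk);
		sk_adjust_memalloc(0, -TX_RESERVE_PAGES);
	}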

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/skbuff.h     |   13 +++-
 include/net/sock.h         |   36 +++++++++++++
 net/core/dev.c             |   40 +++++++++++++-
 net/core/skbuff.c          |   51 ++++++++++++++++--
 net/core/sock.c            |  121 +++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/ip_fragment.c     |    1 
 net/ipv4/ipmr.c            |    4 +
 net/ipv4/route.c           |   15 +++++
 net/ipv4/sysctl_net_ipv4.c |   14 ++++-
 net/ipv4/tcp_ipv4.c        |    9 +++
 net/ipv6/reassembly.c      |    1 
 net/ipv6/route.c           |   15 +++++
 net/ipv6/sysctl_net_ipv6.c |    6 +-
 net/netfilter/core.c       |    5 +
 security/selinux/avc.c     |    2 
 15 files changed, 312 insertions(+), 21 deletions(-)

Index: linux-2.6-git/include/linux/skbuff.h
===================================================================
--- linux-2.6-git.orig/include/linux/skbuff.h	2006-11-29 16:15:54.000000000 +0100
+++ linux-2.6-git/include/linux/skbuff.h	2006-11-29 17:20:21.000000000 +0100
@@ -283,7 +283,8 @@ struct sk_buff {
 				nfctinfo:3;
 	__u8			pkt_type:3,
 				fclone:2,
-				ipvs_property:1;
+				ipvs_property:1,
+				emergency:1;
 	__be16			protocol;
 
 	void			(*destructor)(struct sk_buff *skb);
@@ -328,10 +329,13 @@ struct sk_buff {
 
 #include <asm/system.h>
 
+#define SKB_ALLOC_FCLONE	0x01
+#define SKB_ALLOC_RX		0x02
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-				   gfp_t priority, int fclone);
+				   gfp_t priority, int flags);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
@@ -341,7 +345,7 @@ static inline struct sk_buff *alloc_skb(
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1);
+	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE);
 }
 
 extern struct sk_buff *alloc_skb_from_cache(kmem_cache_t *cp,
@@ -1102,7 +1106,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
 					      gfp_t gfp_mask)
 {
-	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	struct sk_buff *skb =
+		__alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX);
 	if (likely(skb))
 		skb_reserve(skb, NET_SKB_PAD);
 	return skb;
Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h	2006-11-29 16:21:27.000000000 +0100
+++ linux-2.6-git/include/net/sock.h	2006-11-29 17:20:21.000000000 +0100
@@ -391,6 +391,7 @@ enum sock_flags {
 	SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_VMIO, /* the VM depends on us - make sure we're serviced */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -413,6 +414,40 @@ static inline int sock_flag(struct sock 
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int sk_has_vmio(struct sock *sk)
+{
+	return sock_flag(sk, SOCK_VMIO);
+}
+
+#define MAX_PAGES_PER_SKB 3
+#define MAX_FRAGMENTS ((65536 + 1500 - 1) / 1500)
+/*
+ * Guestimate the per request queue TX upper bound.
+ */
+#define TX_RESERVE_PAGES \
+	(4 * MAX_FRAGMENTS * MAX_PAGES_PER_SKB)
+
+extern atomic_t vmio_socks;
+extern atomic_t emergency_rx_skbs;
+
+static inline int sk_vmio_socks(void)
+{
+	return atomic_read(&vmio_socks);
+}
+
+extern int sk_emergency_skb_get(void);
+
+static inline void sk_emergency_skb_put(void)
+{
+	return atomic_dec(&emergency_rx_skbs);
+}
+
+extern void sk_adjust_memalloc(int socks, int tx_reserve_pages);
+extern void ipfrag_add_memory(int ipfrag_reserve);
+extern void iprt_add_memory(int rt_reserve);
+extern int sk_set_vmio(struct sock *sk);
+extern int sk_clear_vmio(struct sock *sk);
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	sk->sk_ack_backlog--;
@@ -720,6 +755,7 @@ static inline void sk_stream_writequeue_
 
 static inline int sk_stream_rmem_schedule(struct sock *sk, struct sk_buff *skb)
 {
+	// XXX: skb->emergency stuff
 	return (int)skb->truesize <= sk->sk_forward_alloc ||
 		sk_stream_mem_schedule(sk, skb->truesize, 1);
 }
Index: linux-2.6-git/net/core/dev.c
===================================================================
--- linux-2.6-git.orig/net/core/dev.c	2006-11-29 16:15:53.000000000 +0100
+++ linux-2.6-git/net/core/dev.c	2006-11-29 17:20:21.000000000 +0100
@@ -1768,10 +1768,23 @@ int netif_receive_skb(struct sk_buff *sk
 	struct net_device *orig_dev;
 	int ret = NET_RX_DROP;
 	unsigned short type;
+	unsigned long pflags = current->flags;
+
+	/* Emergency skb are special, they should
+	 *  - be delivered to SOCK_VMIO sockets only
+	 *  - stay away from userspace
+	 *  - have bounded memory usage
+	 *
+	 * Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
+	 * This saves us from propagating the allocation context down to all
+	 * allocation sites.
+	 */
+	if (unlikely(skb->emergency))
+		current->flags |= PF_MEMALLOC;
 
 	/* if we've gotten here through NAPI, check netpoll */
 	if (skb->dev->poll && netpoll_rx(skb))
-		return NET_RX_DROP;
+		goto out;
 
 	if (!skb->tstamp.off_sec)
 		net_timestamp(skb);
@@ -1782,7 +1795,7 @@ int netif_receive_skb(struct sk_buff *sk
 	orig_dev = skb_bond(skb);
 
 	if (!orig_dev)
-		return NET_RX_DROP;
+		goto out;
 
 	__get_cpu_var(netdev_rx_stat).total++;
 
@@ -1799,6 +1812,8 @@ int netif_receive_skb(struct sk_buff *sk
 		goto ncls;
 	}
 #endif
+	if (unlikely(skb->emergency))
+		goto skip_taps;
 
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
 		if (!ptype->dev || ptype->dev == skb->dev) {
@@ -1808,6 +1823,7 @@ int netif_receive_skb(struct sk_buff *sk
 		}
 	}
 
+skip_taps:
 #ifdef CONFIG_NET_CLS_ACT
 	if (pt_prev) {
 		ret = deliver_skb(skb, pt_prev, orig_dev);
@@ -1820,17 +1836,28 @@ int netif_receive_skb(struct sk_buff *sk
 
 	if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
 		kfree_skb(skb);
-		goto out;
+		goto unlock;
 	}
 
 	skb->tc_verd = 0;
 ncls:
 #endif
 
+	if (unlikely(skb->emergency))
+		switch(skb->protocol) {
+			case __constant_htons(ETH_P_ARP):
+			case __constant_htons(ETH_P_IP):
+			case __constant_htons(ETH_P_IPV6):
+				break;
+
+			default:
+				goto drop;
+		}
+
 	handle_diverter(skb);
 
 	if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
-		goto out;
+		goto unlock;
 
 	type = skb->protocol;
 	list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
@@ -1845,6 +1872,7 @@ ncls:
 	if (pt_prev) {
 		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 	} else {
+drop:
 		kfree_skb(skb);
 		/* Jamal, now you will not able to escape explaining
 		 * me how you were going to use this. :-)
@@ -1852,8 +1880,10 @@ ncls:
 		ret = NET_RX_DROP;
 	}
 
-out:
+unlock:
 	rcu_read_unlock();
+out:
+	current->flags = pflags;
 	return ret;
 }
 
Index: linux-2.6-git/net/core/skbuff.c
===================================================================
--- linux-2.6-git.orig/net/core/skbuff.c	2006-11-29 16:15:53.000000000 +0100
+++ linux-2.6-git/net/core/skbuff.c	2006-11-29 17:20:21.000000000 +0100
@@ -139,29 +139,34 @@ EXPORT_SYMBOL(skb_truesize_bug);
  *	Buffers may only be allocated from interrupts using a @gfp_mask of
  *	%GFP_ATOMIC.
  */
-struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-			    int fclone)
+struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int flags)
 {
 	kmem_cache_t *cache;
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
+	int emergency = 0;
 
-	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+	size = SKB_DATA_ALIGN(size);
+	cache = (flags & SKB_ALLOC_FCLONE)
+		? skbuff_fclone_cache : skbuff_head_cache;
+	if (flags & SKB_ALLOC_RX)
+		gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN;
 
+retry_alloc:
 	/* Get the HEAD */
 	skb = kmem_cache_alloc(cache, gfp_mask & ~__GFP_DMA);
 	if (!skb)
-		goto out;
+		goto noskb;
 
 	/* Get the DATA. Size must match skb_add_mtu(). */
-	size = SKB_DATA_ALIGN(size);
 	data = kmalloc_track_caller(size + sizeof(struct skb_shared_info),
 			gfp_mask);
 	if (!data)
 		goto nodata;
 
 	memset(skb, 0, offsetof(struct sk_buff, truesize));
+	skb->emergency = emergency;
 	skb->truesize = size + sizeof(struct sk_buff);
 	atomic_set(&skb->users, 1);
 	skb->head = data;
@@ -178,7 +183,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	shinfo->ip6_frag_id = 0;
 	shinfo->frag_list = NULL;
 
-	if (fclone) {
+	if (flags & SKB_ALLOC_FCLONE) {
 		struct sk_buff *child = skb + 1;
 		atomic_t *fclone_ref = (atomic_t *) (child + 1);
 
@@ -186,12 +191,29 @@ struct sk_buff *__alloc_skb(unsigned int
 		atomic_set(fclone_ref, 1);
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
+		child->emergency = skb->emergency;
 	}
 out:
 	return skb;
+
 nodata:
 	kmem_cache_free(cache, skb);
 	skb = NULL;
+noskb:
+	/* Attempt emergency allocation when RX skb. */
+	if (likely(!(flags & SKB_ALLOC_RX) || !sk_vmio_socks()))
+		goto out;
+
+	if (!emergency) {
+		if (sk_emergency_skb_get()) {
+			gfp_mask &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+			gfp_mask |= __GFP_EMERGENCY;
+			emergency = 1;
+			goto retry_alloc;
+		}
+	} else
+		sk_emergency_skb_put();
+
 	goto out;
 }
 
@@ -268,7 +290,7 @@ struct sk_buff *__netdev_alloc_skb(struc
 {
 	struct sk_buff *skb;
 
-	skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -317,6 +339,8 @@ static void skb_release_data(struct sk_b
 			skb_drop_fraglist(skb);
 
 		kfree(skb->head);
+		if (unlikely(skb->emergency))
+			sk_emergency_skb_put();
 	}
 }
 
@@ -437,6 +461,9 @@ struct sk_buff *skb_clone(struct sk_buff
 		n->fclone = SKB_FCLONE_CLONE;
 		atomic_inc(fclone_ref);
 	} else {
+		if (unlikely(skb->emergency))
+			gfp_mask |= __GFP_EMERGENCY;
+
 		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
 		if (!n)
 			return NULL;
@@ -471,6 +498,7 @@ struct sk_buff *skb_clone(struct sk_buff
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	C(ipvs_property);
 #endif
+	C(emergency);
 	C(protocol);
 	n->destructor = NULL;
 #ifdef CONFIG_NETFILTER
@@ -686,12 +714,19 @@ int pskb_expand_head(struct sk_buff *skb
 	u8 *data;
 	int size = nhead + (skb->end - skb->head) + ntail;
 	long off;
+	int emergency = 0;
 
 	if (skb_shared(skb))
 		BUG();
 
 	size = SKB_DATA_ALIGN(size);
 
+	if (unlikely(skb->emergency) && sk_emergency_skb_get()) {
+		gfp_mask |= __GFP_EMERGENCY;
+		emergency = 1;
+	} else
+		gfp_mask |= __GFP_NOMEMALLOC;
+
 	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;
@@ -724,6 +759,8 @@ int pskb_expand_head(struct sk_buff *skb
 	return 0;
 
 nodata:
+	if (unlikely(emergency))
+		sk_emergency_skb_put();
 	return -ENOMEM;
 }
 
Index: linux-2.6-git/net/core/sock.c
===================================================================
--- linux-2.6-git.orig/net/core/sock.c	2006-11-29 16:15:53.000000000 +0100
+++ linux-2.6-git/net/core/sock.c	2006-11-29 17:20:21.000000000 +0100
@@ -195,6 +195,120 @@ __u32 sysctl_rmem_default __read_mostly 
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 
+static DEFINE_SPINLOCK(memalloc_lock);
+static int rx_net_reserve;
+
+atomic_t vmio_socks;
+atomic_t emergency_rx_skbs;
+
+static int ipfrag_threshold;
+
+#define ipfrag_mtu()	(1500) /* XXX: should be smallest mtu system wide */
+#define ipfrag_skbs()	(ipfrag_threshold / ipfrag_mtu())
+#define ipfrag_pages()	(ipfrag_threshold / (ipfrag_mtu() * (PAGE_SIZE / ipfrag_mtu())))
+
+static int iprt_pages;
+
+/*
+ * is there room for another emergency skb.
+ */
+int sk_emergency_skb_get(void)
+{
+	int nr = atomic_add_return(1, &emergency_rx_skbs);
+	int thresh = (3 * ipfrag_skbs()) / 2;
+	if (nr < thresh)
+		return 1;
+
+	atomic_dec(&emergency_rx_skbs);
+	return 0;
+}
+
+/**
+ *	sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ *	@socks: number of new %SOCK_VMIO sockets
+ *	@tx_reserve_pages: number of pages to (un)reserve for TX
+ *
+ *	This function adjusts the memalloc reserve based on system demand.
+ *	The RX reserve is a limit, and only added once, not for each socket.
+ *
+ *	NOTE:
+ *	   @tx_reserve_pages is an upper-bound of memory used for TX hence
+ *	   we need not account the pages like we do for RX pages.
+ */
+void sk_adjust_memalloc(int socks, int tx_reserve_pages)
+{
+	unsigned long flags;
+	int reserve = tx_reserve_pages;
+	int nr_socks;
+
+	spin_lock_irqsave(&memalloc_lock, flags);
+	nr_socks = atomic_add_return(socks, &vmio_socks);
+	BUG_ON(nr_socks < 0);
+
+	if (nr_socks) {
+		int rx_pages = 2 * ipfrag_pages() + iprt_pages;
+		reserve += rx_pages - rx_net_reserve;
+		rx_net_reserve = rx_pages;
+	} else {
+		reserve -= rx_net_reserve;
+		rx_net_reserve = 0;
+	}
+
+	if (reserve)
+		adjust_memalloc_reserve(reserve);
+	spin_unlock_irqrestore(&memalloc_lock, flags);
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
+
+/*
+ * tiny helper function to track the total ipfragment memory
+ * needed because of modular ipv6
+ */
+void ipfrag_add_memory(int frags)
+{
+	ipfrag_threshold += frags;
+	sk_adjust_memalloc(0, 0);
+}
+EXPORT_SYMBOL_GPL(ipfrag_add_memory);
+
+void iprt_add_memory(int pages)
+{
+	iprt_pages += pages;
+	sk_adjust_memalloc(0, 0);
+}
+EXPORT_SYMBOL_GPL(iprt_add_memory);
+
+/**
+ *	sk_set_vmio - sets %SOCK_VMIO
+ *	@sk: socket to set it on
+ *
+ *	Set %SOCK_VMIO on a socket and increase the memalloc reserve
+ *	accordingly.
+ */
+int sk_set_vmio(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_VMIO);
+	if (!set) {
+		sk_adjust_memalloc(1, 0);
+		sock_set_flag(sk, SOCK_VMIO);
+		sk->sk_allocation |= __GFP_EMERGENCY;
+	}
+	return !set;
+}
+EXPORT_SYMBOL_GPL(sk_set_vmio);
+
+int sk_clear_vmio(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_VMIO);
+	if (set) {
+		sk_adjust_memalloc(-1, 0);
+		sock_reset_flag(sk, SOCK_VMIO);
+		sk->sk_allocation &= ~__GFP_EMERGENCY;
+	}
+	return set;
+}
+EXPORT_SYMBOL_GPL(sk_clear_vmio);
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -238,6 +352,12 @@ int sock_queue_rcv_skb(struct sock *sk, 
 	int err = 0;
 	int skb_len;
 
+	if (unlikely(skb->emergency)) {
+		if (!sk_has_vmio(sk)) {
+			err = -ENOMEM;
+			goto out;
+		}
+	} else
 	/* Cast skb->rcvbuf to unsigned... It's pointless, but reduces
 	   number of warnings when compiling with -W --ANK
 	 */
@@ -877,6 +997,7 @@ void sk_free(struct sock *sk)
 	struct sk_filter *filter;
 	struct module *owner = sk->sk_prot_creator->owner;
 
+	sk_clear_vmio(sk);
 	if (sk->sk_destruct)
 		sk->sk_destruct(sk);
 
Index: linux-2.6-git/net/ipv4/ipmr.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/ipmr.c	2006-11-29 16:15:53.000000000 +0100
+++ linux-2.6-git/net/ipv4/ipmr.c	2006-11-29 17:20:21.000000000 +0100
@@ -1340,6 +1340,9 @@ int ip_mr_input(struct sk_buff *skb)
 	struct mfc_cache *cache;
 	int local = ((struct rtable*)skb->dst)->rt_flags&RTCF_LOCAL;
 
+	if (unlikely(skb->emergency))
+		goto drop;
+
 	/* Packet is looped back after forward, it should not be
 	   forwarded second time, but still can be delivered locally.
 	 */
@@ -1411,6 +1414,7 @@ int ip_mr_input(struct sk_buff *skb)
 dont_forward:
 	if (local)
 		return ip_local_deliver(skb);
+drop:
 	kfree_skb(skb);
 	return 0;
 }
Index: linux-2.6-git/net/ipv4/sysctl_net_ipv4.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/sysctl_net_ipv4.c	2006-11-29 16:15:53.000000000 +0100
+++ linux-2.6-git/net/ipv4/sysctl_net_ipv4.c	2006-11-29 17:20:21.000000000 +0100
@@ -18,6 +18,7 @@
 #include <net/route.h>
 #include <net/tcp.h>
 #include <net/cipso_ipv4.h>
+#include <net/sock.h>
 
 /* From af_inet.c */
 extern int sysctl_ip_nonlocal_bind;
@@ -129,6 +130,17 @@ static int sysctl_tcp_congestion_control
 	return ret;
 }
 
+int proc_dointvec_fragment(ctl_table *table, int write, struct file *filp,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int old_thresh = *(int *)table->data;
+	ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+	ipfrag_add_memory(*(int *)table->data - old_thresh);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(proc_dointvec_fragment);
+
 ctl_table ipv4_table[] = {
         {
 		.ctl_name	= NET_IPV4_TCP_TIMESTAMPS,
@@ -234,7 +246,7 @@ ctl_table ipv4_table[] = {
 		.data		= &sysctl_ipfrag_high_thresh,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec
+		.proc_handler	= &proc_dointvec_fragment
 	},
 	{
 		.ctl_name	= NET_IPV4_IPFRAG_LOW_THRESH,
Index: linux-2.6-git/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/tcp_ipv4.c	2006-11-29 16:15:53.000000000 +0100
+++ linux-2.6-git/net/ipv4/tcp_ipv4.c	2006-11-29 17:20:21.000000000 +0100
@@ -1096,6 +1096,15 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	if (!sk)
 		goto no_tcp_socket;
 
+	if (unlikely(skb->emergency)) {
+	       	if (!sk_has_vmio(sk))
+			goto discard_and_relse;
+		/*
+		   decrease window size..
+		   tcp_enter_quickack_mode(sk);
+		*/
+	}
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;
Index: linux-2.6-git/net/ipv6/sysctl_net_ipv6.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/sysctl_net_ipv6.c	2006-11-29 16:15:53.000000000 +0100
+++ linux-2.6-git/net/ipv6/sysctl_net_ipv6.c	2006-11-29 17:20:21.000000000 +0100
@@ -15,6 +15,10 @@
 
 #ifdef CONFIG_SYSCTL
 
+extern int proc_dointvec_fragment(ctl_table *table, int write,
+	       	struct file *filp, void __user *buffer, size_t *lenp,
+	       	loff_t *ppos);
+
 static ctl_table ipv6_table[] = {
 	{
 		.ctl_name	= NET_IPV6_ROUTE,
@@ -44,7 +48,7 @@ static ctl_table ipv6_table[] = {
 		.data		= &sysctl_ip6frag_high_thresh,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec
+		.proc_handler	= &proc_dointvec_fragment
 	},
 	{
 		.ctl_name	= NET_IPV6_IP6FRAG_LOW_THRESH,
Index: linux-2.6-git/net/netfilter/core.c
===================================================================
--- linux-2.6-git.orig/net/netfilter/core.c	2006-11-29 16:15:54.000000000 +0100
+++ linux-2.6-git/net/netfilter/core.c	2006-11-29 17:20:21.000000000 +0100
@@ -181,6 +181,11 @@ next_hook:
 		kfree_skb(*pskb);
 		ret = -EPERM;
 	} else if ((verdict & NF_VERDICT_MASK)  == NF_QUEUE) {
+		if (unlikely((*pskb)->emergency)) {
+			printk(KERN_ERR "nf_hook: NF_QUEUE encountered for "
+					"emergency skb - skipping rule.\n");
+			goto next_hook;
+		}
 		NFDEBUG("nf_hook: Verdict = QUEUE.\n");
 		if (!nf_queue(*pskb, elem, pf, hook, indev, outdev, okfn,
 			      verdict >> NF_VERDICT_BITS))
Index: linux-2.6-git/security/selinux/avc.c
===================================================================
--- linux-2.6-git.orig/security/selinux/avc.c	2006-11-29 16:15:54.000000000 +0100
+++ linux-2.6-git/security/selinux/avc.c	2006-11-29 17:20:21.000000000 +0100
@@ -333,7 +333,7 @@ static struct avc_node *avc_alloc_node(v
 {
 	struct avc_node *node;
 
-	node = kmem_cache_alloc(avc_node_cachep, SLAB_ATOMIC);
+	node = kmem_cache_alloc(avc_node_cachep, SLAB_ATOMIC | __GFP_NOMEMALLOC);
 	if (!node)
 		goto out;
 
Index: linux-2.6-git/net/ipv4/ip_fragment.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/ip_fragment.c	2006-11-29 16:15:53.000000000 +0100
+++ linux-2.6-git/net/ipv4/ip_fragment.c	2006-11-29 17:20:21.000000000 +0100
@@ -743,6 +743,7 @@ void ipfrag_init(void)
 	ipfrag_secret_timer.function = ipfrag_secret_rebuild;
 	ipfrag_secret_timer.expires = jiffies + sysctl_ipfrag_secret_interval;
 	add_timer(&ipfrag_secret_timer);
+	ipfrag_add_memory(sysctl_ipfrag_high_thresh);
 }
 
 EXPORT_SYMBOL(ip_defrag);
Index: linux-2.6-git/net/ipv6/reassembly.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/reassembly.c	2006-11-29 16:15:54.000000000 +0100
+++ linux-2.6-git/net/ipv6/reassembly.c	2006-11-29 17:20:21.000000000 +0100
@@ -759,4 +759,5 @@ void __init ipv6_frag_init(void)
 	ip6_frag_secret_timer.function = ip6_frag_secret_rebuild;
 	ip6_frag_secret_timer.expires = jiffies + sysctl_ip6frag_secret_interval;
 	add_timer(&ip6_frag_secret_timer);
+	ipfrag_add_memory(sysctl_ip6frag_high_thresh);
 }
Index: linux-2.6-git/net/ipv4/route.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/route.c	2006-11-29 16:15:53.000000000 +0100
+++ linux-2.6-git/net/ipv4/route.c	2006-11-29 17:20:21.000000000 +0100
@@ -2906,6 +2906,17 @@ static int ipv4_sysctl_rtcache_flush_str
 	return 0;
 }
 
+static int proc_dointvec_rt_size(ctl_table *table, int write, struct file *filp,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int old = *(int *)table->data;
+	ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+	iprt_add_memory(kmem_cache_objs_to_pages(ipv4_dst_ops.kmem_cachep,
+				*(int *)table->data - old));
+	return ret;
+}
+
 ctl_table ipv4_route_table[] = {
         {
 		.ctl_name 	= NET_IPV4_ROUTE_FLUSH,
@@ -2948,7 +2959,7 @@ ctl_table ipv4_route_table[] = {
 		.data		= &ip_rt_max_size,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_dointvec_rt_size,
 	},
 	{
 		/*  Deprecated. Use gc_min_interval_ms */
@@ -3175,6 +3186,8 @@ int __init ip_rt_init(void)
 
 	ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
 	ip_rt_max_size = (rt_hash_mask + 1) * 16;
+	iprt_add_memory(kmem_cache_objs_to_pages(ipv4_dst_ops.kmem_cachep,
+				ip_rt_max_size));
 
 	devinet_init();
 	ip_fib_init();
Index: linux-2.6-git/net/ipv6/route.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/route.c	2006-11-29 16:21:27.000000000 +0100
+++ linux-2.6-git/net/ipv6/route.c	2006-11-29 17:20:21.000000000 +0100
@@ -2365,6 +2365,17 @@ int ipv6_sysctl_rtcache_flush(ctl_table 
 		return -EINVAL;
 }
 
+static int proc_dointvec_rt_size(ctl_table *table, int write, struct file *filp,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int old = *(int *)table->data;
+	ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+	iprt_add_memory(kmem_cache_objs_to_pages(ip6_dst_ops.kmem_cachep,
+				*(int *)table->data - old));
+	return ret;
+}
+
 ctl_table ipv6_route_table[] = {
         {
 		.ctl_name	=	NET_IPV6_ROUTE_FLUSH, 
@@ -2388,7 +2399,7 @@ ctl_table ipv6_route_table[] = {
          	.data		=	&ip6_rt_max_size,
 		.maxlen		=	sizeof(int),
 		.mode		=	0644,
-         	.proc_handler	=	&proc_dointvec,
+         	.proc_handler	=	&proc_dointvec_rt_size,
 	},
 	{
 		.ctl_name	=	NET_IPV6_ROUTE_GC_MIN_INTERVAL,
@@ -2473,6 +2484,8 @@ void __init ip6_route_init(void)
 
 	proc_net_fops_create("rt6_stats", S_IRUGO, &rt6_stats_seq_fops);
 #endif
+	iprt_add_memory(kmem_cache_objs_to_pages(ip6_dst_ops.kmem_cachep,
+				ip6_rt_max_size));
 #ifdef CONFIG_XFRM
 	xfrm6_init();
 #endif

--


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 6/6] net: vm deadlock avoidance core
  2006-11-30 10:14 ` [RFC][PATCH 6/6] net: vm deadlock avoidance core Peter Zijlstra
@ 2006-11-30 12:04   ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2006-11-30 12:04 UTC (permalink / raw)
  To: netdev; +Cc: linux-mm, David Miller

Oops, it seems I missed some chunks.
New patch attached.

---
Subject: net: vm deadlock avoidance core

In order to provide robust networked block devices, there must be a guarantee
of progress. That is, the block device must never stall because of (physical)
OOM, since the device itself might be needed to get out of it (reclaim).

This means that the device queue must always be unpluggable; this in turn means
that it must always find enough memory to build/send packets over the network
_and_ receive (level 7) ACKs for those packets.

The network stack has a huge capacity for buffering packets while waiting for
user-space to read them. A practical limit is imposed on this to avoid DoS
scenarios. These two things make for a deadlock: what if the receive limit is
reached and all packets are buffered in non-critical sockets (those not serving
the network block device that is waiting for an ACK to free a page)?

Memory pressure adds to that: what if there is simply no memory left to
receive packets into?

This patch provides a service to register sockets as critical; SOCK_VMIO
is a promise that the socket will never block on receive. Along with a memory
reserve that will service a limited number of packets, this can guarantee a
limited level of service to these critical sockets.

When we make sure that packets allocated from the reserve will only service
critical sockets, we will not lose the memory and can guarantee progress.

(A note on the name SOCK_VMIO: the basic problem is a circular dependency between
the network and virtual memory subsystems which needs to be broken. This does
make VM network IO - and only VM network IO - special; it does not generalize.)
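
For illustration only (not part of the patch): a rough sketch of how an
in-kernel user such as a network block device might use this interface.
The driver context, the error code and the explicit TX reservation are
assumptions, not prescribed usage; only sk_set_vmio(), sk_clear_vmio(),
sk_adjust_memalloc() and TX_RESERVE_PAGES come from the patch below.

static int example_mark_socket_vmio(struct socket *sock)
{
	struct sock *sk = sock->sk;

	if (!sk_set_vmio(sk))		/* was already SOCK_VMIO */
		return -EALREADY;

	/* reserve an upper bound of TX pages for this request queue */
	sk_adjust_memalloc(0, TX_RESERVE_PAGES);
	return 0;
}

static void example_unmark_socket_vmio(struct socket *sock)
{
	sk_adjust_memalloc(0, -TX_RESERVE_PAGES);
	sk_clear_vmio(sock->sk);
}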

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/skbuff.h     |   13 +++-
 include/net/sock.h         |   36 +++++++++++++
 net/core/dev.c             |   40 +++++++++++++-
 net/core/skbuff.c          |   51 ++++++++++++++++--
 net/core/sock.c            |  121 +++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/ip_fragment.c     |    1 
 net/ipv4/ipmr.c            |    4 +
 net/ipv4/route.c           |   15 +++++
 net/ipv4/sysctl_net_ipv4.c |   14 ++++-
 net/ipv4/tcp_ipv4.c        |   27 +++++++++-
 net/ipv6/reassembly.c      |    1 
 net/ipv6/route.c           |   15 +++++
 net/ipv6/sysctl_net_ipv6.c |    6 +-
 net/ipv6/tcp_ipv6.c        |   27 +++++++++-
 net/netfilter/core.c       |    5 +
 security/selinux/avc.c     |    2 
 16 files changed, 355 insertions(+), 23 deletions(-)

Index: linux-2.6-git/include/linux/skbuff.h
===================================================================
--- linux-2.6-git.orig/include/linux/skbuff.h	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/include/linux/skbuff.h	2006-11-30 11:37:51.000000000 +0100
@@ -283,7 +283,8 @@ struct sk_buff {
 				nfctinfo:3;
 	__u8			pkt_type:3,
 				fclone:2,
-				ipvs_property:1;
+				ipvs_property:1,
+				emergency:1;
 	__be16			protocol;
 
 	void			(*destructor)(struct sk_buff *skb);
@@ -328,10 +329,13 @@ struct sk_buff {
 
 #include <asm/system.h>
 
+#define SKB_ALLOC_FCLONE	0x01
+#define SKB_ALLOC_RX		0x02
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-				   gfp_t priority, int fclone);
+				   gfp_t priority, int flags);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
@@ -341,7 +345,7 @@ static inline struct sk_buff *alloc_skb(
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1);
+	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE);
 }
 
 extern struct sk_buff *alloc_skb_from_cache(kmem_cache_t *cp,
@@ -1102,7 +1106,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
 					      gfp_t gfp_mask)
 {
-	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	struct sk_buff *skb =
+		__alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX);
 	if (likely(skb))
 		skb_reserve(skb, NET_SKB_PAD);
 	return skb;
Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/include/net/sock.h	2006-11-30 11:37:51.000000000 +0100
@@ -391,6 +391,7 @@ enum sock_flags {
 	SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_VMIO, /* the VM depends on us - make sure we're serviced */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -413,6 +414,40 @@ static inline int sock_flag(struct sock 
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int sk_has_vmio(struct sock *sk)
+{
+	return sock_flag(sk, SOCK_VMIO);
+}
+
+#define MAX_PAGES_PER_SKB 3
+#define MAX_FRAGMENTS ((65536 + 1500 - 1) / 1500)
+/*
+ * Guestimate the per request queue TX upper bound.
+ */
+#define TX_RESERVE_PAGES \
+	(4 * MAX_FRAGMENTS * MAX_PAGES_PER_SKB)
+
+extern atomic_t vmio_socks;
+extern atomic_t emergency_rx_skbs;
+
+static inline int sk_vmio_socks(void)
+{
+	return atomic_read(&vmio_socks);
+}
+
+extern int sk_emergency_skb_get(void);
+
+static inline void sk_emergency_skb_put(void)
+{
+	return atomic_dec(&emergency_rx_skbs);
+}
+
+extern void sk_adjust_memalloc(int socks, int tx_reserve_pages);
+extern void ipfrag_add_memory(int ipfrag_reserve);
+extern void iprt_add_memory(int rt_reserve);
+extern int sk_set_vmio(struct sock *sk);
+extern int sk_clear_vmio(struct sock *sk);
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	sk->sk_ack_backlog--;
@@ -720,6 +755,7 @@ static inline void sk_stream_writequeue_
 
 static inline int sk_stream_rmem_schedule(struct sock *sk, struct sk_buff *skb)
 {
+	// XXX: skb->emergency stuff
 	return (int)skb->truesize <= sk->sk_forward_alloc ||
 		sk_stream_mem_schedule(sk, skb->truesize, 1);
 }
Index: linux-2.6-git/net/core/dev.c
===================================================================
--- linux-2.6-git.orig/net/core/dev.c	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/net/core/dev.c	2006-11-30 11:37:51.000000000 +0100
@@ -1768,10 +1768,23 @@ int netif_receive_skb(struct sk_buff *sk
 	struct net_device *orig_dev;
 	int ret = NET_RX_DROP;
 	unsigned short type;
+	unsigned long pflags = current->flags;
+
+	/* Emergency skb are special, they should
+	 *  - be delivered to SOCK_VMIO sockets only
+	 *  - stay away from userspace
+	 *  - have bounded memory usage
+	 *
+	 * Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
+	 * This saves us from propagating the allocation context down to all
+	 * allocation sites.
+	 */
+	if (unlikely(skb->emergency))
+		current->flags |= PF_MEMALLOC;
 
 	/* if we've gotten here through NAPI, check netpoll */
 	if (skb->dev->poll && netpoll_rx(skb))
-		return NET_RX_DROP;
+		goto out;
 
 	if (!skb->tstamp.off_sec)
 		net_timestamp(skb);
@@ -1782,7 +1795,7 @@ int netif_receive_skb(struct sk_buff *sk
 	orig_dev = skb_bond(skb);
 
 	if (!orig_dev)
-		return NET_RX_DROP;
+		goto out;
 
 	__get_cpu_var(netdev_rx_stat).total++;
 
@@ -1799,6 +1812,8 @@ int netif_receive_skb(struct sk_buff *sk
 		goto ncls;
 	}
 #endif
+	if (unlikely(skb->emergency))
+		goto skip_taps;
 
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
 		if (!ptype->dev || ptype->dev == skb->dev) {
@@ -1808,6 +1823,7 @@ int netif_receive_skb(struct sk_buff *sk
 		}
 	}
 
+skip_taps:
 #ifdef CONFIG_NET_CLS_ACT
 	if (pt_prev) {
 		ret = deliver_skb(skb, pt_prev, orig_dev);
@@ -1820,17 +1836,28 @@ int netif_receive_skb(struct sk_buff *sk
 
 	if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
 		kfree_skb(skb);
-		goto out;
+		goto unlock;
 	}
 
 	skb->tc_verd = 0;
 ncls:
 #endif
 
+	if (unlikely(skb->emergency))
+		switch(skb->protocol) {
+			case __constant_htons(ETH_P_ARP):
+			case __constant_htons(ETH_P_IP):
+			case __constant_htons(ETH_P_IPV6):
+				break;
+
+			default:
+				goto drop;
+		}
+
 	handle_diverter(skb);
 
 	if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
-		goto out;
+		goto unlock;
 
 	type = skb->protocol;
 	list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
@@ -1845,6 +1872,7 @@ ncls:
 	if (pt_prev) {
 		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 	} else {
+drop:
 		kfree_skb(skb);
 		/* Jamal, now you will not able to escape explaining
 		 * me how you were going to use this. :-)
@@ -1852,8 +1880,10 @@ ncls:
 		ret = NET_RX_DROP;
 	}
 
-out:
+unlock:
 	rcu_read_unlock();
+out:
+	current->flags = pflags;
 	return ret;
 }
 
Index: linux-2.6-git/net/core/skbuff.c
===================================================================
--- linux-2.6-git.orig/net/core/skbuff.c	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/net/core/skbuff.c	2006-11-30 11:37:51.000000000 +0100
@@ -139,29 +139,34 @@ EXPORT_SYMBOL(skb_truesize_bug);
  *	Buffers may only be allocated from interrupts using a @gfp_mask of
  *	%GFP_ATOMIC.
  */
-struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-			    int fclone)
+struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int flags)
 {
 	kmem_cache_t *cache;
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
+	int emergency = 0;
 
-	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+	size = SKB_DATA_ALIGN(size);
+	cache = (flags & SKB_ALLOC_FCLONE)
+		? skbuff_fclone_cache : skbuff_head_cache;
+	if (flags & SKB_ALLOC_RX)
+		gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN;
 
+retry_alloc:
 	/* Get the HEAD */
 	skb = kmem_cache_alloc(cache, gfp_mask & ~__GFP_DMA);
 	if (!skb)
-		goto out;
+		goto noskb;
 
 	/* Get the DATA. Size must match skb_add_mtu(). */
-	size = SKB_DATA_ALIGN(size);
 	data = kmalloc_track_caller(size + sizeof(struct skb_shared_info),
 			gfp_mask);
 	if (!data)
 		goto nodata;
 
 	memset(skb, 0, offsetof(struct sk_buff, truesize));
+	skb->emergency = emergency;
 	skb->truesize = size + sizeof(struct sk_buff);
 	atomic_set(&skb->users, 1);
 	skb->head = data;
@@ -178,7 +183,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	shinfo->ip6_frag_id = 0;
 	shinfo->frag_list = NULL;
 
-	if (fclone) {
+	if (flags & SKB_ALLOC_FCLONE) {
 		struct sk_buff *child = skb + 1;
 		atomic_t *fclone_ref = (atomic_t *) (child + 1);
 
@@ -186,12 +191,29 @@ struct sk_buff *__alloc_skb(unsigned int
 		atomic_set(fclone_ref, 1);
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
+		child->emergency = skb->emergency;
 	}
 out:
 	return skb;
+
 nodata:
 	kmem_cache_free(cache, skb);
 	skb = NULL;
+noskb:
+	/* Attempt emergency allocation when RX skb. */
+	if (likely(!(flags & SKB_ALLOC_RX) || !sk_vmio_socks()))
+		goto out;
+
+	if (!emergency) {
+		if (sk_emergency_skb_get()) {
+			gfp_mask &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+			gfp_mask |= __GFP_EMERGENCY;
+			emergency = 1;
+			goto retry_alloc;
+		}
+	} else
+		sk_emergency_skb_put();
+
 	goto out;
 }
 
@@ -268,7 +290,7 @@ struct sk_buff *__netdev_alloc_skb(struc
 {
 	struct sk_buff *skb;
 
-	skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -317,6 +339,8 @@ static void skb_release_data(struct sk_b
 			skb_drop_fraglist(skb);
 
 		kfree(skb->head);
+		if (unlikely(skb->emergency))
+			sk_emergency_skb_put();
 	}
 }
 
@@ -437,6 +461,9 @@ struct sk_buff *skb_clone(struct sk_buff
 		n->fclone = SKB_FCLONE_CLONE;
 		atomic_inc(fclone_ref);
 	} else {
+		if (unlikely(skb->emergency))
+			gfp_mask |= __GFP_EMERGENCY;
+
 		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
 		if (!n)
 			return NULL;
@@ -471,6 +498,7 @@ struct sk_buff *skb_clone(struct sk_buff
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	C(ipvs_property);
 #endif
+	C(emergency);
 	C(protocol);
 	n->destructor = NULL;
 #ifdef CONFIG_NETFILTER
@@ -686,12 +714,19 @@ int pskb_expand_head(struct sk_buff *skb
 	u8 *data;
 	int size = nhead + (skb->end - skb->head) + ntail;
 	long off;
+	int emergency = 0;
 
 	if (skb_shared(skb))
 		BUG();
 
 	size = SKB_DATA_ALIGN(size);
 
+	if (unlikely(skb->emergency) && sk_emergency_skb_get()) {
+		gfp_mask |= __GFP_EMERGENCY;
+		emergency = 1;
+	} else
+		gfp_mask |= __GFP_NOMEMALLOC;
+
 	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;
@@ -724,6 +759,8 @@ int pskb_expand_head(struct sk_buff *skb
 	return 0;
 
 nodata:
+	if (unlikely(emergency))
+		sk_emergency_skb_put();
 	return -ENOMEM;
 }
 
Index: linux-2.6-git/net/core/sock.c
===================================================================
--- linux-2.6-git.orig/net/core/sock.c	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/net/core/sock.c	2006-11-30 11:37:51.000000000 +0100
@@ -195,6 +195,120 @@ __u32 sysctl_rmem_default __read_mostly 
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 
+static DEFINE_SPINLOCK(memalloc_lock);
+static int rx_net_reserve;
+
+atomic_t vmio_socks;
+atomic_t emergency_rx_skbs;
+
+static int ipfrag_threshold;
+
+#define ipfrag_mtu()	(1500) /* XXX: should be smallest mtu system wide */
+#define ipfrag_skbs()	(ipfrag_threshold / ipfrag_mtu())
+#define ipfrag_pages()	(ipfrag_threshold / (ipfrag_mtu() * (PAGE_SIZE / ipfrag_mtu())))
+
+static int iprt_pages;
+
+/*
+ * is there room for another emergency skb.
+ */
+int sk_emergency_skb_get(void)
+{
+	int nr = atomic_add_return(1, &emergency_rx_skbs);
+	int thresh = (3 * ipfrag_skbs()) / 2;
+	if (nr < thresh)
+		return 1;
+
+	atomic_dec(&emergency_rx_skbs);
+	return 0;
+}
+
+/**
+ *	sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ *	@socks: number of new %SOCK_VMIO sockets
+ *	@tx_reserve_pages: number of pages to (un)reserve for TX
+ *
+ *	This function adjusts the memalloc reserve based on system demand.
+ *	The RX reserve is a limit, and only added once, not for each socket.
+ *
+ *	NOTE:
+ *	   @tx_reserve_pages is an upper-bound of memory used for TX hence
+ *	   we need not account the pages like we do for RX pages.
+ */
+void sk_adjust_memalloc(int socks, int tx_reserve_pages)
+{
+	unsigned long flags;
+	int reserve = tx_reserve_pages;
+	int nr_socks;
+
+	spin_lock_irqsave(&memalloc_lock, flags);
+	nr_socks = atomic_add_return(socks, &vmio_socks);
+	BUG_ON(nr_socks < 0);
+
+	if (nr_socks) {
+		int rx_pages = 2 * ipfrag_pages() + iprt_pages;
+		reserve += rx_pages - rx_net_reserve;
+		rx_net_reserve = rx_pages;
+	} else {
+		reserve -= rx_net_reserve;
+		rx_net_reserve = 0;
+	}
+
+	if (reserve)
+		adjust_memalloc_reserve(reserve);
+	spin_unlock_irqrestore(&memalloc_lock, flags);
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
+
+/*
+ * tiny helper function to track the total ipfragment memory
+ * needed because of modular ipv6
+ */
+void ipfrag_add_memory(int frags)
+{
+	ipfrag_threshold += frags;
+	sk_adjust_memalloc(0, 0);
+}
+EXPORT_SYMBOL_GPL(ipfrag_add_memory);
+
+void iprt_add_memory(int pages)
+{
+	iprt_pages += pages;
+	sk_adjust_memalloc(0, 0);
+}
+EXPORT_SYMBOL_GPL(iprt_add_memory);
+
+/**
+ *	sk_set_vmio - sets %SOCK_VMIO
+ *	@sk: socket to set it on
+ *
+ *	Set %SOCK_VMIO on a socket and increase the memalloc reserve
+ *	accordingly.
+ */
+int sk_set_vmio(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_VMIO);
+	if (!set) {
+		sk_adjust_memalloc(1, 0);
+		sock_set_flag(sk, SOCK_VMIO);
+		sk->sk_allocation |= __GFP_EMERGENCY;
+	}
+	return !set;
+}
+EXPORT_SYMBOL_GPL(sk_set_vmio);
+
+int sk_clear_vmio(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_VMIO);
+	if (set) {
+		sk_adjust_memalloc(-1, 0);
+		sock_reset_flag(sk, SOCK_VMIO);
+		sk->sk_allocation &= ~__GFP_EMERGENCY;
+	}
+	return set;
+}
+EXPORT_SYMBOL_GPL(sk_clear_vmio);
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -238,6 +352,12 @@ int sock_queue_rcv_skb(struct sock *sk, 
 	int err = 0;
 	int skb_len;
 
+	if (unlikely(skb->emergency)) {
+		if (!sk_has_vmio(sk)) {
+			err = -ENOMEM;
+			goto out;
+		}
+	} else
 	/* Cast skb->rcvbuf to unsigned... It's pointless, but reduces
 	   number of warnings when compiling with -W --ANK
 	 */
@@ -877,6 +997,7 @@ void sk_free(struct sock *sk)
 	struct sk_filter *filter;
 	struct module *owner = sk->sk_prot_creator->owner;
 
+	sk_clear_vmio(sk);
 	if (sk->sk_destruct)
 		sk->sk_destruct(sk);
 
Index: linux-2.6-git/net/ipv4/ipmr.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/ipmr.c	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/net/ipv4/ipmr.c	2006-11-30 11:37:51.000000000 +0100
@@ -1340,6 +1340,9 @@ int ip_mr_input(struct sk_buff *skb)
 	struct mfc_cache *cache;
 	int local = ((struct rtable*)skb->dst)->rt_flags&RTCF_LOCAL;
 
+	if (unlikely(skb->emergency))
+		goto drop;
+
 	/* Packet is looped back after forward, it should not be
 	   forwarded second time, but still can be delivered locally.
 	 */
@@ -1411,6 +1414,7 @@ int ip_mr_input(struct sk_buff *skb)
 dont_forward:
 	if (local)
 		return ip_local_deliver(skb);
+drop:
 	kfree_skb(skb);
 	return 0;
 }
Index: linux-2.6-git/net/ipv4/sysctl_net_ipv4.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/sysctl_net_ipv4.c	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/net/ipv4/sysctl_net_ipv4.c	2006-11-30 11:37:51.000000000 +0100
@@ -18,6 +18,7 @@
 #include <net/route.h>
 #include <net/tcp.h>
 #include <net/cipso_ipv4.h>
+#include <net/sock.h>
 
 /* From af_inet.c */
 extern int sysctl_ip_nonlocal_bind;
@@ -129,6 +130,17 @@ static int sysctl_tcp_congestion_control
 	return ret;
 }
 
+int proc_dointvec_fragment(ctl_table *table, int write, struct file *filp,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int old_thresh = *(int *)table->data;
+	ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+	ipfrag_add_memory(*(int *)table->data - old_thresh);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(proc_dointvec_fragment);
+
 ctl_table ipv4_table[] = {
         {
 		.ctl_name	= NET_IPV4_TCP_TIMESTAMPS,
@@ -234,7 +246,7 @@ ctl_table ipv4_table[] = {
 		.data		= &sysctl_ipfrag_high_thresh,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec
+		.proc_handler	= &proc_dointvec_fragment
 	},
 	{
 		.ctl_name	= NET_IPV4_IPFRAG_LOW_THRESH,
Index: linux-2.6-git/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/tcp_ipv4.c	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/net/ipv4/tcp_ipv4.c	2006-11-30 12:13:29.000000000 +0100
@@ -1046,6 +1046,22 @@ csum_err:
 	goto discard;
 }
 
+static int tcp_v4_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	int ret;
+	unsigned long pflags = current->flags;
+	if (unlikely(skb->emergency)) {
+		BUG_ON(!sk_has_vmio(sk)); /* we dropped those before queueing */
+		if (!(pflags & PF_MEMALLOC))
+			current->flags |= PF_MEMALLOC;
+	}
+
+	ret = tcp_v4_do_rcv(sk, skb);
+
+	current->flags = pflags;
+	return ret;
+}
+
 /*
  *	From tcp_input.c
  */
@@ -1096,6 +1112,15 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	if (!sk)
 		goto no_tcp_socket;
 
+	if (unlikely(skb->emergency)) {
+	       	if (!sk_has_vmio(sk))
+			goto discard_and_relse;
+		/*
+		   decrease window size..
+		   tcp_enter_quickack_mode(sk);
+		*/
+	}
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;
@@ -1848,7 +1873,7 @@ struct proto tcp_prot = {
 	.getsockopt		= tcp_getsockopt,
 	.sendmsg		= tcp_sendmsg,
 	.recvmsg		= tcp_recvmsg,
-	.backlog_rcv		= tcp_v4_do_rcv,
+	.backlog_rcv		= tcp_v4_backlog_rcv,
 	.hash			= tcp_v4_hash,
 	.unhash			= tcp_unhash,
 	.get_port		= tcp_v4_get_port,
Index: linux-2.6-git/net/ipv6/sysctl_net_ipv6.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/sysctl_net_ipv6.c	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/net/ipv6/sysctl_net_ipv6.c	2006-11-30 11:37:51.000000000 +0100
@@ -15,6 +15,10 @@
 
 #ifdef CONFIG_SYSCTL
 
+extern int proc_dointvec_fragment(ctl_table *table, int write,
+	       	struct file *filp, void __user *buffer, size_t *lenp,
+	       	loff_t *ppos);
+
 static ctl_table ipv6_table[] = {
 	{
 		.ctl_name	= NET_IPV6_ROUTE,
@@ -44,7 +48,7 @@ static ctl_table ipv6_table[] = {
 		.data		= &sysctl_ip6frag_high_thresh,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec
+		.proc_handler	= &proc_dointvec_fragment
 	},
 	{
 		.ctl_name	= NET_IPV6_IP6FRAG_LOW_THRESH,
Index: linux-2.6-git/net/netfilter/core.c
===================================================================
--- linux-2.6-git.orig/net/netfilter/core.c	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/net/netfilter/core.c	2006-11-30 11:37:51.000000000 +0100
@@ -181,6 +181,11 @@ next_hook:
 		kfree_skb(*pskb);
 		ret = -EPERM;
 	} else if ((verdict & NF_VERDICT_MASK)  == NF_QUEUE) {
+		if (unlikely((*pskb)->emergency)) {
+			printk(KERN_ERR "nf_hook: NF_QUEUE encountered for "
+					"emergency skb - skipping rule.\n");
+			goto next_hook;
+		}
 		NFDEBUG("nf_hook: Verdict = QUEUE.\n");
 		if (!nf_queue(*pskb, elem, pf, hook, indev, outdev, okfn,
 			      verdict >> NF_VERDICT_BITS))
Index: linux-2.6-git/security/selinux/avc.c
===================================================================
--- linux-2.6-git.orig/security/selinux/avc.c	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/security/selinux/avc.c	2006-11-30 11:37:51.000000000 +0100
@@ -333,7 +333,7 @@ static struct avc_node *avc_alloc_node(v
 {
 	struct avc_node *node;
 
-	node = kmem_cache_alloc(avc_node_cachep, SLAB_ATOMIC);
+	node = kmem_cache_alloc(avc_node_cachep, SLAB_ATOMIC | __GFP_NOMEMALLOC);
 	if (!node)
 		goto out;
 
Index: linux-2.6-git/net/ipv4/ip_fragment.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/ip_fragment.c	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/net/ipv4/ip_fragment.c	2006-11-30 11:37:51.000000000 +0100
@@ -743,6 +743,7 @@ void ipfrag_init(void)
 	ipfrag_secret_timer.function = ipfrag_secret_rebuild;
 	ipfrag_secret_timer.expires = jiffies + sysctl_ipfrag_secret_interval;
 	add_timer(&ipfrag_secret_timer);
+	ipfrag_add_memory(sysctl_ipfrag_high_thresh);
 }
 
 EXPORT_SYMBOL(ip_defrag);
Index: linux-2.6-git/net/ipv6/reassembly.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/reassembly.c	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/net/ipv6/reassembly.c	2006-11-30 11:37:51.000000000 +0100
@@ -759,4 +759,5 @@ void __init ipv6_frag_init(void)
 	ip6_frag_secret_timer.function = ip6_frag_secret_rebuild;
 	ip6_frag_secret_timer.expires = jiffies + sysctl_ip6frag_secret_interval;
 	add_timer(&ip6_frag_secret_timer);
+	ipfrag_add_memory(sysctl_ip6frag_high_thresh);
 }
Index: linux-2.6-git/net/ipv4/route.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/route.c	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/net/ipv4/route.c	2006-11-30 11:37:51.000000000 +0100
@@ -2906,6 +2906,17 @@ static int ipv4_sysctl_rtcache_flush_str
 	return 0;
 }
 
+static int proc_dointvec_rt_size(ctl_table *table, int write, struct file *filp,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int old = *(int *)table->data;
+	ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+	iprt_add_memory(kmem_cache_objs_to_pages(ipv4_dst_ops.kmem_cachep,
+				*(int *)table->data - old));
+	return ret;
+}
+
 ctl_table ipv4_route_table[] = {
         {
 		.ctl_name 	= NET_IPV4_ROUTE_FLUSH,
@@ -2948,7 +2959,7 @@ ctl_table ipv4_route_table[] = {
 		.data		= &ip_rt_max_size,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_dointvec_rt_size,
 	},
 	{
 		/*  Deprecated. Use gc_min_interval_ms */
@@ -3175,6 +3186,8 @@ int __init ip_rt_init(void)
 
 	ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
 	ip_rt_max_size = (rt_hash_mask + 1) * 16;
+	iprt_add_memory(kmem_cache_objs_to_pages(ipv4_dst_ops.kmem_cachep,
+				ip_rt_max_size));
 
 	devinet_init();
 	ip_fib_init();
Index: linux-2.6-git/net/ipv6/route.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/route.c	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/net/ipv6/route.c	2006-11-30 11:37:51.000000000 +0100
@@ -2365,6 +2365,17 @@ int ipv6_sysctl_rtcache_flush(ctl_table 
 		return -EINVAL;
 }
 
+static int proc_dointvec_rt_size(ctl_table *table, int write, struct file *filp,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int old = *(int *)table->data;
+	ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+	iprt_add_memory(kmem_cache_objs_to_pages(ip6_dst_ops.kmem_cachep,
+				*(int *)table->data - old));
+	return ret;
+}
+
 ctl_table ipv6_route_table[] = {
         {
 		.ctl_name	=	NET_IPV6_ROUTE_FLUSH, 
@@ -2388,7 +2399,7 @@ ctl_table ipv6_route_table[] = {
          	.data		=	&ip6_rt_max_size,
 		.maxlen		=	sizeof(int),
 		.mode		=	0644,
-         	.proc_handler	=	&proc_dointvec,
+         	.proc_handler	=	&proc_dointvec_rt_size,
 	},
 	{
 		.ctl_name	=	NET_IPV6_ROUTE_GC_MIN_INTERVAL,
@@ -2473,6 +2484,8 @@ void __init ip6_route_init(void)
 
 	proc_net_fops_create("rt6_stats", S_IRUGO, &rt6_stats_seq_fops);
 #endif
+	iprt_add_memory(kmem_cache_objs_to_pages(ip6_dst_ops.kmem_cachep,
+				ip6_rt_max_size));
 #ifdef CONFIG_XFRM
 	xfrm6_init();
 #endif
Index: linux-2.6-git/net/ipv6/tcp_ipv6.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/tcp_ipv6.c	2006-11-30 11:58:57.000000000 +0100
+++ linux-2.6-git/net/ipv6/tcp_ipv6.c	2006-11-30 12:13:29.000000000 +0100
@@ -1180,6 +1180,22 @@ ipv6_pktoptions:
 	return 0;
 }
 
+static int tcp_v6_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	int ret;
+	unsigned long pflags = current->flags;
+	if (unlikely(skb->emergency)) {
+		BUG_ON(!sk_has_vmio(sk)); /* we dropped those before queueing */
+		if (!(pflags & PF_MEMALLOC))
+			current->flags |= PF_MEMALLOC;
+	}
+
+	ret = tcp_v6_do_rcv(sk, skb);
+
+	current->flags = pflags;
+	return ret;
+}
+
 static int tcp_v6_rcv(struct sk_buff **pskb)
 {
 	struct sk_buff *skb = *pskb;
@@ -1225,6 +1241,15 @@ static int tcp_v6_rcv(struct sk_buff **p
 	if (!sk)
 		goto no_tcp_socket;
 
+	if (unlikely(skb->emergency)) {
+	       	if (!sk_has_vmio(sk))
+			goto discard_and_relse;
+		/*
+		   decrease window size..
+		   tcp_enter_quickack_mode(sk);
+		*/
+	}
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;
@@ -1602,7 +1627,7 @@ struct proto tcpv6_prot = {
 	.getsockopt		= tcp_getsockopt,
 	.sendmsg		= tcp_sendmsg,
 	.recvmsg		= tcp_recvmsg,
-	.backlog_rcv		= tcp_v6_do_rcv,
+	.backlog_rcv		= tcp_v6_backlog_rcv,
 	.hash			= tcp_v6_hash,
 	.unhash			= tcp_unhash,
 	.get_port		= tcp_v6_get_port,



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/6] mm: slab allocation fairness
  2006-11-30 10:14 ` [RFC][PATCH 1/6] mm: slab allocation fairness Peter Zijlstra
@ 2006-11-30 18:52   ` Christoph Lameter
  2006-11-30 18:55     ` Peter Zijlstra
  2006-11-30 19:02     ` Peter Zijlstra
  0 siblings, 2 replies; 24+ messages in thread
From: Christoph Lameter @ 2006-11-30 18:52 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: netdev, linux-mm, David Miller

On Thu, 30 Nov 2006, Peter Zijlstra wrote:

> The slab has some unfairness wrt gfp flags; when the slab is grown the gfp 
> flags are used to allocate more memory, however when there is slab space 
> available, gfp flags are ignored. Thus it is possible for less critical 
> slab allocations to succeed and gobble up precious memory.

The gfpflags are ignored if there are

1) objects in the per cpu, shared or alien caches

2) objects are in partial or free slabs in the per node queues.

> This patch avoids this by keeping track of the allocation hardness when 
> growing. This is then compared to the current slab alloc's gfp flags.

The approach is to force the allocation of additional slab to increase the 
number of free slabs? The next free will drop the number of free slabs 
back again to the allowed amount.

I would think that one would need a rank with each cached object and 
free slab in order to do this the right way.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/6] mm: slab allocation fairness
  2006-11-30 18:52   ` Christoph Lameter
@ 2006-11-30 18:55     ` Peter Zijlstra
  2006-11-30 19:33       ` Christoph Lameter
  2006-11-30 19:02     ` Peter Zijlstra
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2006-11-30 18:55 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev, linux-mm, David Miller

On Thu, 2006-11-30 at 10:52 -0800, Christoph Lameter wrote:
> On Thu, 30 Nov 2006, Peter Zijlstra wrote:
> 
> > The slab has some unfairness wrt gfp flags; when the slab is grown the gfp 
> > flags are used to allocate more memory, however when there is slab space 
> > available, gfp flags are ignored. Thus it is possible for less critical 
> > slab allocations to succeed and gobble up precious memory.
> 
> The gfpflags are ignored if there are
> 
> 1) objects in the per cpu, shared or alien caches
> 
> 2) objects are in partial or free slabs in the per node queues.

Yeah, basically as long as free objects can be found, no matter how
'hard' it was to obtain these objects.

> > This patch avoids this by keeping track of the allocation hardness when 
> > growing. This is then compared to the current slab alloc's gfp flags.
> 
> The approach is to force the allocation of additional slab to increase the 
> number of free slabs? The next free will drop the number of free slabs 
> back again to the allowed amount.

No, the forced allocation is to test the allocation hardness at that
point in time. I could not think of another way to test that than to
actually do an allocation.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages()
  2006-11-30 18:55   ` Christoph Lameter
@ 2006-11-30 18:55     ` Peter Zijlstra
  2006-11-30 19:06       ` Christoph Lameter
  2006-12-01 12:14     ` Peter Zijlstra
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2006-11-30 18:55 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev, linux-mm, David Miller

On Thu, 2006-11-30 at 10:55 -0800, Christoph Lameter wrote:
> On Thu, 30 Nov 2006, Peter Zijlstra wrote:
> 
> > +unsigned int kmem_cache_objs_to_pages(struct kmem_cache *cachep, int nr)
> > +{
> > +	return ((nr + cachep->num - 1) / cachep->num) << cachep->gfporder;
> 
> cachep->num refers to the number of objects in a slab of gfporder.

Ah, my bad, thanks!

> thus
> 
> return (nr + cachep->num - 1) / cachep->num;
> 
> But then this is very optimistic estimate that assumes a single node and 
> no free objects in between.

Right, perhaps my bad in wording the intent; the needed information is
how many more pages I would need to grow the slab with in order to store
that many new objects.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages()
  2006-11-30 10:14 ` [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages() Peter Zijlstra
@ 2006-11-30 18:55   ` Christoph Lameter
  2006-11-30 18:55     ` Peter Zijlstra
  2006-12-01 12:14     ` Peter Zijlstra
  0 siblings, 2 replies; 24+ messages in thread
From: Christoph Lameter @ 2006-11-30 18:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: netdev, linux-mm, David Miller

On Thu, 30 Nov 2006, Peter Zijlstra wrote:

> +unsigned int kmem_cache_objs_to_pages(struct kmem_cache *cachep, int nr)
> +{
> +	return ((nr + cachep->num - 1) / cachep->num) << cachep->gfporder;

cachep->num refers to the number of objects in a slab of gfporder.

thus

return (nr + cachep->num - 1) / cachep->num;

But then this is very optimistic estimate that assumes a single node and 
no free objects in between.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/6] mm: slab allocation fairness
  2006-11-30 18:52   ` Christoph Lameter
  2006-11-30 18:55     ` Peter Zijlstra
@ 2006-11-30 19:02     ` Peter Zijlstra
  2006-11-30 19:37       ` Christoph Lameter
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2006-11-30 19:02 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev, linux-mm, David Miller

On Thu, 2006-11-30 at 10:52 -0800, Christoph Lameter wrote:

> I would think that one would need a rank with each cached object and 
> free slab in order to do this the right way.

Allocation hardness is a temporal attribute, ie. it changes over time.
Hence I do it per slab.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages()
  2006-11-30 19:06       ` Christoph Lameter
@ 2006-11-30 19:03         ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2006-11-30 19:03 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev, linux-mm, David Miller

On Thu, 2006-11-30 at 11:06 -0800, Christoph Lameter wrote:
> On Thu, 30 Nov 2006, Peter Zijlstra wrote:
> 
> > Right, perhaps my bad in wording the intent; the needed information is
> > how many more pages I would need to grow the slab with in order to store
> > that many new objects.
> 
> Would you not have to take objects currently available in 
> caches into account? If you are short on memory then a flushing of all the 
> caches may give you the memory you need (especially on a system with a 
> large number of processors).

Sure, but this gives a safe upper bound.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages()
  2006-11-30 18:55     ` Peter Zijlstra
@ 2006-11-30 19:06       ` Christoph Lameter
  2006-11-30 19:03         ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2006-11-30 19:06 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: netdev, linux-mm, David Miller

On Thu, 30 Nov 2006, Peter Zijlstra wrote:

> Right, perhaps my bad in wording the intent; the needed information is
> how many more pages I would need to grow the slab with in order to store
> that many new objects.

Would you not have to take objects currently available in 
caches into account? If you are short on memory then a flushing of all the 
caches may give you the memory you need (especially on a system with a 
large number of processors).


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/6] mm: slab allocation fairness
  2006-11-30 18:55     ` Peter Zijlstra
@ 2006-11-30 19:33       ` Christoph Lameter
  2006-11-30 19:33         ` Peter Zijlstra
  2006-12-01 11:28         ` Peter Zijlstra
  0 siblings, 2 replies; 24+ messages in thread
From: Christoph Lameter @ 2006-11-30 19:33 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: netdev, linux-mm, David Miller

On Thu, 30 Nov 2006, Peter Zijlstra wrote:

> No, the forced allocation is to test the allocation hardness at that
> point in time. I could not think of another way to test that than to
> actually do an allocation.

Typically we do this by checking the number of free pages in a zone 
compared to the high low limits. See mmzone.h.
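
For reference, a minimal sketch of that kind of check (not taken from any
of these patches; zone, order, classzone_idx and alloc_flags are assumed
to come from the caller, as in the page allocator):

	if (!zone_watermark_ok(zone, order, zone->pages_low,
			       classzone_idx, alloc_flags)) {
		/* below the low watermark: memory is getting tight
		 * for this class of allocation */
	}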
 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/6] mm: slab allocation fairness
  2006-11-30 19:33       ` Christoph Lameter
@ 2006-11-30 19:33         ` Peter Zijlstra
  2006-12-01 11:28         ` Peter Zijlstra
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2006-11-30 19:33 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev, linux-mm, David Miller

On Thu, 2006-11-30 at 11:33 -0800, Christoph Lameter wrote:
> On Thu, 30 Nov 2006, Peter Zijlstra wrote:
> 
> > No, the forced allocation is to test the allocation hardness at that
> > point in time. I could not think of another way to test that than to
> > actually do an allocation.
> 
> Typically we do this by checking the number of free pages in a zone 
> compared to the high low limits. See mmzone.h.

True, I did think about that and started out that way but saw myself
duplicating a lot of the page allocation code. I'll give it another
try... see if I can factor out the common parts without too much
duplication.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/6] mm: slab allocation fairness
  2006-11-30 19:02     ` Peter Zijlstra
@ 2006-11-30 19:37       ` Christoph Lameter
  2006-11-30 19:40         ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2006-11-30 19:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: netdev, linux-mm, David Miller

On Thu, 30 Nov 2006, Peter Zijlstra wrote:

> On Thu, 2006-11-30 at 10:52 -0800, Christoph Lameter wrote:
> 
> > I would think that one would need a rank with each cached object and 
> > free slab in order to do this the right way.
> 
> Allocation hardness is a temporal attribute, ie. it changes over time.
> Hence I do it per slab.

cached objects are also temporal and change over time.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/6] mm: slab allocation fairness
  2006-11-30 19:37       ` Christoph Lameter
@ 2006-11-30 19:40         ` Peter Zijlstra
  2006-11-30 20:11           ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2006-11-30 19:40 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev, linux-mm, David Miller

On Thu, 2006-11-30 at 11:37 -0800, Christoph Lameter wrote:
> On Thu, 30 Nov 2006, Peter Zijlstra wrote:
> 
> > On Thu, 2006-11-30 at 10:52 -0800, Christoph Lameter wrote:
> > 
> > > I would think that one would need a rank with each cached object and 
> > > free slab in order to do this the right way.
> > 
> > Allocation hardness is a temporal attribute, ie. it changes over time.
> > Hence I do it per slab.
> 
> cached objects are also temporal and change over time.

Sure, but there is nothing wrong with using a slab page with a lower
allocation rank when there is memory aplenty. 

I'm just not seeing how keeping all individual page ranks would make
this better.

The only thing that matters is the actual free pages limit, not that of
a few allocations ago. The stored rank is a safe shortcut, because it allows
harder allocations to use easily obtainable free space, not the other way
around.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/6] mm: slab allocation fairness
  2006-11-30 19:40         ` Peter Zijlstra
@ 2006-11-30 20:11           ` Christoph Lameter
  2006-11-30 20:15             ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2006-11-30 20:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: netdev, linux-mm, David Miller

On Thu, 30 Nov 2006, Peter Zijlstra wrote:

> Sure, but there is nothing wrong with using a slab page with a lower
> allocation rank when there is memory aplenty. 

What does "a slab page with a lower allocation rank" mean? Slab pages have 
no allocation ranks that I am aware of.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/6] mm: slab allocation fairness
  2006-11-30 20:11           ` Christoph Lameter
@ 2006-11-30 20:15             ` Peter Zijlstra
  2006-11-30 20:29               ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2006-11-30 20:15 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev, linux-mm, David Miller

On Thu, 2006-11-30 at 12:11 -0800, Christoph Lameter wrote:
> On Thu, 30 Nov 2006, Peter Zijlstra wrote:
> 
> > Sure, but there is nothing wrong with using a slab page with a lower
> > allocation rank when there is memory aplenty. 
> 
> What does "a slab page with a lower allocation rank" mean? Slab pages have 
> no allocation ranks that I am aware of.

I just added allocation rank and didn't you suggest tracking it for all
slab pages instead of per slab?

The rank is an expression of how hard it was to get that page, with 0
being the hardest allocation (ALLOC_NO_WATERMARK) and 16 the easiest
(ALLOC_WMARK_HIGH).

I store the rank of the last allocated page and retest the rank when the
gfp flags indicate a higher rank, that is, when the current slab
allocation would have failed to grow the slab under the conditions of
the previous allocation.
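
Very roughly, and with made-up names (gfp_to_rank(), cachep->rank,
force_grow()) since the slab patch itself is not quoted here, the check
amounts to something like:

	rank = gfp_to_rank(flags);	/* 0 = hardest, 16 = easiest */
	if (rank > cachep->rank) {
		/*
		 * Less privileged than the allocation that last grew the
		 * cache: under those conditions this request would have
		 * failed, so force a fresh page allocation to re-measure
		 * the hardness (and refresh cachep->rank) instead of
		 * silently handing out a queued object.
		 */
		if (!force_grow(cachep, flags))
			return NULL;
	}
	/* otherwise cached objects may be handed out as usual */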


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/6] mm: slab allocation fairness
  2006-11-30 20:15             ` Peter Zijlstra
@ 2006-11-30 20:29               ` Christoph Lameter
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Lameter @ 2006-11-30 20:29 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: netdev, linux-mm, David Miller

On Thu, 30 Nov 2006, Peter Zijlstra wrote:

> > > Sure, but there is nothing wrong with using a slab page with a lower
> > > allocation rank when there is memory aplenty. 
> > What does "a slab page with a lower allocation rank" mean? Slab pages have 
> > no allocation ranks that I am aware of.
> I just added allocation rank and didn't you suggest tracking it for all
> slab pages instead of per slab?

Yes, but that is not in place, so I was wondering what you were talking
about. It would help to have some longer text describing what you intend 
to do and how rank would work throughout the VM.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/6] mm: slab allocation fairness
  2006-11-30 19:33       ` Christoph Lameter
  2006-11-30 19:33         ` Peter Zijlstra
@ 2006-12-01 11:28         ` Peter Zijlstra
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2006-12-01 11:28 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev, linux-mm, David Miller

On Thu, 2006-11-30 at 11:33 -0800, Christoph Lameter wrote:
> On Thu, 30 Nov 2006, Peter Zijlstra wrote:
> 
> > No, the forced allocation is to test the allocation hardness at that
> > point in time. I could not think of another way to test that than to
> > actually to an allocation.
> 
> Typically we do this by checking the number of free pages in a zone 
> compared to the high low limits. See mmzone.h.

This doesn't work under high load because of direct reclaim. And if I go
run direct reclaim to test whether I can raise the number of free pages to
an acceptable level for the given gfp flags, I might as well do the whole
allocation.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages()
  2006-11-30 18:55   ` Christoph Lameter
  2006-11-30 18:55     ` Peter Zijlstra
@ 2006-12-01 12:14     ` Peter Zijlstra
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2006-12-01 12:14 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev, linux-mm, David Miller

On Thu, 2006-11-30 at 10:55 -0800, Christoph Lameter wrote:
> On Thu, 30 Nov 2006, Peter Zijlstra wrote:
> 
> > +unsigned int kmem_cache_objs_to_pages(struct kmem_cache *cachep, int nr)
> > +{
> > +	return ((nr + cachep->num - 1) / cachep->num) << cachep->gfporder;
> 
> cachep->num refers to the number of objects in a slab of gfporder.
> 
> thus
> 
> return (nr + cachep->num - 1) / cachep->num;

No, that would give the number of slabs needed; I want pages.
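
A quick worked example with made-up numbers:

	/*
	 * With cachep->num = 30 objects per slab and cachep->gfporder = 1
	 * (two pages per slab), nr = 100 objects gives:
	 *
	 *	slabs = (100 + 30 - 1) / 30		= 4
	 *	pages = slabs << gfporder = 4 << 1	= 8
	 *
	 * i.e. dropping the "<< cachep->gfporder" returns 4 (slabs) where
	 * 8 (pages) is wanted.
	 */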


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2006-12-01 12:14 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-11-30 10:14 [RFC][PATCH 0/6] VM deadlock avoidance -v9 Peter Zijlstra
2006-11-30 10:14 ` [RFC][PATCH 1/6] mm: slab allocation fairness Peter Zijlstra
2006-11-30 18:52   ` Christoph Lameter
2006-11-30 18:55     ` Peter Zijlstra
2006-11-30 19:33       ` Christoph Lameter
2006-11-30 19:33         ` Peter Zijlstra
2006-12-01 11:28         ` Peter Zijlstra
2006-11-30 19:02     ` Peter Zijlstra
2006-11-30 19:37       ` Christoph Lameter
2006-11-30 19:40         ` Peter Zijlstra
2006-11-30 20:11           ` Christoph Lameter
2006-11-30 20:15             ` Peter Zijlstra
2006-11-30 20:29               ` Christoph Lameter
2006-11-30 10:14 ` [RFC][PATCH 2/6] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
2006-11-30 10:14 ` [RFC][PATCH 3/6] mm: serialize access to min_free_kbytes Peter Zijlstra
2006-11-30 10:14 ` [RFC][PATCH 4/6] mm: emergency pool and __GFP_EMERGENCY Peter Zijlstra
2006-11-30 10:14 ` [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages() Peter Zijlstra
2006-11-30 18:55   ` Christoph Lameter
2006-11-30 18:55     ` Peter Zijlstra
2006-11-30 19:06       ` Christoph Lameter
2006-11-30 19:03         ` Peter Zijlstra
2006-12-01 12:14     ` Peter Zijlstra
2006-11-30 10:14 ` [RFC][PATCH 6/6] net: vm deadlock avoidance core Peter Zijlstra
2006-11-30 12:04   ` Peter Zijlstra
