linux-mm.kvack.org archive mirror
* [RFC][PATCH 0/4] VM deadlock prevention -v4
@ 2006-08-12 14:14 Peter Zijlstra
  2006-08-12 14:14 ` [RFC][PATCH 1/4] pfn_to_kaddr() for UML Peter Zijlstra
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Peter Zijlstra @ 2006-08-12 14:14 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Indan Zupancic, Peter Zijlstra, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

Hi,

here is the latest effort; it includes a whole new trivial allocator with a
horrid name and an almost full rewrite of the deadlock prevention core.
This version does not do anything per device and hence does not depend
on the new netdev_alloc_skb() API.

The reason to add a second allocator to the receive side is twofold:
1) it allows easy detection of the memory pressure / OOM situation;
2) it allows the receive path to be unbounded and go at full speed when
   resources permit.

The choice of using the global memalloc reserve as a mempool means that
the new allocator has to release pages as soon as possible; if we were
to hoard pages in the allocator, the memalloc reserve would not get
replenished readily.

Peter

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* [RFC][PATCH 1/4] pfn_to_kaddr() for UML
  2006-08-12 14:14 [RFC][PATCH 0/4] VM deadlock prevention -v4 Peter Zijlstra
@ 2006-08-12 14:14 ` Peter Zijlstra
  2006-08-12 14:14 ` [RFC][PATCH 2/4] SROG allocator Peter Zijlstra
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2006-08-12 14:14 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Indan Zupancic, Peter Zijlstra, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

Update UML with a proper 'pfn_to_kaddr()' definition; the SROG allocator
uses it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daniel Phillips <phillips@google.com>

---
 include/asm-um/page.h |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/include/asm-um/page.h
===================================================================
--- linux-2.6.orig/include/asm-um/page.h
+++ linux-2.6/include/asm-um/page.h
@@ -111,6 +111,8 @@ extern unsigned long uml_physmem;
 #define pfn_valid(pfn) ((pfn) < max_mapnr)
 #define virt_addr_valid(v) pfn_valid(phys_to_pfn(__pa(v)))
 
+#define pfn_to_kaddr(pfn)	__va((pfn) << PAGE_SHIFT)
+
 extern struct page *arch_validate(struct page *page, gfp_t mask, int order);
 #define HAVE_ARCH_VALIDATE
 


* [RFC][PATCH 2/4] SROG allocator
  2006-08-12 14:14 [RFC][PATCH 0/4] VM deadlock prevention -v4 Peter Zijlstra
  2006-08-12 14:14 ` [RFC][PATCH 1/4] pfn_to_kaddr() for UML Peter Zijlstra
@ 2006-08-12 14:14 ` Peter Zijlstra
  2006-08-12 14:14 ` [RFC][PATCH 3/4] deadlock prevention core Peter Zijlstra
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2006-08-12 14:14 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Indan Zupancic, Peter Zijlstra, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

A simple memory allocator with a horrid name.
(If someone knows the proper name for this thing, please speak up;
it's too trivial not to be described somewhere.)

It is meant for cases where you have multiple objects of various sizes
that have similar lifetimes and where pages should be released as soon
as possible.

In a few words: it allocates pages and packs the objects into them; the
pages are kept on a list and the free blocks are kept in address order.
On free, an insertion sort is used and front and back merges are tried.
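
A usage sketch, for illustration only (not part of the patch; error
handling is elided and GFP_ATOMIC callers are assumed):

	#include <linux/srog.h>

	void *first, *second;

	/* A NULL group pointer makes srog_alloc() create a new SROG. */
	first = srog_alloc(NULL, 128, GFP_ATOMIC);

	/* Later allocations pass an object of the group to pack into it. */
	second = srog_alloc(first, 64, GFP_ATOMIC);

	/* Frees merge with neighbouring free blocks; a page is returned
	 * to the system as soon as it becomes empty. */
	srog_free(first, second);
	srog_free(NULL, first);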

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

---
 include/linux/srog.h |   14 ++
 mm/Makefile          |    3 
 mm/srog.c            |  290 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 306 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/srog.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/srog.h	2006-08-12 11:49:10.000000000 +0200
@@ -0,0 +1,14 @@
+#ifndef _LINUX_SROG_H_
+#define _LINUX_SROG_H_
+
+#ifdef __KERNEL__
+
+#include <linux/types.h>
+
+extern void *srog_alloc(void *srogp, unsigned long size, gfp_t gfp_mask);
+extern void srog_free(void *srogp, void *obj);
+extern void srog_link(void *srogp1, void *srogp2);
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_SROG_H_ */
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2006-08-12 11:49:05.000000000 +0200
+++ linux-2.6/mm/Makefile	2006-08-12 11:49:10.000000000 +0200
@@ -10,7 +10,8 @@ mmu-$(CONFIG_MMU)	:= fremap.o highmem.o 
 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o \
 			   readahead.o swap.o truncate.o vmscan.o \
-			   prio_tree.o util.o mmzone.o vmstat.o $(mmu-y)
+			   prio_tree.o util.o mmzone.o vmstat.o srog.o \
+			   $(mmu-y)
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
Index: linux-2.6/mm/srog.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/srog.c	2006-08-12 11:54:27.000000000 +0200
@@ -0,0 +1,289 @@
+/*
+ * mm/srog.c
+ *
+ * Written by Peter Zijlstra <a.p.zijlstra@chello.nl>
+ * Released under the GPLv2, see the file COPYING for details.
+ *
+ * Small Related Object Group Allocator
+ *
+ * Trivial allocator that has to release pages ASAP and pack objects
+ * of various sizes. It need not be fast, just reliable and relatively
+ * small.
+ *
+ * The approach is a list of pages (need not be 0-order), each of
+ * these pages is subdivided on demand. The free blocks in each
+ * of these pages are kept in address order.
+ *
+ * On free, the block tries to merge with the previous and next block.
+ *
+ * When a page is found to be empty, it's taken off the list and returned
+ * to the system.
+ */
+
+#include <linux/config.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/bitops.h>
+#include <linux/module.h>
+
+/**
+ *	srog_free_area - structure describing a free block of memory
+ *	@size: size of the block
+ *	@next: pointer to the next free block
+ */
+struct srog_free_area {
+	unsigned int size;
+	union {
+		struct srog_free_area *next;
+		char obj[0];
+	};
+};
+
+#define SROG_MAGIC	0xf00fde3d
+
+/**
+ *	srog_head - structure managing a page
+ *	@list: list of all related pages
+ *	@free: first free area descriptor
+ *	@order: allocation order of this page
+ *	@gfp_mask: allocation flags
+ *	@magic: simple sanity check
+ */
+struct srog_head {
+	struct list_head list;
+	struct srog_free_area free;
+	unsigned int order;
+	gfp_t gfp_mask;
+	unsigned int magic;
+};
+
+#define ceiling_log2(x)	fls_long((x) - 1)
+
+#define SROG_SIZE(srog) \
+	((PAGE_SIZE << (srog)->order) - sizeof(struct srog_head))
+
+/**
+ *	__srog_alloc - allocate a new page and initialize the head
+ *	@size: minimum size that must fit
+ *	@gfp_mask: allocation flags
+ */
+static struct srog_head *__srog_alloc(unsigned long size, gfp_t gfp_mask)
+{
+	unsigned long pages;
+	unsigned int order;
+	struct page *page;
+	struct srog_head *srog = NULL;
+
+	size += sizeof(struct srog_head);
+
+	pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	order = ceiling_log2(pages);
+
+	page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
+	if (!page)
+		goto out;
+
+	srog = (struct srog_head *)pfn_to_kaddr(page_to_pfn(page));
+
+	INIT_LIST_HEAD(&srog->list);
+	srog->order = order;
+	srog->free.next = (struct srog_free_area *)(srog + 1);
+	srog->free.size = 0; /* avoid back merge */
+	srog->free.next->next = NULL;
+	srog->free.next->size = SROG_SIZE(srog);
+	srog->gfp_mask = gfp_mask;
+	srog->magic = SROG_MAGIC;
+
+out:
+	return srog;
+}
+
+#define next_srog(s) list_entry((s)->list.next, typeof(*(s)), list)
+
+#define INC_PTR(p, a) ((typeof(p))(((void *)p) + a))
+
+/*
+ * Tricky macro; will go funny on higher order allocations.
+ */
+#define PTR_TO_SROG(p) ((struct srog_head *)((unsigned long)(p) & PAGE_MASK))
+
+/**
+ *	srog_alloc - allocate an object from a SROG; possibly creating one
+ *	@srogp: pointer to a SROG to use; NULL means create new one
+ *	@gfp_mask: allocation flags to use on create
+ *
+ *	Create a SROG when needed.
+ *
+ *	Allocate an object of requested size from the free space, if
+ *	not enough free space is found, add another page to this SROG.
+ */
+void *srog_alloc(void *srogp, unsigned long size, gfp_t gfp_mask)
+{
+	struct srog_head *srog = PTR_TO_SROG(srogp);
+	struct srog_head *srog_iter;
+	void *obj = NULL;
+
+	/*
+	 * Minimum size requirement: we need to store a free area
+	 * descriptor in there. Increase the size by one int to store
+	 * the object size in, and make sure we stay aligned.
+	 */
+	size = max(size, (unsigned long)sizeof(struct srog_free_area));
+	size += sizeof(unsigned int);
+	size = ALIGN(size, sizeof(void *));
+
+	if (!srog)
+		srog = __srog_alloc(size, gfp_mask);
+	if (!srog)
+		goto out;
+
+	BUG_ON(srog->magic != SROG_MAGIC);
+
+	srog_iter = srog;
+do_alloc:
+	do {
+		/*
+		 * Walk the free block list, take the first block that
+		 * is large enough to accommodate the requested size.
+		 */
+		struct srog_free_area *sfa, *prev = &srog_iter->free;
+		for (sfa = prev->next; sfa; prev = sfa, sfa = sfa->next) {
+			if (sfa->size < size)
+				continue;
+
+			obj = (void *)sfa->obj;
+			sfa->size -= size;
+
+			/*
+			 * Any remaining space is split into a new free block.
+			 */
+			if (sfa->size) {
+				struct srog_free_area *new_sfa =
+					INC_PTR(sfa, size);
+				new_sfa->next = sfa->next;
+				new_sfa->size = sfa->size;
+				prev->next = new_sfa;
+			} else {
+				prev->next = sfa->next;
+			}
+
+			sfa->size = size;
+			goto out;
+		}
+
+		srog_iter = next_srog(srog_iter);
+	} while (srog_iter != srog);
+
+	if (!obj) {
+		struct srog_head *new_srog;
+		/*
+		 * We cannot fail allocation when we just created a new SROG.
+		 */
+		BUG_ON(!srogp);
+
+		new_srog = __srog_alloc(size, srog->gfp_mask);
+		list_add(&new_srog->list, &srog->list);
+		srog_iter = new_srog;
+		goto do_alloc;
+	}
+
+out:
+	return obj;
+}
+EXPORT_SYMBOL_GPL(srog_alloc);
+
+#define OBJ_SIZE_REF(o) (((unsigned int *)(o)) - 1)
+#define OBJ_SFA(o) ((struct srog_free_area *)OBJ_SIZE_REF(o))
+
+/**
+ *	srog_free - free an object
+ *	@srogp: pointer to an early SROG object (in the first page); may be NULL.
+ *	@obj: object to free.
+ */
+void srog_free(void *srogp, void *obj)
+{
+	struct srog_head *srog = PTR_TO_SROG(srogp);
+	struct srog_free_area *sfa, *prev, *new_sfa;
+
+	if (!srog)
+		srog = PTR_TO_SROG(obj);
+
+	BUG_ON(srog->magic != SROG_MAGIC);
+
+	/*
+	 * Walk the page list when the object does not belong to this page.
+	 */
+	if (((unsigned long)(obj - (void *)srog)) >
+			(PAGE_SIZE << srog->order)) {
+		struct srog_head *srog_iter;
+		list_for_each_entry(srog_iter, &srog->list, list) {
+			if (((unsigned long)(obj - (void *)srog_iter)) <=
+					(PAGE_SIZE << srog_iter->order)) {
+				srog = srog_iter;
+				break;
+			}
+		}
+	}
+
+	/*
+	 * It doesn't seem to belong here at all!?
+	 */
+	BUG_ON(((unsigned long)(obj - (void *)srog)) >
+			(PAGE_SIZE << srog->order));
+
+	/*
+	 * Insertion sort.
+	 */
+	for (prev = &srog->free, sfa = prev->next;
+			sfa; prev = sfa, sfa = sfa->next) {
+		if ((void *)sfa > obj)
+			break;
+	}
+
+	new_sfa = OBJ_SFA(obj);
+	new_sfa->next = sfa;
+	prev->next = new_sfa;
+
+	/*
+	 * Merge forward.
+	 */
+	if (INC_PTR(new_sfa, new_sfa->size) == sfa) {
+		new_sfa->size += sfa->size;
+		new_sfa->next = sfa->next;
+	}
+
+	/*
+	 * Merge backward.
+	 */
+	if (INC_PTR(prev, prev->size) == new_sfa) {
+		prev->size += new_sfa->size;
+		prev->next = new_sfa->next;
+	}
+
+	/*
+	 * If the page just became unused, free it.
+	 */
+	if (srog->free.next->size == SROG_SIZE(srog)) {
+		list_del(&srog->list);
+		free_pages((unsigned long)srog, srog->order);
+	}
+}
+EXPORT_SYMBOL_GPL(srog_free);
+
+/**
+ *	srog_link - link two separately allocated SROGs into one
+ *	@srogp1: pointer into one SROG
+ *	@srogp2: pointer into another SROG
+ */
+void srog_link(void *srogp1, void *srogp2)
+{
+	struct srog_head *srog1 = PTR_TO_SROG(srogp1);
+	struct srog_head *srog2 = PTR_TO_SROG(srogp2);
+
+	BUG_ON(srog1->magic != SROG_MAGIC);
+	BUG_ON(srog2->magic != SROG_MAGIC);
+
+	__list_splice(&srog2->list, &srog1->list);
+}
+EXPORT_SYMBOL_GPL(srog_link);


* [RFC][PATCH 3/4] deadlock prevention core
  2006-08-12 14:14 [RFC][PATCH 0/4] VM deadlock prevention -v4 Peter Zijlstra
  2006-08-12 14:14 ` [RFC][PATCH 1/4] pfn_to_kaddr() for UML Peter Zijlstra
  2006-08-12 14:14 ` [RFC][PATCH 2/4] SROG allocator Peter Zijlstra
@ 2006-08-12 14:14 ` Peter Zijlstra
  2006-08-12 14:41   ` Jeff Garzik
  2006-08-12 17:31   ` [RFC][PATCH 3/4] deadlock prevention core Indan Zupancic
  2006-08-12 14:14 ` [RFC][PATCH 4/4] deadlock prevention for NBD Peter Zijlstra
  2006-08-12 16:51 ` [RFC][PATCH 0/4] VM deadlock prevention -v4 Indan Zupancic
  4 siblings, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2006-08-12 14:14 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Indan Zupancic, Peter Zijlstra, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

The core of the VM deadlock avoidance framework.

From the 'user' side of things it provides a function to mark a 'struct sock'
as SOCK_MEMALLOC, meaning this socket may dip into the memalloc reserves on
the receive side.

When *dev_alloc_skb() finds it cannot allocate a struct sk_buff the regular
way, it will grab some memory from the memalloc reserve.

Network paths will drop !SOCK_MEMALLOC packets ASAP while the reserve is in use.

Memalloc sk_buff allocations are not done from the SLAB but are done using 
the new SROG allocator. sk_buff::memalloc records this exception so that 
kfree_skbmem()/skb_clone() and others can do the right thing.
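
For illustration only, a sketch of how the two sides fit together, using
the interfaces introduced below (the NBD patch does the socket side; the
protocol side mirrors the tcp_ipv4.c hunk):

	/* A critical socket, e.g. a network block device, opts in once: */
	if (sk_set_memalloc(sock->sk))
		printk(KERN_WARNING "could not mark socket SOCK_MEMALLOC\n");

	/* Protocol receive paths feed emergency skbs only to such sockets: */
	if (unlikely(skb->memalloc && !sk_is_memalloc(sk)))
		goto discard;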

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daniel Phillips <phillips@google.com>

---
 include/linux/gfp.h    |    3 -
 include/linux/mmzone.h |    1 
 include/linux/skbuff.h |    6 +-
 include/net/sock.h     |   40 +++++++++++++++
 mm/page_alloc.c        |   41 ++++++++++++++-
 net/core/skbuff.c      |  127 ++++++++++++++++++++++++++++++++++++++++++++-----
 net/core/sock.c        |   74 ++++++++++++++++++++++++++++
 net/ipv4/af_inet.c     |    3 +
 net/ipv4/icmp.c        |    3 +
 net/ipv4/tcp_ipv4.c    |    3 +
 net/ipv4/udp.c         |    8 ++-
 11 files changed, 290 insertions(+), 19 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h	2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/include/linux/gfp.h	2006-08-12 12:56:09.000000000 +0200
@@ -46,6 +46,7 @@ struct vm_area_struct;
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_MEMALLOC  ((__force gfp_t)0x40000u) /* Use emergency reserves */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +55,7 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-			__GFP_NOMEMALLOC|__GFP_HARDWALL)
+			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_MEMALLOC)
 
 /* This equals 0, but use constants in case they ever change */
 #define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/include/linux/mmzone.h	2006-08-12 12:56:09.000000000 +0200
@@ -420,6 +420,7 @@ int percpu_pagelist_fraction_sysctl_hand
 					void __user *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
+int adjust_memalloc_reserve(int bytes);
 
 #include <linux/topology.h>
 /* Returns the number of the current Node. */
Index: linux-2.6/include/linux/skbuff.h
===================================================================
--- linux-2.6.orig/include/linux/skbuff.h	2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/include/linux/skbuff.h	2006-08-12 15:25:33.000000000 +0200
@@ -282,7 +282,8 @@ struct sk_buff {
 				nfctinfo:3;
 	__u8			pkt_type:3,
 				fclone:2,
-				ipvs_property:1;
+				ipvs_property:1,
+				memalloc:1;
 	__be16			protocol;
 
 	void			(*destructor)(struct sk_buff *skb);
@@ -1086,7 +1087,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
 					      gfp_t gfp_mask)
 {
-	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD,
+			gfp_mask | __GFP_MEMALLOC);
 	if (likely(skb))
 		skb_reserve(skb, NET_SKB_PAD);
 	return skb;
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h	2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/include/net/sock.h	2006-08-12 12:56:38.000000000 +0200
@@ -391,6 +391,7 @@ enum sock_flags {
 	SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_MEMALLOC, /* protocol can use memalloc reserve */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -413,6 +414,45 @@ static inline int sock_flag(struct sock 
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int sk_is_memalloc(struct sock *sk)
+{
+	return sock_flag(sk, SOCK_MEMALLOC);
+}
+
+/*
+ * Is this high enough, or do we want it to depend on the number of
+ * online devices and online CPUs?
+ *
+ *  #define MAX_CONCURRENT_SKBS (64*nr_devices*num_online_cpus())
+ */
+#define MAX_CONCURRENT_SKBS	128
+
+/*
+ * Used to count skb payloads.
+ *
+ * The assumption is that the sk_buffs themselves are small enough to fit
+ * in the remaining page space.
+ */
+extern atomic_t memalloc_skbs_used;
+
+static inline int memalloc_skbs_try_inc(void)
+{
+	return atomic_add_unless(&memalloc_skbs_used, 1, MAX_CONCURRENT_SKBS);
+}
+
+static inline void memalloc_skbs_dec(void)
+{
+	atomic_dec(&memalloc_skbs_used);
+}
+
+static inline int memalloc_skbs_available(void)
+{
+	return atomic_read(&memalloc_skbs_used) < MAX_CONCURRENT_SKBS;
+}
+
+extern int sk_adjust_memalloc(int);
+extern int sk_set_memalloc(struct sock *sk);
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	sk->sk_ack_backlog--;
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/mm/page_alloc.c	2006-08-12 12:56:09.000000000 +0200
@@ -82,6 +82,7 @@ EXPORT_SYMBOL(zone_table);
 
 static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
 int min_free_kbytes = 1024;
+int var_free_kbytes;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
@@ -970,8 +971,8 @@ restart:
 
 	/* This allocation should allow future memory freeing. */
 
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
+	if ((((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+			&& !in_interrupt()) || (gfp_mask & __GFP_MEMALLOC)) {
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */
@@ -2196,7 +2197,8 @@ static void setup_per_zone_lowmem_reserv
  */
 void setup_per_zone_pages_min(void)
 {
-	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_min = (min_free_kbytes + var_free_kbytes)
+		>> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
 	unsigned long flags;
@@ -2248,6 +2250,39 @@ void setup_per_zone_pages_min(void)
 	calculate_totalreserve_pages();
 }
 
+/**
+ *	adjust_memalloc_reserve - adjust the memalloc reserve
+ *	@pages: number of pages to add
+ *
+ *	It adds a number of pages to the memalloc reserve; if
+ *	the number is positive it kicks kswapd into action to
+ *	satisfy the higher watermarks.
+ */
+int adjust_memalloc_reserve(int pages)
+{
+	int kbytes;
+	int err = 0;
+
+	kbytes = var_free_kbytes + (pages << (PAGE_SHIFT - 10));
+	if (kbytes < 0) {
+		err = -EINVAL;
+		goto out;
+	}
+	var_free_kbytes = kbytes;
+	setup_per_zone_pages_min();
+	if (pages > 0) {
+		struct zone *zone;
+		for_each_zone(zone)
+			wakeup_kswapd(zone, 0);
+	}
+	printk(KERN_DEBUG "RX reserve: %d\n", var_free_kbytes);
+
+out:
+	return err;
+}
+
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
 /*
  * Initialise min_free_kbytes.
  *
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c	2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/net/core/skbuff.c	2006-08-12 15:28:15.000000000 +0200
@@ -43,6 +43,7 @@
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
+#include <linux/srog.h>
 #include <linux/interrupt.h>
 #include <linux/in.h>
 #include <linux/inet.h>
@@ -126,7 +127,7 @@ EXPORT_SYMBOL(skb_truesize_bug);
  */
 
 /**
- *	__alloc_skb	-	allocate a network buffer
+ *	___alloc_skb	-	allocate a network buffer
  *	@size: size to allocate
  *	@gfp_mask: allocation mask
  *	@fclone: allocate from fclone cache instead of head cache
@@ -139,14 +140,45 @@ EXPORT_SYMBOL(skb_truesize_bug);
  *	Buffers may only be allocated from interrupts using a @gfp_mask of
  *	%GFP_ATOMIC.
  */
-struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-			    int fclone)
+static
+struct sk_buff *___alloc_skb(unsigned int size, gfp_t gfp_mask, int fclone)
 {
 	kmem_cache_t *cache;
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
 
+	size = SKB_DATA_ALIGN(size);
+
+	if (gfp_mask & __GFP_MEMALLOC) {
+		cache = NULL;
+		skb = NULL;
+		if (!memalloc_skbs_try_inc())
+			goto out;
+
+		/*
+		 * Allocate the data section first because we know the first
+		 * SROG alloc is a valid SROG entry point and skb->head is
+		 * shared between clones. This saves us from tracking SROGs.
+		 */
+		data = srog_alloc(NULL, size + sizeof(struct skb_shared_info),
+				gfp_mask);
+		if (!data)
+			goto dec_out;
+
+		skb = srog_alloc(data, fclone
+				? 2*sizeof(struct sk_buff) + sizeof(atomic_t)
+				: sizeof(struct sk_buff), gfp_mask);
+		if (!skb) {
+			srog_free(NULL, data);
+dec_out:
+			memalloc_skbs_dec();
+			goto out;
+		}
+
+		goto allocated;
+	}
+
 	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
 
 	/* Get the HEAD */
@@ -155,12 +187,13 @@ struct sk_buff *__alloc_skb(unsigned int
 		goto out;
 
 	/* Get the DATA. Size must match skb_add_mtu(). */
-	size = SKB_DATA_ALIGN(size);
 	data = ____kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;
 
+allocated:
 	memset(skb, 0, offsetof(struct sk_buff, truesize));
+	skb->memalloc = !cache;
 	skb->truesize = size + sizeof(struct sk_buff);
 	atomic_set(&skb->users, 1);
 	skb->head = data;
@@ -185,6 +218,7 @@ struct sk_buff *__alloc_skb(unsigned int
 		atomic_set(fclone_ref, 1);
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
+		child->memalloc = skb->memalloc;
 	}
 out:
 	return skb;
@@ -194,6 +228,18 @@ nodata:
 	goto out;
 }
 
+struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int fclone)
+{
+	struct sk_buff *skb;
+
+	skb = ___alloc_skb(size, gfp_mask & ~__GFP_MEMALLOC, fclone);
+
+	if (!skb && (gfp_mask & __GFP_MEMALLOC) && memalloc_skbs_available())
+		skb = ___alloc_skb(size, gfp_mask, fclone);
+
+	return skb;
+}
+
 /**
  *	alloc_skb_from_cache	-	allocate a network buffer
  *	@cp: kmem_cache from which to allocate the data area
@@ -267,7 +313,7 @@ struct sk_buff *__netdev_alloc_skb(struc
 {
 	struct sk_buff *skb;
 
-	skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	skb = alloc_skb(length + NET_SKB_PAD, gfp_mask | __GFP_MEMALLOC);
 	if (likely(skb))
 		skb_reserve(skb, NET_SKB_PAD);
 	return skb;
@@ -299,7 +345,7 @@ static void skb_clone_fraglist(struct sk
 		skb_get(list);
 }
 
-static void skb_release_data(struct sk_buff *skb)
+static int skb_release_data(struct sk_buff *skb)
 {
 	if (!skb->cloned ||
 	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
@@ -313,8 +359,34 @@ static void skb_release_data(struct sk_b
 		if (skb_shinfo(skb)->frag_list)
 			skb_drop_fraglist(skb);
 
-		kfree(skb->head);
+		return 1;
 	}
+	return 0;
+}
+
+static void memalloc_free_skb(struct kmem_cache *cache, void *objp, int free)
+{
+	/*
+	 * This complication is necessary because we need a valid SROG
+	 * pointer. If we let skb_release_data() free the skb data, we
+	 * lose the only valid SROG entry point we know about.
+	 */
+	struct sk_buff *skb = objp;
+	u8 *data = skb->head;
+	srog_free(data, objp);
+	if (free) {
+		srog_free(NULL, data);
+		memalloc_skbs_dec();
+	}
+}
+
+static void kmem_cache_free_skb(struct kmem_cache *cache, void *objp, int free)
+{
+	struct sk_buff *skb = objp;
+	u8 *data = skb->head;
+	kmem_cache_free(cache, objp);
+	if (free)
+		kfree(data);
 }
 
 /*
@@ -324,17 +396,22 @@ void kfree_skbmem(struct sk_buff *skb)
 {
 	struct sk_buff *other;
 	atomic_t *fclone_ref;
+	void (*free_skb)(struct kmem_cache *, void *, int);
+	int free;
+
+	free = skb_release_data(skb);
+
+	free_skb = skb->memalloc ? memalloc_free_skb : kmem_cache_free_skb;
 
-	skb_release_data(skb);
 	switch (skb->fclone) {
 	case SKB_FCLONE_UNAVAILABLE:
-		kmem_cache_free(skbuff_head_cache, skb);
+		free_skb(skbuff_head_cache, skb, free);
 		break;
 
 	case SKB_FCLONE_ORIG:
 		fclone_ref = (atomic_t *) (skb + 2);
 		if (atomic_dec_and_test(fclone_ref))
-			kmem_cache_free(skbuff_fclone_cache, skb);
+			free_skb(skbuff_fclone_cache, skb, free);
 		break;
 
 	case SKB_FCLONE_CLONE:
@@ -347,7 +424,7 @@ void kfree_skbmem(struct sk_buff *skb)
 		skb->fclone = SKB_FCLONE_UNAVAILABLE;
 
 		if (atomic_dec_and_test(fclone_ref))
-			kmem_cache_free(skbuff_fclone_cache, other);
+			free_skb(skbuff_fclone_cache, other, free);
 		break;
 	};
 }
@@ -433,6 +510,11 @@ struct sk_buff *skb_clone(struct sk_buff
 		atomic_t *fclone_ref = (atomic_t *) (n + 1);
 		n->fclone = SKB_FCLONE_CLONE;
 		atomic_inc(fclone_ref);
+	} else if (skb->memalloc) {
+		n = srog_alloc(skb->head, sizeof(struct sk_buff), gfp_mask);
+		if (!n)
+			return NULL;
+		n->fclone = SKB_FCLONE_UNAVAILABLE;
 	} else {
 		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
 		if (!n)
@@ -468,6 +550,7 @@ struct sk_buff *skb_clone(struct sk_buff
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	C(ipvs_property);
 #endif
+	C(memalloc);
 	C(protocol);
 	n->destructor = NULL;
 #ifdef CONFIG_NETFILTER
@@ -688,7 +771,27 @@ int pskb_expand_head(struct sk_buff *skb
 
 	size = SKB_DATA_ALIGN(size);
 
-	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+	if (skb->memalloc) {
+		if (!memalloc_skbs_try_inc())
+			goto nodata;
+		/*
+		 * Unfortunately we have to assume skb->head is in the first
+		 * page of the SROG, hence we cannot reuse the old one.
+		 */
+		data = srog_alloc(NULL,
+				size + sizeof(struct skb_shared_info),
+				gfp_mask | __GFP_MEMALLOC);
+		if (!data) {
+			memalloc_skbs_dec();
+			goto nodata;
+		}
+		/*
+		 * But they must end up in the same SROG otherwise we cannot
+		 * reliably free clones.
+		 */
+		srog_link(skb->head, data);
+	} else
+		data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;
 
Index: linux-2.6/net/ipv4/icmp.c
===================================================================
--- linux-2.6.orig/net/ipv4/icmp.c	2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/net/ipv4/icmp.c	2006-08-12 12:56:09.000000000 +0200
@@ -938,6 +938,9 @@ int icmp_rcv(struct sk_buff *skb)
 			goto error;
 	}
 
+	if (unlikely(skb->memalloc))
+		goto drop;
+
 	if (!pskb_pull(skb, sizeof(struct icmphdr)))
 		goto error;
 
Index: linux-2.6/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp_ipv4.c	2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/net/ipv4/tcp_ipv4.c	2006-08-12 12:56:09.000000000 +0200
@@ -1093,6 +1093,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	if (!sk)
 		goto no_tcp_socket;
 
+	if (unlikely(skb->memalloc && !sk_is_memalloc(sk)))
+		goto discard_and_relse;
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;
Index: linux-2.6/net/ipv4/udp.c
===================================================================
--- linux-2.6.orig/net/ipv4/udp.c	2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/net/ipv4/udp.c	2006-08-12 12:56:09.000000000 +0200
@@ -1136,7 +1136,12 @@ int udp_rcv(struct sk_buff *skb)
 	sk = udp_v4_lookup(saddr, uh->source, daddr, uh->dest, skb->dev->ifindex);
 
 	if (sk != NULL) {
-		int ret = udp_queue_rcv_skb(sk, skb);
+		int ret;
+
+		if (unlikely(skb->memalloc && !sk_is_memalloc(sk)))
+			goto drop_noncritical;
+
+		ret = udp_queue_rcv_skb(sk, skb);
 		sock_put(sk);
 
 		/* a return value > 0 means to resubmit the input, but
@@ -1147,6 +1152,7 @@ int udp_rcv(struct sk_buff *skb)
 		return 0;
 	}
 
+drop_noncritical:
 	if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
 		goto drop;
 	nf_reset(skb);
Index: linux-2.6/net/ipv4/af_inet.c
===================================================================
--- linux-2.6.orig/net/ipv4/af_inet.c	2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/net/ipv4/af_inet.c	2006-08-12 12:56:09.000000000 +0200
@@ -132,6 +132,9 @@ void inet_sock_destruct(struct sock *sk)
 {
 	struct inet_sock *inet = inet_sk(sk);
 
+	if (sk_is_memalloc(sk))
+		sk_adjust_memalloc(-1);
+
 	__skb_queue_purge(&sk->sk_receive_queue);
 	__skb_queue_purge(&sk->sk_error_queue);
 
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c	2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/net/core/sock.c	2006-08-12 13:02:59.000000000 +0200
@@ -111,6 +111,8 @@
 #include <linux/poll.h>
 #include <linux/tcp.h>
 #include <linux/init.h>
+#include <linux/inetdevice.h>
+#include <linux/blkdev.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -195,6 +197,78 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);
 
+static DEFINE_SPINLOCK(memalloc_lock);
+static int memalloc_socks;
+static unsigned long memalloc_reserve;
+
+atomic_t memalloc_skbs_used;
+EXPORT_SYMBOL_GPL(memalloc_skbs_used);
+
+/**
+ *	sk_adjust_memalloc - adjust the global memalloc reserve
+ *	@nr_socks: number of new %SOCK_MEMALLOC sockets
+ *
+ *	This function adjusts the memalloc reserve based on socket
+ *	demand. For each %SOCK_MEMALLOC socket it reserves
+ *	2 * %MAX_PHYS_SEGMENTS pages for outbound traffic (assumption:
+ *	each %SOCK_MEMALLOC socket will have a %request_queue associated),
+ *	plus a global 5 * %MAX_CONCURRENT_SKBS pages for inbound
+ *	traffic.
+ *
+ *	2 * %MAX_PHYS_SEGMENTS - the request queue can hold up to 150%;
+ *		the remaining 50% goes to being sure we can write packets
+ *		for the outgoing pages.
+ *
+ *	5 * %MAX_CONCURRENT_SKBS - for each skb 4 pages for high order
+ *	        jumbo frame allocs, and 1 for good measure.
+ */
+int sk_adjust_memalloc(int nr_socks)
+{
+	unsigned long flags;
+	unsigned long reserve;
+	int err;
+
+	spin_lock_irqsave(&memalloc_lock, flags);
+
+	memalloc_socks += nr_socks;
+	BUG_ON(memalloc_socks < 0);
+
+	reserve = memalloc_socks * 2 * MAX_PHYS_SEGMENTS + /* outbound */
+		  MAX_CONCURRENT_SKBS * 5;		   /* inbound */
+
+	err = adjust_memalloc_reserve(reserve - memalloc_reserve);
+	if (err) {
+		printk(KERN_WARNING
+			"Unable to change RX reserve to: %lu, error: %d\n",
+			reserve, err);
+		goto unlock;
+	}
+	memalloc_reserve = reserve;
+
+unlock:
+	spin_unlock_irqrestore(&memalloc_lock, flags);
+	return err;
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
+
+/**
+ *	sk_set_memalloc - sets %SOCK_MEMALLOC
+ *	@sk: socket to set it on
+ *
+ *	Set %SOCK_MEMALLOC on a socket and increase the memalloc reserve
+ *	accordingly.
+ */
+int sk_set_memalloc(struct sock *sk)
+{
+	int err = 0;
+
+	if (!(err = sk_adjust_memalloc(1)))
+		sock_set_flag(sk, SOCK_MEMALLOC);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(sk_set_memalloc);
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;


* [RFC][PATCH 4/4] deadlock prevention for NBD
  2006-08-12 14:14 [RFC][PATCH 0/4] VM deadlock prevention -v4 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2006-08-12 14:14 ` [RFC][PATCH 3/4] deadlock prevention core Peter Zijlstra
@ 2006-08-12 14:14 ` Peter Zijlstra
  2006-08-24 14:43   ` Pavel Machek
  2006-08-12 16:51 ` [RFC][PATCH 0/4] VM deadlock prevention -v4 Indan Zupancic
  4 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2006-08-12 14:14 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Indan Zupancic, Peter Zijlstra, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller


Use sk_set_memalloc() on the nbd socket.

Limit each request to 1 page, so that the request throttling also limits the
number of in-flight pages, and force the IO scheduler to NOOP, as anything
else doesn't make sense anyway.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daniel Phillips <phillips@google.com>

---
 block/elevator.c       |    5 +++++
 block/ll_rw_blk.c      |   12 ++++++++++--
 drivers/block/nbd.c    |   12 +++++++++++-
 include/linux/blkdev.h |    9 +++++++++
 4 files changed, 35 insertions(+), 3 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c	2006-08-12 15:38:01.000000000 +0200
+++ linux-2.6/block/ll_rw_blk.c	2006-08-12 15:38:11.000000000 +0200
@@ -1899,6 +1899,14 @@ EXPORT_SYMBOL(blk_init_queue);
 request_queue_t *
 blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 {
+	return blk_init_queue_node_elv(rfn, lock, node_id, NULL);
+}
+EXPORT_SYMBOL(blk_init_queue_node);
+
+request_queue_t *
+blk_init_queue_node_elv(request_fn_proc *rfn, spinlock_t *lock, int node_id,
+		char *elv_name)
+{
 	request_queue_t *q = blk_alloc_queue_node(GFP_KERNEL, node_id);
 
 	if (!q)
@@ -1939,7 +1947,7 @@ blk_init_queue_node(request_fn_proc *rfn
 	/*
 	 * all done
 	 */
-	if (!elevator_init(q, NULL)) {
+	if (!elevator_init(q, elv_name)) {
 		blk_queue_congestion_threshold(q);
 		return q;
 	}
@@ -1947,7 +1955,7 @@ blk_init_queue_node(request_fn_proc *rfn
 	blk_put_queue(q);
 	return NULL;
 }
-EXPORT_SYMBOL(blk_init_queue_node);
+EXPORT_SYMBOL(blk_init_queue_node_elv);
 
 int blk_get_queue(request_queue_t *q)
 {
Index: linux-2.6/drivers/block/nbd.c
===================================================================
--- linux-2.6.orig/drivers/block/nbd.c	2006-08-12 15:38:01.000000000 +0200
+++ linux-2.6/drivers/block/nbd.c	2006-08-12 15:50:33.000000000 +0200
@@ -361,8 +361,13 @@ static void nbd_do_it(struct nbd_device 
 
 	BUG_ON(lo->magic != LO_MAGIC);
 
+	if (sk_set_memalloc(lo->sock->sk))
+		printk(KERN_WARNING
+				"failed to set SOCK_MEMALLOC on NBD socket\n");
+
 	while ((req = nbd_read_stat(lo)) != NULL)
 		nbd_end_request(req);
+
 	return;
 }
 
@@ -628,11 +633,16 @@ static int __init nbd_init(void)
 		 * every gendisk to have its very own request_queue struct.
 		 * These structs are big so we dynamically allocate them.
 		 */
-		disk->queue = blk_init_queue(do_nbd_request, &nbd_lock);
+		disk->queue = blk_init_queue_node_elv(do_nbd_request,
+				&nbd_lock, -1, "noop");
 		if (!disk->queue) {
 			put_disk(disk);
 			goto out;
 		}
+		blk_queue_pin_elevator(disk->queue);
+		blk_queue_max_segment_size(disk->queue, PAGE_SIZE);
+		blk_queue_max_hw_segments(disk->queue, 1);
+		blk_queue_max_phys_segments(disk->queue, 1);
 	}
 
 	if (register_blkdev(NBD_MAJOR, "nbd")) {
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2006-08-12 15:38:01.000000000 +0200
+++ linux-2.6/include/linux/blkdev.h	2006-08-12 15:38:11.000000000 +0200
@@ -444,6 +444,12 @@ struct request_queue
 #define QUEUE_FLAG_REENTER	6	/* Re-entrancy avoidance */
 #define QUEUE_FLAG_PLUGGED	7	/* queue is plugged */
 #define QUEUE_FLAG_ELVSWITCH	8	/* don't use elevator, just do FIFO */
+#define QUEUE_FLAG_ELVPINNED	9	/* pin the current elevator */
+
+static inline void blk_queue_pin_elevator(struct request_queue *q)
+{
+	set_bit(QUEUE_FLAG_ELVPINNED, &q->queue_flags);
+}
 
 enum {
 	/*
@@ -696,6 +702,9 @@ static inline void elv_dispatch_add_tail
 /*
  * Access functions for manipulating queue properties
  */
+extern request_queue_t *blk_init_queue_node_elv(request_fn_proc *rfn,
+					spinlock_t *lock, int node_id,
+					char *elv_name);
 extern request_queue_t *blk_init_queue_node(request_fn_proc *rfn,
 					spinlock_t *lock, int node_id);
 extern request_queue_t *blk_init_queue(request_fn_proc *, spinlock_t *);
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c	2006-08-12 15:38:01.000000000 +0200
+++ linux-2.6/block/elevator.c	2006-08-12 15:38:11.000000000 +0200
@@ -861,6 +861,11 @@ ssize_t elv_iosched_store(request_queue_
 	size_t len;
 	struct elevator_type *e;
 
+	if (test_bit(QUEUE_FLAG_ELVPINNED, &q->queue_flags)) {
+		printk(KERN_ERR "elevator: cannot switch elevator, pinned\n");
+		return count;
+	}
+
 	elevator_name[sizeof(elevator_name) - 1] = '\0';
 	strncpy(elevator_name, name, sizeof(elevator_name) - 1);
 	len = strlen(elevator_name);


* Re: [RFC][PATCH 3/4] deadlock prevention core
  2006-08-12 14:14 ` [RFC][PATCH 3/4] deadlock prevention core Peter Zijlstra
@ 2006-08-12 14:41   ` Jeff Garzik
  2006-08-12 15:06     ` rename *MEMALLOC flags (was: Re: [RFC][PATCH 3/4] deadlock prevention core) Peter Zijlstra
  2006-08-12 17:31   ` [RFC][PATCH 3/4] deadlock prevention core Indan Zupancic
  1 sibling, 1 reply; 30+ messages in thread
From: Jeff Garzik @ 2006-08-12 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, netdev, Indan Zupancic, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

Peter Zijlstra wrote:
> Index: linux-2.6/include/linux/gfp.h
> ===================================================================
> --- linux-2.6.orig/include/linux/gfp.h	2006-08-12 12:56:06.000000000 +0200
> +++ linux-2.6/include/linux/gfp.h	2006-08-12 12:56:09.000000000 +0200
> @@ -46,6 +46,7 @@ struct vm_area_struct;
>  #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
>  #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
>  #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
> +#define __GFP_MEMALLOC  ((__force gfp_t)0x40000u) /* Use emergency reserves */

This symbol name has nothing to do with its purpose.  The entire area of 
code you are modifying could be described as having something to do with 
'memalloc'.

GFP_EMERGENCY or GFP_USE_RESERVES or somesuch would be a far better 
symbol name.

I recognize that it matches GFP_NOMEMALLOC, but that doesn't change
the situation anyway.  In fact, a cleanup patch to rename GFP_NOMEMALLOC
would be nice.

	Jeff



* rename *MEMALLOC flags (was: Re: [RFC][PATCH 3/4] deadlock prevention core)
  2006-08-12 14:41   ` Jeff Garzik
@ 2006-08-12 15:06     ` Peter Zijlstra
  2006-08-12 15:28       ` Indan Zupancic
  2006-08-14  0:06       ` rename *MEMALLOC flags Daniel Phillips
  0 siblings, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2006-08-12 15:06 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: linux-mm, linux-kernel, netdev, Indan Zupancic, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

On Sat, 2006-08-12 at 10:41 -0400, Jeff Garzik wrote:
> Peter Zijlstra wrote:
> > Index: linux-2.6/include/linux/gfp.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/gfp.h	2006-08-12 12:56:06.000000000 +0200
> > +++ linux-2.6/include/linux/gfp.h	2006-08-12 12:56:09.000000000 +0200
> > @@ -46,6 +46,7 @@ struct vm_area_struct;
> >  #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
> >  #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
> >  #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
> > +#define __GFP_MEMALLOC  ((__force gfp_t)0x40000u) /* Use emergency reserves */
> 
> This symbol name has nothing to do with its purpose.  The entire area of 
> code you are modifying could be described as having something to do with 
> 'memalloc'.
> 
> GFP_EMERGENCY or GFP_USE_RESERVES or somesuch would be a far better 
> symbol name.
> 
> I recognize that is matches with GFP_NOMEMALLOC, but that doesn't change 
> the situation anyway.  In fact, a cleanup patch to rename GFP_NOMEMALLOC 
> would be nice.

I'm rather bad at picking names, but here goes:

PF_MEMALLOC      -> PF_EMERGALLOC
__GFP_NOMEMALLOC -> __GFP_NOEMERGALLOC
__GFP_MEMALLOC   -> __GFP_EMERGALLOC

Is that suitable and shall I prepare patches? Or do we want more ppl to
chime in and have a few more rounds?


* Re: rename *MEMALLOC flags (was: Re: [RFC][PATCH 3/4] deadlock prevention core)
  2006-08-12 15:06     ` rename *MEMALLOC flags (was: Re: [RFC][PATCH 3/4] deadlock prevention core) Peter Zijlstra
@ 2006-08-12 15:28       ` Indan Zupancic
  2006-08-12 15:34         ` Peter Zijlstra
  2006-08-14  0:06       ` rename *MEMALLOC flags Daniel Phillips
  1 sibling, 1 reply; 30+ messages in thread
From: Indan Zupancic @ 2006-08-12 15:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeff Garzik, linux-mm, linux-kernel, netdev, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

On Sat, August 12, 2006 17:06, Peter Zijlstra said:
> On Sat, 2006-08-12 at 10:41 -0400, Jeff Garzik wrote:
>> Peter Zijlstra wrote:
>> > Index: linux-2.6/include/linux/gfp.h
>> > ===================================================================
>> > --- linux-2.6.orig/include/linux/gfp.h	2006-08-12 12:56:06.000000000 +0200
>> > +++ linux-2.6/include/linux/gfp.h	2006-08-12 12:56:09.000000000 +0200
>> > @@ -46,6 +46,7 @@ struct vm_area_struct;
>> >  #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
>> >  #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
>> >  #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
>> > +#define __GFP_MEMALLOC  ((__force gfp_t)0x40000u) /* Use emergency reserves */
>>
>> This symbol name has nothing to do with its purpose.  The entire area of
>> code you are modifying could be described as having something to do with
>> 'memalloc'.
>>
>> GFP_EMERGENCY or GFP_USE_RESERVES or somesuch would be a far better
>> symbol name.
>>
>> I recognize that is matches with GFP_NOMEMALLOC, but that doesn't change
>> the situation anyway.  In fact, a cleanup patch to rename GFP_NOMEMALLOC
>> would be nice.
>
> I'm rather bad at picking names, but here goes:
>
> PF_MEMALLOC      -> PF_EMERGALLOC
> __GFP_NOMEMALLOC -> __GFP_NOEMERGALLOC
> __GFP_MEMALLOC   -> __GFP_EMERGALLOC
>
> Is that suitable and shall I prepare patches? Or do we want more ppl to
> chime in and have a few more rounds?

Pardon my ignorance, but if we're doing cleanup anyway, why not use only one flag instead of two?
Why is __GFP_NOMEMALLOC needed when not setting __GFP_MEMALLOC could mean the same? Or else what
is the expected behaviour if both flags are set?



* Re: rename *MEMALLOC flags (was: Re: [RFC][PATCH 3/4] deadlock prevention core)
  2006-08-12 15:28       ` Indan Zupancic
@ 2006-08-12 15:34         ` Peter Zijlstra
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2006-08-12 15:34 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Jeff Garzik, linux-mm, linux-kernel, netdev, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

On Sat, 2006-08-12 at 17:28 +0200, Indan Zupancic wrote:
> On Sat, August 12, 2006 17:06, Peter Zijlstra said:
> > On Sat, 2006-08-12 at 10:41 -0400, Jeff Garzik wrote:
> >> Peter Zijlstra wrote:
> >> > Index: linux-2.6/include/linux/gfp.h
> >> > ===================================================================
> >> > --- linux-2.6.orig/include/linux/gfp.h	2006-08-12 12:56:06.000000000 +0200
> >> > +++ linux-2.6/include/linux/gfp.h	2006-08-12 12:56:09.000000000 +0200
> >> > @@ -46,6 +46,7 @@ struct vm_area_struct;
> >> >  #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
> >> >  #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
> >> >  #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
> >> > +#define __GFP_MEMALLOC  ((__force gfp_t)0x40000u) /* Use emergency reserves */
> >>
> >> This symbol name has nothing to do with its purpose.  The entire area of
> >> code you are modifying could be described as having something to do with
> >> 'memalloc'.
> >>
> >> GFP_EMERGENCY or GFP_USE_RESERVES or somesuch would be a far better
> >> symbol name.
> >>
> >> I recognize that is matches with GFP_NOMEMALLOC, but that doesn't change
> >> the situation anyway.  In fact, a cleanup patch to rename GFP_NOMEMALLOC
> >> would be nice.
> >
> > I'm rather bad at picking names, but here goes:
> >
> > PF_MEMALLOC      -> PF_EMERGALLOC
> > __GFP_NOMEMALLOC -> __GFP_NOEMERGALLOC
> > __GFP_MEMALLOC   -> __GFP_EMERGALLOC
    SOCK_MEMALLOC    -> SOCK_EMERGALLOC
> >
> > Is that suitable and shall I prepare patches? Or do we want more ppl to
> > chime in and have a few more rounds?
> 
> Pardon my ignorance, but if we're doing cleanup anyway, why not use only one flag instead of two?
> Why is __GFP_NOMEMALLOC needed when not setting __GFP_MEMALLOC could mean the same? Or else what
> is the expected behaviour if both flags are set?

__GFP_NOMEMALLOC is the most authoritative; its use is (afaik) to negate
PF_MEMALLOC.

I agree that having both seems odd, but I haven't spent any significant
time on trying to find a 'nicer' solution.
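
For illustration, a sketch of the resulting precedence, paraphrasing the
page_alloc.c hunk from patch 3/4 (a hypothetical helper, not actual
kernel code; the TIF_MEMDIE case is elided):

	static int can_use_reserves(gfp_t gfp_mask)
	{
		/* __GFP_NOMEMALLOC is authoritative: never touch reserves. */
		if (gfp_mask & __GFP_NOMEMALLOC)
			return 0;
		/* __GFP_MEMALLOC explicitly asks for the reserves. */
		if (gfp_mask & __GFP_MEMALLOC)
			return 1;
		/* Otherwise only memory-freeing process context may use them. */
		return (current->flags & PF_MEMALLOC) && !in_interrupt();
	}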


* Re: [RFC][PATCH 0/4] VM deadlock prevention -v4
  2006-08-12 14:14 [RFC][PATCH 0/4] VM deadlock prevention -v4 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2006-08-12 14:14 ` [RFC][PATCH 4/4] deadlock prevention for NBD Peter Zijlstra
@ 2006-08-12 16:51 ` Indan Zupancic
  2006-08-12 17:33   ` Peter Zijlstra
  4 siblings, 1 reply; 30+ messages in thread
From: Indan Zupancic @ 2006-08-12 16:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, netdev, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

On Sat, August 12, 2006 16:14, Peter Zijlstra said:
> Hi,
>
> here the latest effort, it includes a whole new trivial allocator with a
> horrid name and an almost full rewrite of the deadlock prevention core.
> This version does not do anything per device and hence does not depend
> on the new netdev_alloc_skb() API.
>
> The reason to add a second allocator to the receive side is twofold:
> 1) it allows easy detection of the memory pressure / OOM situation;
> 2) it allows the receive path to be unbounded and go at full speed when
>    resources permit.
>
> The choice of using the global memalloc reserve as a mempool makes that
> the new allocator has to release pages as soon as possible; if we were
> to hoard pages in the allocator the memalloc reserve would not get
> replenished readily.

Version 2 had about 250 new lines of code, while v3 has close to 600 when
including the SROG code, even though things should have become simpler.
So why use SROG instead of the old alloc_pages() based code? And why couldn't
you use a slightly modified SLOB instead of writing a new allocator?
It looks like overkill to me.

Greetings,

Indan



* Re: [RFC][PATCH 3/4] deadlock prevention core
  2006-08-12 14:14 ` [RFC][PATCH 3/4] deadlock prevention core Peter Zijlstra
  2006-08-12 14:41   ` Jeff Garzik
@ 2006-08-12 17:31   ` Indan Zupancic
  2006-08-12 17:44     ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Indan Zupancic @ 2006-08-12 17:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, netdev, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

On Sat, August 12, 2006 16:14, Peter Zijlstra said:
> +struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int fclone)
> +{
> +	struct sk_buff *skb;
> +
> +	skb = ___alloc_skb(size, gfp_mask & ~__GFP_MEMALLOC, fclone);
> +
> +	if (!skb && (gfp_mask & __GFP_MEMALLOC) && memalloc_skbs_available())
> +		skb = ___alloc_skb(size, gfp_mask, fclone);
> +
> +	return skb;
> +}
> +

I'd drop the memalloc_skbs_available() check, as that's already done by
___alloc_skb.
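
For illustration, the simplified version this suggestion would give
(assuming, as said above, that ___alloc_skb() already enforces the skb
budget via memalloc_skbs_try_inc()):

	struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int fclone)
	{
		struct sk_buff *skb;

		/* First try without touching the emergency reserves. */
		skb = ___alloc_skb(size, gfp_mask & ~__GFP_MEMALLOC, fclone);

		/* Fall back to the reserves; the budget check happens inside. */
		if (!skb && (gfp_mask & __GFP_MEMALLOC))
			skb = ___alloc_skb(size, gfp_mask, fclone);

		return skb;
	}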

> +static DEFINE_SPINLOCK(memalloc_lock);
> +static int memalloc_socks;
> +static unsigned long memalloc_reserve;

Why is this a long? adjust_memalloc_reserve() takes an int.
Is it needed at all, considering var_free_kbytes already exists?

Greetings,

Indan



* Re: [RFC][PATCH 0/4] VM deadlock prevention -v4
  2006-08-12 16:51 ` [RFC][PATCH 0/4] VM deadlock prevention -v4 Indan Zupancic
@ 2006-08-12 17:33   ` Peter Zijlstra
  2006-08-12 18:16     ` Indan Zupancic
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2006-08-12 17:33 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: linux-mm, linux-kernel, netdev, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

On Sat, 2006-08-12 at 18:51 +0200, Indan Zupancic wrote:
> On Sat, August 12, 2006 16:14, Peter Zijlstra said:
> > Hi,
> >
> > here the latest effort, it includes a whole new trivial allocator with a
> > horrid name and an almost full rewrite of the deadlock prevention core.
> > This version does not do anything per device and hence does not depend
> > on the new netdev_alloc_skb() API.
> >
> > The reason to add a second allocator to the receive side is twofold:
> > 1) it allows easy detection of the memory pressure / OOM situation;
> > 2) it allows the receive path to be unbounded and go at full speed when
> >    resources permit.
> >
> > The choice of using the global memalloc reserve as a mempool makes that
> > the new allocator has to release pages as soon as possible; if we were
> > to hoard pages in the allocator the memalloc reserve would not get
> > replenished readily.
> 
> Version 2 had about 250 new lines of code, while v3 has close to 600, when
> including the SROG code. And that while things should have become simpler.
> So why use SROG instead of the old alloc_pages() based code? And why couldn't
> you use a slightly modified SLOB instead of writing a new allocator?
> It looks like overkill to me.

Of the 611 new lines about 150 are new comments.

Simpler yes, but also more complete; the old patches had serious issues
with the alternative allocation scheme.

As for why SROG: trying to stick all the semantics needed for all skb
operations into the old approach was nasty; I had it almost complete,
but it was a horror (and more code than the SROG approach).

Why not SLOB? Well, I mistakenly assumed that it was a simpler SLAB
allocator, my bad. However, after a quick peek at it: while it seems
similar in intent, it does not provide the things I am looking for.
Making it do so, keeping the old semantics, and making it compile
alongside SLAB would take quite some effort.

* Re: [RFC][PATCH 3/4] deadlock prevention core
  2006-08-12 17:31   ` [RFC][PATCH 3/4] deadlock prevention core Indan Zupancic
@ 2006-08-12 17:44     ` Peter Zijlstra
  2006-08-12 17:54       ` Indan Zupancic
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2006-08-12 17:44 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: linux-mm, linux-kernel, netdev, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

On Sat, 2006-08-12 at 19:31 +0200, Indan Zupancic wrote:
> On Sat, August 12, 2006 16:14, Peter Zijlstra said:
> > +struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int fclone)
> > +{
> > +	struct sk_buff *skb;
> > +
> > +	skb = ___alloc_skb(size, gfp_mask & ~__GFP_MEMALLOC, fclone);
> > +
> > +	if (!skb && (gfp_mask & __GFP_MEMALLOC) && memalloc_skbs_available())
> > +		skb = ___alloc_skb(size, gfp_mask, fclone);
> > +
> > +	return skb;
> > +}
> > +
> 
> I'd drop the memalloc_skbs_available() check, as that's already done by
> ___alloc_skb.

Right, thanks. Hmm, its the last occurence of that function, even
better.

> > +static DEFINE_SPINLOCK(memalloc_lock);
> > +static int memalloc_socks;
> > +static unsigned long memalloc_reserve;
> 
> Why is this a long? adjust_memalloc_reserve() takes an int.

Euhm, right :-) long comes naturally when I think about quantities of
pages. The adjust_memalloc_reserve() argument is an increment, a delta;
perhaps I should change that to long.

> Is it needed at all, considering var_free_kbytes already exists?

Having them separate would allow adjust_memalloc_reserve() to be used
by other callers too (it would need some extra locking).
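
Concretely, no more than this (a sketch; whether the delta ends up int
or long is exactly the open question above):

/*
 * Sketch only: a signed delta of pages, so callers can both grow and
 * shrink the reserve; the implementation serialises updates itself.
 */
int adjust_memalloc_reserve(long pages);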


* Re: [RFC][PATCH 3/4] deadlock prevention core
  2006-08-12 17:44     ` Peter Zijlstra
@ 2006-08-12 17:54       ` Indan Zupancic
  2006-08-12 18:08         ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Indan Zupancic @ 2006-08-12 17:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, netdev, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

On Sat, August 12, 2006 19:44, Peter Zijlstra said:
> Euhm, right :-) long comes naturally when I think about quantities of
> pages. The adjust_memalloc_reserve() argument is an increment, a delta;
> perhaps I should change that to long.

Maybe, but having 16 TB of reserved memory seems plenty for a while.

> Having them separate would allow adjust_memalloc_reserve() to be used by
> other callers too (would need some extra locking).

True, but currently memalloc_reserve isn't used in a sensible way,
or I'm missing something.

Greetings,

Indan



* Re: [RFC][PATCH 3/4] deadlock prevention core
  2006-08-12 17:54       ` Indan Zupancic
@ 2006-08-12 18:08         ` Peter Zijlstra
  2006-08-12 18:32           ` Indan Zupancic
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2006-08-12 18:08 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: linux-mm, linux-kernel, netdev, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

On Sat, 2006-08-12 at 19:54 +0200, Indan Zupancic wrote:
> On Sat, August 12, 2006 19:44, Peter Zijlstra said:
> > Euhm, right :-) long comes naturally when I think about quantities of
> > pages. The adjust_memalloc_reserve() argument is an increment, a delta;
> > perhaps I should change that to long.
> 
> Maybe, but having 16 TB of reserved memory seems plenty for a while.

Oh, for sure, but since it doesn't really matter all that much, I'd
rather go for the proper type.

> > Having them separate would allow adjust_memalloc_reserve() to be used by
> > other callers too (would need some extra locking).
> 
> True, but currently memalloc_reserve isn't used in a sensible way,
> or I'm missing something.

Well, I'm somewhat reluctant to stick network related code into mm/, it
seems well separated now.



* Re: [RFC][PATCH 0/4] VM deadlock prevention -v4
  2006-08-12 17:33   ` Peter Zijlstra
@ 2006-08-12 18:16     ` Indan Zupancic
  2006-08-12 18:54       ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Indan Zupancic @ 2006-08-12 18:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, netdev, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

On Sat, August 12, 2006 19:33, Peter Zijlstra said:
> Simpler yes, but also more complete; the old patches had serious issues
> with the alternative allocation scheme.

It sure is more complete, and looks nicer, but the price is IMHO too high.
I'm curious what those serious issues are, and if they can't be fixed.

> As for why SROG: trying to stick all the semantics needed for all skb
> operations into the old approach was nasty; I had it almost complete,
> but it was horrid (and more code than the SROG approach).

What was missing or wrong in the old approach? Can't you use the new
approach, but use alloc_pages() instead of SROG?

Sorry if I bug you so, but I'm also trying to increase my knowledge here. ;-)

Greetings,

Indan



* Re: [RFC][PATCH 3/4] deadlock prevention core
  2006-08-12 18:08         ` Peter Zijlstra
@ 2006-08-12 18:32           ` Indan Zupancic
  2006-08-12 18:47             ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Indan Zupancic @ 2006-08-12 18:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, netdev, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

On Sat, August 12, 2006 20:08, Peter Zijlstra said:
> On Sat, 2006-08-12 at 19:54 +0200, Indan Zupancic wrote:
>> True, but currently memalloc_reserve isn't used in a sensible way,
>> or I'm missing something.
>
> Well, I'm somewhat reluctant to stick network related code into mm/, it
> seems well separated now.

What I had in mind was something like:

+static DEFINE_SPINLOCK(memalloc_lock);
+static int memalloc_socks;
+
+atomic_t memalloc_skbs_used;
+EXPORT_SYMBOL_GPL(memalloc_skbs_used);
+
+int sk_adjust_memalloc(int nr_socks)
+{
+	unsigned long flags;
+	unsigned int reserve;
+	int err;
+
+	spin_lock_irqsave(&memalloc_lock, flags);
+
+	memalloc_socks += nr_socks;
+	BUG_ON(memalloc_socks < 0);
+
+	reserve = nr_socks * (2 * MAX_PHYS_SEGMENTS + 	/* outbound */
+			      5 * MAX_CONCURRENT_SKBS);	/* inbound */
+
+	err = adjust_memalloc_reserve(reserve);
+	spin_unlock_irqrestore(&memalloc_lock, flags);
+	if (err) {
+		printk(KERN_WARNING
+			"Unable to change RX reserve to: %u, error: %d\n",
+			reserve, err);
+	}
+	return err;
+}

The original code missed the brackets, so 5 * MAX_CONCURRENT_SKBS wasn't applied
per socket. But the comment said it was per socket, so I added them in this version.

Greetings,

Indan



* Re: [RFC][PATCH 3/4] deadlock prevention core
  2006-08-12 18:32           ` Indan Zupancic
@ 2006-08-12 18:47             ` Peter Zijlstra
  2006-08-12 19:45               ` Indan Zupancic
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2006-08-12 18:47 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: linux-mm, linux-kernel, netdev, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

On Sat, 2006-08-12 at 20:32 +0200, Indan Zupancic wrote:
> On Sat, August 12, 2006 20:08, Peter Zijlstra said:
> > On Sat, 2006-08-12 at 19:54 +0200, Indan Zupancic wrote:
> >> True, but currently memalloc_reserve isn't used in a sensible way,
> >> or I'm missing something.
> >
> > Well, I'm somewhat reluctant to stick network related code into mm/, it
> > seems well separated now.
> 
> What I had in mind was something like:
> 
> +static DEFINE_SPINLOCK(memalloc_lock);
> +static int memalloc_socks;
> +
> +atomic_t memalloc_skbs_used;
> +EXPORT_SYMBOL_GPL(memalloc_skbs_used);
> +
> +int sk_adjust_memalloc(int nr_socks)
> +{
> +	unsigned long flags;
> +	unsigned int reserve;
> +	int err;
> +
> +	spin_lock_irqsave(&memalloc_lock, flags);
> +
> +	memalloc_socks += nr_socks;
> +	BUG_ON(memalloc_socks < 0);
> +
> +	reserve = nr_socks * (2 * MAX_PHYS_SEGMENTS + 	/* outbound */
> +			      5 * MAX_CONCURRENT_SKBS);	/* inbound */
> +
> +	err = adjust_memalloc_reserve(reserve);
> +	spin_unlock_irqrestore(&memalloc_lock, flags);
> +	if (err) {
> +		printk(KERN_WARNING
> +			"Unable to change RX reserve to: %u, error: %d\n",
> +			reserve, err);
> +	}
> +	return err;
> +}
> 
> The original code missed the brackets, so 5 * MAX_CONCURRENT_SKBS wasn't applied
> per socket. But the comment said it was per socket, so I added them in this version.

Ah right, I did that in v3, with a similar comment, but I realised that
the inbound reserve need not be per socket, and that the comment was
ambiguous enough to allow this reading.

Why not per socket? It's used to place an upper bound; it's not
calculating one, it's setting one.

Like you can see in memalloc_skbs_inc(), it limits to
MAX_CONCURRENT_SKBS.
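
From memory it is roughly this (paraphrased for illustration, not the
literal patch code):

static inline int memalloc_skbs_inc(void)
{
	/* Claim a slot in the global emergency skb budget; back off
	 * when the upper bound has already been reached. */
	if (atomic_add_return(1, &memalloc_skbs_used) <= MAX_CONCURRENT_SKBS)
		return 1;

	atomic_dec(&memalloc_skbs_used);
	return 0;
}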


* Re: [RFC][PATCH 0/4] VM deadlock prevention -v4
  2006-08-12 18:16     ` Indan Zupancic
@ 2006-08-12 18:54       ` Peter Zijlstra
  2006-08-12 20:05         ` Indan Zupancic
  2006-08-14  0:42         ` Daniel Phillips
  0 siblings, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2006-08-12 18:54 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: linux-mm, linux-kernel, netdev, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

On Sat, 2006-08-12 at 20:16 +0200, Indan Zupancic wrote:
> On Sat, August 12, 2006 19:33, Peter Zijlstra said:
> > Simpler yes, but also more complete; the old patches had serious issues
> > with the alternative allocation scheme.
> 
> It sure is more complete, and looks nicer, but the price is IMHO too high.
> I'm curious what those serious issues are, and if they can't be fixed.
> 
> > As for why SROG: trying to stick all the semantics needed for all skb
> > operations into the old approach was nasty; I had it almost complete,
> > but it was horrid (and more code than the SROG approach).
> 
> What was missing or wrong in the old approach? Can't you use the new
> approach, but use alloc_pages() instead of SROG?
> 
> Sorry if I bug you so, but I'm also trying to increase my knowledge here. ;-)

I'm almost sorry I threw that code out, you'd understand instantly...

Lemme see what I can do to explain; what I need/want is:
 - single allocation group per packet - that is, when I free a packet
and all its associated objects I get my memory back.
 - not waste too much space managing the various objects

skb operations want to allocate various sk_buffs for the same data
(clones). Also, they want to be able to break the COW or realloc the data.

The trivial approach would be one page (or higher-order page) per
object, and that would work quite well, except that it'll waste a _lot_
of memory.

So I tried manual packing (parts of that you have seen in previous
attempts). This gets hard when you want to do unlimited clones and COW
breaks. To do either you need to link several pages together.

So needing a list of pages and wanting packing gave me SROG. The biggest
wart is having to deal with higher-order pages. Explicitly coding in
knowledge of the object you're packing just makes the code bigger - such
is the power of abstraction.
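
Stripped of all the skb-specific details, the core of the idea is no
more than this (an illustrative sketch, not the actual patch; names and
layout are made up):

/*
 * One allocation group per packet: objects are bump-allocated into a
 * chain of pages owned by the group, and freeing the group returns
 * all of them at once.
 */
struct srog {
	struct page *pages;	/* chain of pages, newest first */
	void *cursor;		/* next free byte in the head page */
	size_t space;		/* bytes left in the head page */
};

static void *srog_alloc(struct srog *g, size_t size)
{
	void *obj;

	/* The sketch assumes size <= PAGE_SIZE; dealing with larger
	 * objects is exactly the higher-order wart mentioned above. */
	BUG_ON(size > PAGE_SIZE);

	if (size > g->space) {
		struct page *page = alloc_page(GFP_ATOMIC);

		if (!page)
			return NULL;
		page->private = (unsigned long)g->pages;  /* link into chain */
		g->pages = page;
		g->cursor = page_address(page);
		g->space = PAGE_SIZE;
	}
	obj = g->cursor;
	g->cursor += size;
	g->space -= size;
	return obj;
}

static void srog_free(struct srog *g)
{
	struct page *page = g->pages;

	while (page) {
		struct page *next = (struct page *)page->private;

		__free_page(page);
		page = next;
	}
	g->pages = NULL;
}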


* Re: [RFC][PATCH 3/4] deadlock prevention core
  2006-08-12 18:47             ` Peter Zijlstra
@ 2006-08-12 19:45               ` Indan Zupancic
  0 siblings, 0 replies; 30+ messages in thread
From: Indan Zupancic @ 2006-08-12 19:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, netdev, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

On Sat, August 12, 2006 20:47, Peter Zijlstra said:
> Ah right, I did that in v3, with a similar comment, but I realised that
> the inbound reserve need not be per socket, and that the comment was
> ambiguous enough to allow this reading.

True, but better to change the comment than to confuse people.
Lots of it is outdated because reservations aren't per device anymore.

Changes to your version:
- Got rid of memalloc_reserve.
- Don't include inetdevice.h (it isn't needed anymore, right?)
- Updated comment.

(I'm editing the diff, so this won't apply)

Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c	2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/net/core/sock.c	2006-08-12 13:02:59.000000000 +0200
@@ -111,6 +111,8 @@
 #include <linux/poll.h>
 #include <linux/tcp.h>
 #include <linux/init.h>
+#include <linux/blkdev.h>

 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -195,6 +197,78 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);

+/**
+ *        sk_adjust_memalloc - adjust the global memalloc reserve
+ *        @nr_socks: number of new %SOCK_MEMALLOC sockets
+ *
+ *        This function adjusts the memalloc reserve based on system demand.
+ *        For each %SOCK_MEMALLOC socket 2 * %MAX_PHYS_SEGMENTS pages are
+ *        reserved for outbound traffic (assumption: each %SOCK_MEMALLOC
+ *        socket will have a %request_queue associated).
+ *
+ *        Pages for inbound traffic are already reserved.
+ *
+ *        2 * %MAX_PHYS_SEGMENTS - the request queue can hold up to 150%,
+ *                the remaining 50% goes to being sure we can write packets
+ *                for the outgoing pages.
+ */
+static DEFINE_SPINLOCK(memalloc_lock);
+static int memalloc_socks;
+
+atomic_t memalloc_skbs_used;
+EXPORT_SYMBOL_GPL(memalloc_skbs_used);
+
+int sk_adjust_memalloc(int nr_socks)
+{
+	unsigned long flags;
+	unsigned int reserve;
+	int err;
+
+	spin_lock_irqsave(&memalloc_lock, flags);
+
+	memalloc_socks += nr_socks;
+	BUG_ON(memalloc_socks < 0);
+
+	reserve = nr_socks * 2 * MAX_PHYS_SEGMENTS;	/* outbound */
+
+	err = adjust_memalloc_reserve(reserve);
+	spin_unlock_irqrestore(&memalloc_lock, flags);
+	if (err) {
+		printk(KERN_WARNING
+			"Unable to adjust RX reserve by %u, error: %d\n",
+			reserve, err);
+	}
+	return err;
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);

What's missing now is an adjust_memalloc_reserve(5 * MAX_CONCURRENT_SKBS)
call in some init code.
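
For example (illustrative only; the init function name is made up, and
where exactly to hook it is open):

/* Hypothetical boot-time reservation of the inbound skb budget. */
static int __init memalloc_rx_init(void)
{
	return adjust_memalloc_reserve(5 * MAX_CONCURRENT_SKBS);
}
core_initcall(memalloc_rx_init);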

Greetings,

Indan



* Re: [RFC][PATCH 0/4] VM deadlock prevention -v4
  2006-08-12 18:54       ` Peter Zijlstra
@ 2006-08-12 20:05         ` Indan Zupancic
  2006-08-14  0:42         ` Daniel Phillips
  1 sibling, 0 replies; 30+ messages in thread
From: Indan Zupancic @ 2006-08-12 20:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, netdev, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

On Sat, August 12, 2006 20:54, Peter Zijlstra said:
>  - single allocation group per packet - that is, when I free a packet
> and all its associated objects I get my memory back.

This is easy.

>  - not waste too much space managing the various objects

This too, when ignoring clones and COW.

> skb operations want to allocate various sk_buffs for the same data
> (clones). Also, they want to be able to break the COW or realloc the data.

So this seems to be what adds all the complexity.

> So I tried manual packing (parts of that you have seen in previous
> attempts). This gets hard when you want to do unlimited clones and COW
> breaks. To do either you need to link several pages together.

It gets messy quite quickly, yes.

> So needing a list of pages and wanting packing gave me SROG. The biggest
> wart is having to deal with higher-order pages. Explicitly coding in
> knowledge of the object you're packing just makes the code bigger - such
> is the power of abstraction.

I assume you meant "Not explicitly coding in", or else I'm tempted to disagree.
Abstraction that has only one user, which uses it in only one way, just adds bloat.
But looking at the code a bit more, I'm afraid you're right.

Greetings,

Indan



* Re: rename *MEMALLOC flags
  2006-08-12 15:06     ` rename *MEMALLOC flags (was: Re: [RFC][PATCH 3/4] deadlock prevention core) Peter Zijlstra
  2006-08-12 15:28       ` Indan Zupancic
@ 2006-08-14  0:06       ` Daniel Phillips
  2006-08-14  1:00         ` Paul Jackson
  1 sibling, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2006-08-14  0:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeff Garzik, linux-mm, linux-kernel, netdev, Indan Zupancic,
	Evgeniy Polyakov, Rik van Riel, David Miller

Peter Zijlstra wrote:
>Jeff Garzik in his infinite wisdom spake thusly:
>>Peter Zijlstra wrote:
>>
>>>Index: linux-2.6/include/linux/gfp.h
>>>===================================================================
>>>--- linux-2.6.orig/include/linux/gfp.h	2006-08-12 12:56:06.000000000 +0200
>>>+++ linux-2.6/include/linux/gfp.h	2006-08-12 12:56:09.000000000 +0200
>>>@@ -46,6 +46,7 @@ struct vm_area_struct;
>>> #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
>>> #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
>>> #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
>>>+#define __GFP_MEMALLOC  ((__force gfp_t)0x40000u) /* Use emergency reserves */
>>
>>This symbol name has nothing to do with its purpose.  The entire area of 
>>code you are modifying could be described as having something to do with 
>>'memalloc'.
>>
>>GFP_EMERGENCY or GFP_USE_RESERVES or somesuch would be a far better 
>>symbol name.
>>
>>I recognize that it matches GFP_NOMEMALLOC, but that doesn't change 
>>the situation anyway.  In fact, a cleanup patch to rename GFP_NOMEMALLOC 
>>would be nice.
> 
> I'm rather bad at picking names, but here goes:
> 
> PF_MEMALLOC      -> PF_EMERGALLOC
> __GFP_NOMEMALLOC -> __GFP_NOEMERGALLOC
> __GFP_MEMALLOC   -> __GFP_EMERGALLOC
> 
> Is that suitable and shall I prepare patches? Or do we want more ppl to
> chime in and have a few more rounds?

MEMALLOC is the name Linus chose for exactly the reserve from which we
are allocating.  Perhaps that was just Linus being denser than jgarzik and
not realizing that he should have called it EMERGALLOC right from the start.

BUT since Linus did call it MEMALLOC, we should too.  Or just email Linus
and tell him how much better EMERGALLOC rolls off the tongue, and could we
please change all occurrences of MEMALLOC to EMERGALLOC.  Then don't read
your email for a week ;-)

Inventing a new name for an existing thing is very poor taste on grounds of
grepability alone.

Regards,

Daniel


* Re: [RFC][PATCH 0/4] VM deadlock prevention -v4
  2006-08-12 18:54       ` Peter Zijlstra
  2006-08-12 20:05         ` Indan Zupancic
@ 2006-08-14  0:42         ` Daniel Phillips
  2006-08-14  5:20           ` Evgeniy Polyakov
  1 sibling, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2006-08-14  0:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Indan Zupancic, linux-mm, linux-kernel, netdev, Evgeniy Polyakov,
	Rik van Riel, David Miller

Peter Zijlstra wrote:
> On Sat, 2006-08-12 at 20:16 +0200, Indan Zupancic wrote:
>>What was missing or wrong in the old approach? Can't you use the new
>>approach, but use alloc_pages() instead of SROG?
>>
>>Sorry if I bug you so, but I'm also trying to increase my knowledge here. ;-)
> 
> I'm almost sorry I threw that code out...

Good instinct :-)

> Lemme see what I can do to explain; what I need/want is:
>  - single allocation group per packet - that is, when I free a packet
> and all its associated objects I get my memory back.

First, try to recast all your objects as pages, as Evgeniy Polyakov
suggested.  Then if there is some place where that just doesn't work
(please point it out) put a mempool there and tweak the high level
reservation setup accordingly.

>  - not waste too much space managing the various objects

If we waste a little space only where the network would have otherwise
dropped a packet, that is still a big win.  We just need to be sure the
normal path does not become more wasteful.

> skb operations want to allocate various sk_buffs for the same data
> (clones). Also, they want to be able to break the COW or realloc the data.
> 
> The trivial approach would be one page (or higher-order page) per
> object, and that would work quite well, except that it'll waste a _lot_
> of memory.

High order allocations are just way too undependable without active
defragmentation, which isn't even on the horizon at the moment.  We
just need to treat any network hardware that can't scatter/gather into
single pages as too broken to use for network block io.

As for sk_buff cow break, we need to look at which network paths do it
(netfilter obviously, probably others) and decide whether we just want
to declare that the feature breaks network block IO, or fix the feature
so it plays well with reserve accounting.

> So I tried manual packing (parts of that you have seen in previous
> attempts). This gets hard when you want to do unlimited clones and COW
> breaks. To do either you need to link several pages together.

You need to avoid the temptation to fix the entire world on the first
attempt.  Sure, there will be people who call for gobs of overengineering
right from the start, but simple has always worked better for Linux than
lathering on layers of complexity just to support some feature that may
arguably be broken by design.  For example, swapping through a firewall.

Changing from per-interface to a global block IO reserve was a great
simplification; we need more of those.

Looking forward to -v5 ;-)

Regards,

Daniel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: rename *MEMALLOC flags
  2006-08-14  0:06       ` rename *MEMALLOC flags Daniel Phillips
@ 2006-08-14  1:00         ` Paul Jackson
  2006-08-14  3:42           ` Nick Piggin
  0 siblings, 1 reply; 30+ messages in thread
From: Paul Jackson @ 2006-08-14  1:00 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: a.p.zijlstra, jeff, linux-mm, linux-kernel, netdev, indan,
	johnpol, riel, davem, Nick Piggin

Daniel wrote:
> Inventing a new name for an existing thing is very poor taste on grounds of
> grepability alone.

I wouldn't say 'very poor taste' -- just something that should be
done infrequently, with good reason, and with reasonable consensus,
especially from the key maintainers in the affected area.

Good names are good taste, in my book.  But stable naming is good too.

I wonder what Nick thinks of this?  Looks like he added
__GFP_NOMEMALLOC a year ago, following the naming style of PF_MEMALLOC.

I added him to the cc list.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: rename *MEMALLOC flags
  2006-08-14  1:00         ` Paul Jackson
@ 2006-08-14  3:42           ` Nick Piggin
  0 siblings, 0 replies; 30+ messages in thread
From: Nick Piggin @ 2006-08-14  3:42 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Daniel Phillips, a.p.zijlstra, jeff, linux-mm, linux-kernel,
	netdev, indan, johnpol, riel, davem

Paul Jackson wrote:
> Daniel wrote:
> 
>>Inventing a new name for an existing thing is very poor taste on grounds of
>>grepability alone.
> 
> 
> I wouldn't say 'very poor taste' -- just something that should be
> done infrequently, with good reason, and with reasonable consensus,
> especially from the key maintainers in the affected area.
> 
> Good names are good taste, in my book.  But stable naming is good too.
> 
> I wonder what Nick thinks of this?  Looks like he added
> __GFP_NOMEMALLOC a year ago, following the naming style of PF_MEMALLOC.
> 
> I added him to the cc list.
> 

__GFP_NOMEMALLOC was added to prevent mempool-backed allocations from
accessing the emergency reserve, because that would just shift deadlocks
from mempool "safe" sites to those which have not been converted.
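
For reference, this is roughly what the mempool side does with it
(simplified from memory, not the literal mempool_alloc() code):

/*
 * A mempool-backed allocation must opt out of the emergency reserve,
 * even when the caller runs with PF_MEMALLOC; on failure we fall back
 * to the pool's own pre-allocated elements instead.
 */
gfp_mask |= __GFP_NOMEMALLOC;	/* don't use emergency reserves */
gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
gfp_mask |= __GFP_NOWARN;	/* failures are OK, we can fall back */
element = pool->alloc(gfp_mask & ~(__GFP_WAIT|__GFP_IO), pool->pool_data);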

PF_MEMALLOC is a good name: PF_MEMALLOC says that the task is currently
allocating memory. It does not say anything about the actual allocator
implementation details to handle this (1. don't recurse into reclaim; 2.
allow access to reserves), but that is a good thing.

__GFP_NOMEMALLOC and __GFP_MEMALLOC are poorly named (I take the blame).
It isn't that the task is suddenly no longer allocating in the context
of an allocation, it is just that you want to allow or deny access to
the reserve.

__GFP_NOMEMALLOC should be something like __GFP_EMERG_NEVER and
__GFP_MEMALLOC should be _ALWAYS. Or something like that.

NOMEMALLOC is specific enough that I don't mind a rename at this stage.
Renaming PF_MEMALLOC would be wrong, however.

-- 
SUSE Labs, Novell Inc.


* Re: [RFC][PATCH 0/4] VM deadlock prevention -v4
  2006-08-14  0:42         ` Daniel Phillips
@ 2006-08-14  5:20           ` Evgeniy Polyakov
  2006-08-14 12:21             ` Rik van Riel
  0 siblings, 1 reply; 30+ messages in thread
From: Evgeniy Polyakov @ 2006-08-14  5:20 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Peter Zijlstra, Indan Zupancic, linux-mm, linux-kernel, netdev,
	Rik van Riel, David Miller

On Sun, Aug 13, 2006 at 05:42:47PM -0700, Daniel Phillips (phillips@google.com) wrote:
> High order allocations are just way too undependable without active
> defragmentation, which isn't even on the horizon at the moment.  We
> just need to treat any network hardware that can't scatter/gather into
> single pages as too broken to use for network block io.

A bit of free advertisement for the network tree allocator: per-CPU
self-defragmentation works reliably in that allocator, and one can even
find graphs of memory usage for NTA versus a SLAB-like allocator.

> As for sk_buff cow break, we need to look at which network paths do it
> (netfilter obviously, probably others) and decide whether we just want
> to declare that the feature breaks network block IO, or fix the feature
> so it plays well with reserve accounting.

I would suggest considering skb COW (cloning) a must.


> Regards,
> 
> Daniel

-- 
	Evgeniy Polyakov


* Re: [RFC][PATCH 0/4] VM deadlock prevention -v4
  2006-08-14  5:20           ` Evgeniy Polyakov
@ 2006-08-14 12:21             ` Rik van Riel
  2006-08-14 12:51               ` Herbert Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Rik van Riel @ 2006-08-14 12:21 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Daniel Phillips, Peter Zijlstra, Indan Zupancic, linux-mm,
	linux-kernel, netdev, David Miller

Evgeniy Polyakov wrote:
> On Sun, Aug 13, 2006 at 05:42:47PM -0700, Daniel Phillips (phillips@google.com) wrote:

>> As for sk_buff cow break, we need to look at which network paths do it
>> (netfilter obviously, probably others) and decide whether we just want
>> to declare that the feature breaks network block IO, or fix the feature
>> so it plays well with reserve accounting.
> 
> I would suggest considering skb COW (cloning) a must.

That should not be any problem, since skb's (including cowed ones)
are short lived anyway.  Allocating a little bit more memory is
fine when we have a guarantee that the memory will be freed again
shortly.

-- 
What is important?  What you want to be true, or what is true?


* Re: [RFC][PATCH 0/4] VM deadlock prevention -v4
  2006-08-14 12:21             ` Rik van Riel
@ 2006-08-14 12:51               ` Herbert Xu
  2006-08-14 14:22                 ` Rik van Riel
  0 siblings, 1 reply; 30+ messages in thread
From: Herbert Xu @ 2006-08-14 12:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: johnpol, phillips, a.p.zijlstra, indan, linux-mm, linux-kernel,
	netdev, davem

Rik van Riel <riel@redhat.com> wrote:
> 
> That should not be any problem, since skb's (including cowed ones)
> are short lived anyway.  Allocating a little bit more memory is
> fine when we have a guarantee that the memory will be freed again
> shortly.

I'm not sure about the context the comment applies to, but skb's are
not necessarily short-lived.  For example, they could be queued for
a few seconds for ARP/NDISC and even longer for IPsec SA resolution.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [RFC][PATCH 0/4] VM deadlock prevention -v4
  2006-08-14 12:51               ` Herbert Xu
@ 2006-08-14 14:22                 ` Rik van Riel
  0 siblings, 0 replies; 30+ messages in thread
From: Rik van Riel @ 2006-08-14 14:22 UTC (permalink / raw)
  To: Herbert Xu
  Cc: johnpol, phillips, a.p.zijlstra, indan, linux-mm, linux-kernel,
	netdev, davem

Herbert Xu wrote:
> Rik van Riel <riel@redhat.com> wrote:
>> That should not be any problem, since skb's (including cowed ones)
>> are short lived anyway.  Allocating a little bit more memory is
>> fine when we have a guarantee that the memory will be freed again
>> shortly.
> 
> I'm not sure about the context the comment applies to, but skb's are
> not necessarily short-lived.  For example, they could be queued for
> a few seconds for ARP/NDISC and even longer for IPsec SA resolution.

That's still below the threshold where it should cause problems
with the VM going OOM.  Especially if there aren't too many of
these packets.

-- 
All Rights Reversed


* Re: [RFC][PATCH 4/4] deadlock prevention for NBD
  2006-08-12 14:14 ` [RFC][PATCH 4/4] deadlock prevention for NBD Peter Zijlstra
@ 2006-08-24 14:43   ` Pavel Machek
  0 siblings, 0 replies; 30+ messages in thread
From: Pavel Machek @ 2006-08-24 14:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, netdev, Indan Zupancic, Evgeniy Polyakov,
	Daniel Phillips, Rik van Riel, David Miller

Hi!

> Limit each request to 1 page, so that the request throttling also limits the
> number of in-flight pages, and force the IO scheduler to NOOP, as anything else
> doesn't make sense anyway.

I'd like to understand why it breaks with other schedulers before
merging this. Maybe the failure is just harder to trigger with NOOP?

							Pavel

-- 
Thanks for all the (sleeping) penguins.


end of thread, other threads:[~2006-08-24 14:43 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-08-12 14:14 [RFC][PATCH 0/4] VM deadlock prevention -v4 Peter Zijlstra
2006-08-12 14:14 ` [RFC][PATCH 1/4] pfn_to_kaddr() for UML Peter Zijlstra
2006-08-12 14:14 ` [RFC][PATCH 2/4] SROG allocator Peter Zijlstra
2006-08-12 14:14 ` [RFC][PATCH 3/4] deadlock prevention core Peter Zijlstra
2006-08-12 14:41   ` Jeff Garzik
2006-08-12 15:06     ` rename *MEMALLOC flags (was: Re: [RFC][PATCH 3/4] deadlock prevention core) Peter Zijlstra
2006-08-12 15:28       ` Indan Zupancic
2006-08-12 15:34         ` Peter Zijlstra
2006-08-14  0:06       ` rename *MEMALLOC flags Daniel Phillips
2006-08-14  1:00         ` Paul Jackson
2006-08-14  3:42           ` Nick Piggin
2006-08-12 17:31   ` [RFC][PATCH 3/4] deadlock prevention core Indan Zupancic
2006-08-12 17:44     ` Peter Zijlstra
2006-08-12 17:54       ` Indan Zupancic
2006-08-12 18:08         ` Peter Zijlstra
2006-08-12 18:32           ` Indan Zupancic
2006-08-12 18:47             ` Peter Zijlstra
2006-08-12 19:45               ` Indan Zupancic
2006-08-12 14:14 ` [RFC][PATCH 4/4] deadlock prevention for NBD Peter Zijlstra
2006-08-24 14:43   ` Pavel Machek
2006-08-12 16:51 ` [RFC][PATCH 0/4] VM deadlock prevention -v4 Indan Zupancic
2006-08-12 17:33   ` Peter Zijlstra
2006-08-12 18:16     ` Indan Zupancic
2006-08-12 18:54       ` Peter Zijlstra
2006-08-12 20:05         ` Indan Zupancic
2006-08-14  0:42         ` Daniel Phillips
2006-08-14  5:20           ` Evgeniy Polyakov
2006-08-14 12:21             ` Rik van Riel
2006-08-14 12:51               ` Herbert Xu
2006-08-14 14:22                 ` Rik van Riel
