* [PATCH 01/15] nfs: remove congestion_end()
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
@ 2007-05-10 10:08 ` Peter Zijlstra
2007-05-10 10:08 ` [PATCH 02/15] lib: percpu_counter variable batch Peter Zijlstra
` (14 subsequent siblings)
15 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-10 10:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita, trond.myklebust, yingchao.zhou
[-- Attachment #1: nfs_congestion_fixup.patch --]
[-- Type: text/plain, Size: 2413 bytes --]
It's redundant; clear_bdi_congested() already wakes the waiters.
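For context, the wakeup that congestion_end() performs is already done by
clear_bdi_congested(); a sketch from memory of what that function looks like
in this tree (bit names approximate):

void clear_bdi_congested(struct backing_dev_info *bdi, int rw)
{
	enum bdi_state bit;
	wait_queue_head_t *wqh = &congestion_wqh[rw];

	bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
	clear_bit(bit, &bdi->state);
	smp_mb__after_clear_bit();
	if (waitqueue_active(wqh))	/* same wakeup congestion_end() did */
		wake_up(wqh);
}

so nfs_end_page_writeback() waking the queue a second time buys nothing.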
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
fs/nfs/write.c | 4 +---
include/linux/backing-dev.h | 1 -
mm/backing-dev.c | 13 -------------
3 files changed, 1 insertion(+), 17 deletions(-)
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c 2007-04-22 18:12:46.000000000 +0200
+++ linux-2.6/fs/nfs/write.c 2007-04-22 18:12:49.000000000 +0200
@@ -235,10 +235,8 @@ static void nfs_end_page_writeback(struc
struct nfs_server *nfss = NFS_SERVER(inode);
end_page_writeback(page);
- if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+ if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
clear_bdi_congested(&nfss->backing_dev_info, WRITE);
- congestion_end(WRITE);
- }
}
/*
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h 2007-04-22 18:12:48.000000000 +0200
+++ linux-2.6/include/linux/backing-dev.h 2007-04-22 18:12:49.000000000 +0200
@@ -96,7 +96,6 @@ void clear_bdi_congested(struct backing_
void set_bdi_congested(struct backing_dev_info *bdi, int rw);
long congestion_wait(int rw, long timeout);
long congestion_wait_interruptible(int rw, long timeout);
-void congestion_end(int rw);
#define bdi_cap_writeback_dirty(bdi) \
(!((bdi)->capabilities & BDI_CAP_NO_WRITEBACK))
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c 2007-04-22 18:12:46.000000000 +0200
+++ linux-2.6/mm/backing-dev.c 2007-04-22 18:12:49.000000000 +0200
@@ -70,16 +70,3 @@ long congestion_wait_interruptible(int r
return ret;
}
EXPORT_SYMBOL(congestion_wait_interruptible);
-
-/**
- * congestion_end - wake up sleepers on a congested backing_dev_info
- * @rw: READ or WRITE
- */
-void congestion_end(int rw)
-{
- wait_queue_head_t *wqh = &congestion_wqh[rw];
-
- if (waitqueue_active(wqh))
- wake_up(wqh);
-}
-EXPORT_SYMBOL(congestion_end);
* [PATCH 02/15] lib: percpu_counter variable batch
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
2007-05-10 10:08 ` [PATCH 01/15] nfs: remove congestion_end() Peter Zijlstra
@ 2007-05-10 10:08 ` Peter Zijlstra
2007-05-10 10:08 ` [PATCH 03/15] lib: percpu_counter_mod64 Peter Zijlstra
` (13 subsequent siblings)
15 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-10 10:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita, trond.myklebust, yingchao.zhou
[-- Attachment #1: percpu_counter_batch.patch --]
[-- Type: text/plain, Size: 2627 bytes --]
Because the current batch setup has a quadratic error bound on the counter
(the default FBC_BATCH itself scales with the number of CPUs, so the worst-case
drift of percpu_counter_read() is roughly nr_cpus * batch), allow for an
alternative setup where the caller supplies the batch size.
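A minimal usage sketch (hypothetical caller, not part of this patch): a user
that wants a tighter error bound passes its own batch instead of FBC_BATCH:

static struct percpu_counter nr_foo;		/* hypothetical counter */

static void foo_account(s32 delta)
{
	/*
	 * With a batch of 8 the worst-case drift seen by
	 * percpu_counter_read() is nr_cpus * 8 rather than
	 * nr_cpus * FBC_BATCH.
	 */
	__percpu_counter_mod(&nr_foo, delta, 8);
}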
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/percpu_counter.h | 10 +++++++++-
lib/percpu_counter.c | 6 +++---
2 files changed, 12 insertions(+), 4 deletions(-)
Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h 2007-04-29 18:14:37.000000000 +0200
+++ linux-2.6/include/linux/percpu_counter.h 2007-04-29 18:15:26.000000000 +0200
@@ -38,9 +38,14 @@ static inline void percpu_counter_destro
free_percpu(fbc->counters);
}
-void percpu_counter_mod(struct percpu_counter *fbc, s32 amount);
+void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch);
s64 percpu_counter_sum(struct percpu_counter *fbc);
+static inline void percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
+{
+ __percpu_counter_mod(fbc, amount, FBC_BATCH);
+}
+
static inline s64 percpu_counter_read(struct percpu_counter *fbc)
{
return fbc->count;
@@ -76,6 +81,9 @@ static inline void percpu_counter_destro
{
}
+#define __percpu_counter_mod(fbc, amount, batch) \
+ percpu_counter_mod(fbc, amount)
+
static inline void
percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
{
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c 2007-04-29 18:14:37.000000000 +0200
+++ linux-2.6/lib/percpu_counter.c 2007-04-29 18:14:40.000000000 +0200
@@ -5,7 +5,7 @@
#include <linux/percpu_counter.h>
#include <linux/module.h>
-void percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
+void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch)
{
long count;
s32 *pcount;
@@ -13,7 +13,7 @@ void percpu_counter_mod(struct percpu_co
pcount = per_cpu_ptr(fbc->counters, cpu);
count = *pcount + amount;
- if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+ if (count >= batch || count <= -batch) {
spin_lock(&fbc->lock);
fbc->count += count;
*pcount = 0;
@@ -23,7 +23,7 @@ void percpu_counter_mod(struct percpu_co
}
put_cpu();
}
-EXPORT_SYMBOL(percpu_counter_mod);
+EXPORT_SYMBOL(__percpu_counter_mod);
/*
* Add up all the per-cpu counts, return the result. This is a more accurate
* [PATCH 03/15] lib: percpu_counter_mod64
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
2007-05-10 10:08 ` [PATCH 01/15] nfs: remove congestion_end() Peter Zijlstra
2007-05-10 10:08 ` [PATCH 02/15] lib: percpu_counter variable batch Peter Zijlstra
@ 2007-05-10 10:08 ` Peter Zijlstra
2007-05-10 10:08 ` [PATCH 04/15] lib: percpu_counter_set Peter Zijlstra
` (12 subsequent siblings)
15 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-10 10:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita, trond.myklebust, yingchao.zhou
[-- Attachment #1: percpu_counter_mod.patch --]
[-- Type: text/plain, Size: 2847 bytes --]
Add percpu_counter_mod64() to allow large modifications.
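A hypothetical caller sketch; the point is that a single delta may not fit in
the s32 that percpu_counter_mod() takes:

static void foo_account_bytes(struct percpu_counter *fbc, s64 bytes)
{
	/* 'bytes' may exceed 2^31, which the s32 variant cannot carry */
	percpu_counter_mod64(fbc, bytes);
}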
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/percpu_counter.h | 17 +++++++++++++++++
lib/percpu_counter.c | 20 ++++++++++++++++++++
2 files changed, 37 insertions(+)
Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h 2007-04-29 18:15:26.000000000 +0200
+++ linux-2.6/include/linux/percpu_counter.h 2007-04-29 18:16:49.000000000 +0200
@@ -39,6 +39,7 @@ static inline void percpu_counter_destro
}
void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch);
+void __percpu_counter_mod64(struct percpu_counter *fbc, s64 amount, s32 batch);
s64 percpu_counter_sum(struct percpu_counter *fbc);
static inline void percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
@@ -46,6 +47,11 @@ static inline void percpu_counter_mod(st
__percpu_counter_mod(fbc, amount, FBC_BATCH);
}
+static inline void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
+{
+ __percpu_counter_mod64(fbc, amount, FBC_BATCH);
+}
+
static inline s64 percpu_counter_read(struct percpu_counter *fbc)
{
return fbc->count;
@@ -92,6 +98,17 @@ percpu_counter_mod(struct percpu_counter
preempt_enable();
}
+#define __percpu_counter_mod64(fbc, amount, batch) \
+ percpu_counter_mod64(fbc, amount)
+
+static inline void
+percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
+{
+ preempt_disable();
+ fbc->count += amount;
+ preempt_enable();
+}
+
static inline s64 percpu_counter_read(struct percpu_counter *fbc)
{
return fbc->count;
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c 2007-04-29 18:14:40.000000000 +0200
+++ linux-2.6/lib/percpu_counter.c 2007-04-29 18:15:34.000000000 +0200
@@ -25,6 +25,26 @@ void __percpu_counter_mod(struct percpu_
}
EXPORT_SYMBOL(__percpu_counter_mod);
+void __percpu_counter_mod64(struct percpu_counter *fbc, s64 amount, s32 batch)
+{
+ s64 count;
+ s32 *pcount;
+ int cpu = get_cpu();
+
+ pcount = per_cpu_ptr(fbc->counters, cpu);
+ count = *pcount + amount;
+ if (count >= batch || count <= -batch) {
+ spin_lock(&fbc->lock);
+ fbc->count += count;
+ *pcount = 0;
+ spin_unlock(&fbc->lock);
+ } else {
+ *pcount = count;
+ }
+ put_cpu();
+}
+EXPORT_SYMBOL(__percpu_counter_mod64);
+
/*
* Add up all the per-cpu counts, return the result. This is a more accurate
* but much slower version of percpu_counter_read_positive()
* [PATCH 04/15] lib: percpu_counter_set
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
` (2 preceding siblings ...)
2007-05-10 10:08 ` [PATCH 03/15] lib: percpu_counter_mod64 Peter Zijlstra
@ 2007-05-10 10:08 ` Peter Zijlstra
2007-05-10 10:08 ` [PATCH 05/15] lib: percpu_count_sum_signed() Peter Zijlstra
` (11 subsequent siblings)
15 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-10 10:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita, trond.myklebust, yingchao.zhou
[-- Attachment #1: percpu_counter_set.patch --]
[-- Type: text/plain, Size: 2102 bytes --]
Provide a method to set a percpu counter to a specified value.
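Usage sketch (hypothetical caller): percpu_counter_set() zeroes every per-cpu
delta and installs the new value, so subsequent reads and sums agree on it:

static void foo_reset(struct percpu_counter *fbc, s64 snapshot)
{
	percpu_counter_set(fbc, snapshot);	/* e.g. snapshot = 0 to clear */
}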
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/percpu_counter.h | 6 ++++++
lib/percpu_counter.c | 13 +++++++++++++
2 files changed, 19 insertions(+)
Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h 2007-05-02 12:44:13.000000000 +0200
+++ linux-2.6/include/linux/percpu_counter.h 2007-05-02 19:06:44.000000000 +0200
@@ -38,6 +38,7 @@ static inline void percpu_counter_destro
free_percpu(fbc->counters);
}
+void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch);
void __percpu_counter_mod64(struct percpu_counter *fbc, s64 amount, s32 batch);
s64 percpu_counter_sum(struct percpu_counter *fbc);
@@ -87,6 +88,11 @@ static inline void percpu_counter_destro
{
}
+static inline void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
+{
+ fbc->count = amount;
+}
+
#define __percpu_counter_mod(fbc, amount, batch) \
percpu_counter_mod(fbc, amount)
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c 2007-05-02 12:44:13.000000000 +0200
+++ linux-2.6/lib/percpu_counter.c 2007-05-02 19:08:12.000000000 +0200
@@ -5,6 +5,19 @@
#include <linux/percpu_counter.h>
#include <linux/module.h>
+void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
+{
+ int cpu;
+
+ spin_lock(&fbc->lock);
+ for_each_possible_cpu(cpu) {
+ s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+ *pcount = 0;
+ }
+ fbc->count = amount;
+ spin_unlock(&fbc->lock);
+}
+
void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch)
{
long count;
* [PATCH 05/15] lib: percpu_count_sum_signed()
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
` (3 preceding siblings ...)
2007-05-10 10:08 ` [PATCH 04/15] lib: percpu_counter_set Peter Zijlstra
@ 2007-05-10 10:08 ` Peter Zijlstra
2007-05-10 10:08 ` [PATCH 06/15] mm: bdi init hooks Peter Zijlstra
` (10 subsequent siblings)
15 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-10 10:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita, trond.myklebust, yingchao.zhou
[-- Attachment #1: percpu_counter_sum.patch --]
[-- Type: text/plain, Size: 2800 bytes --]
Provide an accurate version of percpu_counter_read.
Should we go and replace the current use of percpu_counter_sum()
with percpu_counter_sum_positive(), and call this new primitive
percpu_counter_sum() instead?
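The resulting primitives, side by side (sketch, not from the patch):

	s64 fast    = percpu_counter_read(&fbc);	/* cheap, may be off by nr_cpus * batch */
	s64 exact   = percpu_counter_sum_signed(&fbc);	/* walks all CPUs, may be negative */
	s64 clamped = percpu_counter_sum(&fbc);		/* walks all CPUs, clamped to >= 0 */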
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/percpu_counter.h | 18 +++++++++++++++++-
lib/percpu_counter.c | 6 +++---
2 files changed, 20 insertions(+), 4 deletions(-)
Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h 2007-05-02 19:43:34.000000000 +0200
+++ linux-2.6/include/linux/percpu_counter.h 2007-05-04 09:42:47.000000000 +0200
@@ -41,7 +41,18 @@ static inline void percpu_counter_destro
void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch);
void __percpu_counter_mod64(struct percpu_counter *fbc, s64 amount, s32 batch);
-s64 percpu_counter_sum(struct percpu_counter *fbc);
+s64 __percpu_counter_sum(struct percpu_counter *fbc);
+
+static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
+{
+ s64 ret = __percpu_counter_sum(fbc);
+ return ret < 0 ? 0 : ret;
+}
+
+static inline s64 percpu_counter_sum_signed(struct percpu_counter *fbc)
+{
+ return __percpu_counter_sum(fbc);
+}
static inline void percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
{
@@ -130,6 +141,11 @@ static inline s64 percpu_counter_sum(str
return percpu_counter_read_positive(fbc);
}
+static inline s64 percpu_counter_sum_signed(struct percpu_counter *fbc)
+{
+ return fbc->count;
+}
+
#endif /* CONFIG_SMP */
static inline void percpu_counter_inc(struct percpu_counter *fbc)
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c 2007-05-02 19:43:34.000000000 +0200
+++ linux-2.6/lib/percpu_counter.c 2007-05-04 09:38:36.000000000 +0200
@@ -62,7 +62,7 @@ EXPORT_SYMBOL(__percpu_counter_mod64);
* Add up all the per-cpu counts, return the result. This is a more accurate
* but much slower version of percpu_counter_read_positive()
*/
-s64 percpu_counter_sum(struct percpu_counter *fbc)
+s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
s64 ret;
int cpu;
@@ -74,6 +74,6 @@ s64 percpu_counter_sum(struct percpu_cou
ret += *pcount;
}
spin_unlock(&fbc->lock);
- return ret < 0 ? 0 : ret;
+ return ret;
}
-EXPORT_SYMBOL(percpu_counter_sum);
+EXPORT_SYMBOL(__percpu_counter_sum);
* [PATCH 06/15] mm: bdi init hooks
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
` (4 preceding siblings ...)
2007-05-10 10:08 ` [PATCH 05/15] lib: percpu_count_sum_signed() Peter Zijlstra
@ 2007-05-10 10:08 ` Peter Zijlstra
2007-05-10 10:08 ` [PATCH 07/15] mtd: give mtdconcat devices their own backing_dev_info Peter Zijlstra
` (9 subsequent siblings)
15 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-10 10:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita, trond.myklebust, yingchao.zhou
[-- Attachment #1: bdi_init.patch --]
[-- Type: text/plain, Size: 13962 bytes --]
Provide BDI constructor/destructor hooks.
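The pattern applied to each user below, condensed into a sketch (names
hypothetical):

static struct backing_dev_info foo_backing_dev_info = {
	.ra_pages	= (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE,
	.unplug_io_fn	= default_unplug_io_fn,
};

static int __init foo_init(void)
{
	bdi_init(&foo_backing_dev_info);	/* constructor hook */
	return 0;
}

static void __exit foo_exit(void)
{
	bdi_destroy(&foo_backing_dev_info);	/* destructor hook */
}

(For this patch both hooks are still empty inlines; later patches give them
real work to do.)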
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
block/ll_rw_blk.c | 2 ++
drivers/block/rd.c | 6 ++++++
drivers/char/mem.c | 2 ++
drivers/mtd/mtdcore.c | 5 +++++
fs/char_dev.c | 1 +
fs/configfs/configfs_internal.h | 2 ++
fs/configfs/inode.c | 8 ++++++++
fs/configfs/mount.c | 2 ++
fs/fuse/inode.c | 2 ++
fs/hugetlbfs/inode.c | 3 +++
fs/nfs/client.c | 3 +++
fs/ocfs2/dlm/dlmfs.c | 6 +++++-
fs/ramfs/inode.c | 1 +
fs/sysfs/inode.c | 5 +++++
fs/sysfs/mount.c | 2 ++
fs/sysfs/sysfs.h | 1 +
include/linux/backing-dev.h | 7 +++++++
kernel/cpuset.c | 3 +++
mm/readahead.c | 13 ++++++++++---
mm/shmem.c | 1 +
mm/swap.c | 4 ++++
21 files changed, 75 insertions(+), 4 deletions(-)
Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c 2007-05-10 10:19:00.000000000 +0200
+++ linux-2.6/block/ll_rw_blk.c 2007-05-10 10:21:59.000000000 +0200
@@ -1774,6 +1774,7 @@ static void blk_release_queue(struct kob
blk_trace_shutdown(q);
+ bdi_destroy(&q->backing_dev_info);
kmem_cache_free(requestq_cachep, q);
}
@@ -1841,6 +1842,7 @@ request_queue_t *blk_alloc_queue_node(gf
q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
q->backing_dev_info.unplug_io_data = q;
+ bdi_init(&q->backing_dev_info);
mutex_init(&q->sysfs_lock);
Index: linux-2.6/drivers/block/rd.c
===================================================================
--- linux-2.6.orig/drivers/block/rd.c 2007-05-10 10:11:53.000000000 +0200
+++ linux-2.6/drivers/block/rd.c 2007-05-10 10:21:53.000000000 +0200
@@ -411,6 +411,9 @@ static void __exit rd_cleanup(void)
blk_cleanup_queue(rd_queue[i]);
}
unregister_blkdev(RAMDISK_MAJOR, "ramdisk");
+
+ bdi_destroy(&rd_file_backing_dev_info);
+ bdi_destroy(&rd_backing_dev_info);
}
/*
@@ -421,6 +424,9 @@ static int __init rd_init(void)
int i;
int err = -ENOMEM;
+ bdi_init(&rd_backing_dev_info);
+ bdi_init(&rd_file_backing_dev_info);
+
if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 ||
(rd_blocksize & (rd_blocksize-1))) {
printk("RAMDISK: wrong blocksize %d, reverting to defaults\n",
Index: linux-2.6/drivers/char/mem.c
===================================================================
--- linux-2.6.orig/drivers/char/mem.c 2007-05-10 10:11:53.000000000 +0200
+++ linux-2.6/drivers/char/mem.c 2007-05-10 10:21:53.000000000 +0200
@@ -987,6 +987,8 @@ static int __init chr_dev_init(void)
MKDEV(MEM_MAJOR, devlist[i].minor),
devlist[i].name);
+ bdi_init(&zero_bdi);
+
return 0;
}
Index: linux-2.6/fs/char_dev.c
===================================================================
--- linux-2.6.orig/fs/char_dev.c 2007-05-10 10:11:53.000000000 +0200
+++ linux-2.6/fs/char_dev.c 2007-05-10 10:21:53.000000000 +0200
@@ -546,6 +546,7 @@ static struct kobject *base_probe(dev_t
void __init chrdev_init(void)
{
cdev_map = kobj_map_init(base_probe, &chrdevs_lock);
+ bdi_init(&directly_mappable_cdev_bdi);
}
Index: linux-2.6/fs/fuse/inode.c
===================================================================
--- linux-2.6.orig/fs/fuse/inode.c 2007-05-10 10:19:02.000000000 +0200
+++ linux-2.6/fs/fuse/inode.c 2007-05-10 10:21:53.000000000 +0200
@@ -432,6 +432,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(&fc->num_waiting, 0);
fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc->bdi.unplug_io_fn = default_unplug_io_fn;
+ bdi_init(&fc->bdi);
fc->reqctr = 0;
fc->blocked = 1;
get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
@@ -445,6 +446,7 @@ void fuse_conn_put(struct fuse_conn *fc)
if (fc->destroy_req)
fuse_request_free(fc->destroy_req);
mutex_destroy(&fc->inst_mutex);
+ bdi_destroy(&fc->bdi);
kfree(fc);
}
}
Index: linux-2.6/fs/nfs/client.c
===================================================================
--- linux-2.6.orig/fs/nfs/client.c 2007-05-10 10:19:02.000000000 +0200
+++ linux-2.6/fs/nfs/client.c 2007-05-10 10:22:09.000000000 +0200
@@ -658,6 +658,8 @@ static void nfs_server_set_fsinfo(struct
if (server->rsize > NFS_MAX_FILE_IO_SIZE)
server->rsize = NFS_MAX_FILE_IO_SIZE;
server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
+ bdi_init(&server->backing_dev_info);
server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
if (server->wsize > max_rpc_payload)
@@ -787,6 +789,7 @@ void nfs_free_server(struct nfs_server *
nfs_put_client(server->nfs_client);
nfs_free_iostats(server->io_stats);
+ bdi_destroy(&server->backing_dev_info);
kfree(server);
nfs_release_automount_timer();
dprintk("<-- nfs_free_server()\n");
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h 2007-05-10 10:21:46.000000000 +0200
+++ linux-2.6/include/linux/backing-dev.h 2007-05-10 10:21:53.000000000 +0200
@@ -34,6 +34,13 @@ struct backing_dev_info {
void *unplug_io_data;
};
+static inline void bdi_init(struct backing_dev_info *bdi)
+{
+}
+
+static inline void bdi_destroy(struct backing_dev_info *bdi)
+{
+}
/*
* Flags in backing_dev_info::capability
Index: linux-2.6/drivers/mtd/mtdcore.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdcore.c 2007-05-10 10:11:53.000000000 +0200
+++ linux-2.6/drivers/mtd/mtdcore.c 2007-05-10 10:21:53.000000000 +0200
@@ -60,6 +60,7 @@ int add_mtd_device(struct mtd_info *mtd)
break;
}
}
+ bdi_init(mtd->backing_dev_info);
BUG_ON(mtd->writesize == 0);
mutex_lock(&mtd_table_mutex);
@@ -142,6 +143,10 @@ int del_mtd_device (struct mtd_info *mtd
}
mutex_unlock(&mtd_table_mutex);
+
+ if (mtd->backing_dev_info)
+ bdi_destroy(mtd->backing_dev_info);
+
return ret;
}
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c 2007-05-10 10:19:02.000000000 +0200
+++ linux-2.6/fs/hugetlbfs/inode.c 2007-05-10 10:21:53.000000000 +0200
@@ -828,6 +828,8 @@ static int __init init_hugetlbfs_fs(void
out:
if (error)
kmem_cache_destroy(hugetlbfs_inode_cachep);
+ else
+ bdi_init(&hugetlbfs_backing_dev_info);
return error;
}
@@ -835,6 +837,7 @@ static void __exit exit_hugetlbfs_fs(voi
{
kmem_cache_destroy(hugetlbfs_inode_cachep);
unregister_filesystem(&hugetlbfs_fs_type);
+ bdi_destroy(&hugetlbfs_backing_dev_info);
}
module_init(init_hugetlbfs_fs)
Index: linux-2.6/fs/ocfs2/dlm/dlmfs.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dlm/dlmfs.c 2007-05-10 10:11:53.000000000 +0200
+++ linux-2.6/fs/ocfs2/dlm/dlmfs.c 2007-05-10 10:21:53.000000000 +0200
@@ -613,8 +613,10 @@ bail:
kmem_cache_destroy(dlmfs_inode_cache);
if (cleanup_worker)
destroy_workqueue(user_dlm_worker);
- } else
+ } else {
+ bdi_init(&dlmfs_backing_dev_info);
printk("OCFS2 User DLM kernel interface loaded\n");
+ }
return status;
}
@@ -626,6 +628,8 @@ static void __exit exit_dlmfs_fs(void)
destroy_workqueue(user_dlm_worker);
kmem_cache_destroy(dlmfs_inode_cache);
+
+ bdi_destroy(&dlmfs_backing_dev_info);
}
MODULE_AUTHOR("Oracle");
Index: linux-2.6/fs/configfs/configfs_internal.h
===================================================================
--- linux-2.6.orig/fs/configfs/configfs_internal.h 2007-05-10 10:11:53.000000000 +0200
+++ linux-2.6/fs/configfs/configfs_internal.h 2007-05-10 10:21:53.000000000 +0200
@@ -55,6 +55,8 @@ extern int configfs_is_root(struct confi
extern struct inode * configfs_new_inode(mode_t mode, struct configfs_dirent *);
extern int configfs_create(struct dentry *, int mode, int (*init)(struct inode *));
+extern void configfs_inode_init(void);
+extern void configfs_inode_exit(void);
extern int configfs_create_file(struct config_item *, const struct configfs_attribute *);
extern int configfs_make_dirent(struct configfs_dirent *,
Index: linux-2.6/fs/configfs/inode.c
===================================================================
--- linux-2.6.orig/fs/configfs/inode.c 2007-05-10 10:11:53.000000000 +0200
+++ linux-2.6/fs/configfs/inode.c 2007-05-10 10:21:53.000000000 +0200
@@ -255,4 +255,12 @@ void configfs_hash_and_remove(struct den
mutex_unlock(&dir->d_inode->i_mutex);
}
+void __init configfs_inode_init(void)
+{
+ bdi_init(&configfs_backing_dev_info);
+}
+void __exit configfs_inode_exit(void)
+{
+ bdi_destroy(&configfs_backing_dev_info);
+}
Index: linux-2.6/fs/configfs/mount.c
===================================================================
--- linux-2.6.orig/fs/configfs/mount.c 2007-05-10 10:19:02.000000000 +0200
+++ linux-2.6/fs/configfs/mount.c 2007-05-10 10:21:53.000000000 +0200
@@ -156,6 +156,7 @@ static int __init configfs_init(void)
configfs_dir_cachep = NULL;
}
+ configfs_inode_init();
out:
return err;
}
@@ -166,6 +167,7 @@ static void __exit configfs_exit(void)
subsystem_unregister(&config_subsys);
kmem_cache_destroy(configfs_dir_cachep);
configfs_dir_cachep = NULL;
+ configfs_inode_exit();
}
MODULE_AUTHOR("Oracle");
Index: linux-2.6/fs/ramfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ramfs/inode.c 2007-05-10 10:11:53.000000000 +0200
+++ linux-2.6/fs/ramfs/inode.c 2007-05-10 10:21:53.000000000 +0200
@@ -223,6 +223,7 @@ module_exit(exit_ramfs_fs)
int __init init_rootfs(void)
{
+ bdi_init(&ramfs_backing_dev_info);
return register_filesystem(&rootfs_fs_type);
}
Index: linux-2.6/fs/sysfs/inode.c
===================================================================
--- linux-2.6.orig/fs/sysfs/inode.c 2007-05-10 10:19:02.000000000 +0200
+++ linux-2.6/fs/sysfs/inode.c 2007-05-10 10:21:53.000000000 +0200
@@ -33,6 +33,11 @@ static const struct inode_operations sys
.setattr = sysfs_setattr,
};
+void __init sysfs_inode_init(void)
+{
+ bdi_init(&sysfs_backing_dev_info);
+}
+
void sysfs_delete_inode(struct inode *inode)
{
/* Free the shadowed directory inode operations */
Index: linux-2.6/fs/sysfs/mount.c
===================================================================
--- linux-2.6.orig/fs/sysfs/mount.c 2007-05-10 10:19:02.000000000 +0200
+++ linux-2.6/fs/sysfs/mount.c 2007-05-10 10:21:53.000000000 +0200
@@ -101,6 +101,8 @@ int __init sysfs_init(void)
} else
goto out_err;
out:
+ if (!err)
+ sysfs_inode_init();
return err;
out_err:
kmem_cache_destroy(sysfs_dir_cachep);
Index: linux-2.6/fs/sysfs/sysfs.h
===================================================================
--- linux-2.6.orig/fs/sysfs/sysfs.h 2007-05-10 10:19:02.000000000 +0200
+++ linux-2.6/fs/sysfs/sysfs.h 2007-05-10 10:22:15.000000000 +0200
@@ -60,6 +60,7 @@ extern void sysfs_delete_inode(struct in
extern struct inode * sysfs_new_inode(mode_t mode, struct sysfs_dirent *);
extern int sysfs_create(struct sysfs_dirent *sd, struct dentry *dentry,
int mode, int (*init)(struct inode *));
+extern void sysfs_inode_init(void);
extern void release_sysfs_dirent(struct sysfs_dirent * sd);
extern int sysfs_dirent_exist(struct sysfs_dirent *, const unsigned char *);
Index: linux-2.6/kernel/cpuset.c
===================================================================
--- linux-2.6.orig/kernel/cpuset.c 2007-05-10 10:19:03.000000000 +0200
+++ linux-2.6/kernel/cpuset.c 2007-05-10 10:21:53.000000000 +0200
@@ -1935,6 +1935,7 @@ int __init cpuset_init_early(void)
tsk->cpuset = &top_cpuset;
tsk->cpuset->mems_generation = cpuset_mems_generation++;
+
return 0;
}
@@ -1977,6 +1978,8 @@ int __init cpuset_init(void)
/* memory_pressure_enabled is in root cpuset only */
if (err == 0)
err = cpuset_add_file(root, &cft_memory_pressure_enabled);
+ if (!err)
+ bdi_init(&cpuset_backing_dev_info);
out:
return err;
}
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c 2007-05-10 10:11:53.000000000 +0200
+++ linux-2.6/mm/shmem.c 2007-05-10 10:21:53.000000000 +0200
@@ -2477,6 +2477,7 @@ static int __init init_tmpfs(void)
printk(KERN_ERR "Could not kern_mount tmpfs\n");
goto out1;
}
+ bdi_init(&shmem_backing_dev_info);
return 0;
out1:
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c 2007-05-10 10:11:53.000000000 +0200
+++ linux-2.6/mm/swap.c 2007-05-10 10:21:53.000000000 +0200
@@ -564,6 +564,10 @@ void __init swap_setup(void)
{
unsigned long megs = num_physpages >> (20 - PAGE_SHIFT);
+#ifdef CONFIG_SWAP
+ bdi_init(swapper_space.backing_dev_info);
+#endif
+
/* Use a smaller cluster for small-memory machines */
if (megs < 16)
page_cluster = 2;
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c 2007-05-10 10:19:03.000000000 +0200
+++ linux-2.6/mm/readahead.c 2007-05-10 10:22:59.000000000 +0200
@@ -581,3 +581,10 @@ unsigned long max_sane_readahead(unsigne
return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE)
+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
}
+
+static int __init readahead_init(void)
+{
+ bdi_init(&default_backing_dev_info);
+ return 0;
+}
+subsys_initcall(readahead_init);
\ No newline at end of file
* [PATCH 07/15] mtd: give mtdconcat devices their own backing_dev_info
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
` (5 preceding siblings ...)
2007-05-10 10:08 ` [PATCH 06/15] mm: bdi init hooks Peter Zijlstra
@ 2007-05-10 10:08 ` Peter Zijlstra
2007-05-10 10:08 ` [PATCH 08/15] mm: scalable bdi statistics counters Peter Zijlstra
` (8 subsequent siblings)
15 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-10 10:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita, trond.myklebust, yingchao.zhou, Robert Kaiser
[-- Attachment #1: bdi_mtdconcat.patch --]
[-- Type: text/plain, Size: 3508 bytes --]
These are actual devices; give them their own BDI.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Kaiser <rkaiser@sysgo.de>
---
drivers/mtd/mtdconcat.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)
Index: linux-2.6/drivers/mtd/mtdconcat.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdconcat.c 2007-04-22 18:55:17.000000000 +0200
+++ linux-2.6/drivers/mtd/mtdconcat.c 2007-04-22 19:01:42.000000000 +0200
@@ -32,6 +32,7 @@ struct mtd_concat {
struct mtd_info mtd;
int num_subdev;
struct mtd_info **subdev;
+ struct backing_dev_info backing_dev_info;
};
/*
@@ -782,10 +783,9 @@ struct mtd_info *mtd_concat_create(struc
for (i = 1; i < num_devs; i++) {
if (concat->mtd.type != subdev[i]->type) {
- kfree(concat);
printk("Incompatible device type on \"%s\"\n",
subdev[i]->name);
- return NULL;
+ goto error;
}
if (concat->mtd.flags != subdev[i]->flags) {
/*
@@ -794,10 +794,9 @@ struct mtd_info *mtd_concat_create(struc
*/
if ((concat->mtd.flags ^ subdev[i]->
flags) & ~MTD_WRITEABLE) {
- kfree(concat);
printk("Incompatible device flags on \"%s\"\n",
subdev[i]->name);
- return NULL;
+ goto error;
} else
/* if writeable attribute differs,
make super device writeable */
@@ -809,9 +808,12 @@ struct mtd_info *mtd_concat_create(struc
* - copy-mapping is still permitted
*/
if (concat->mtd.backing_dev_info !=
- subdev[i]->backing_dev_info)
+ subdev[i]->backing_dev_info) {
+ concat->backing_dev_info = default_backing_dev_info;
+ bdi_init(&concat->backing_dev_info);
concat->mtd.backing_dev_info =
- &default_backing_dev_info;
+ &concat->backing_dev_info;
+ }
concat->mtd.size += subdev[i]->size;
concat->mtd.ecc_stats.badblocks +=
@@ -821,10 +823,9 @@ struct mtd_info *mtd_concat_create(struc
concat->mtd.oobsize != subdev[i]->oobsize ||
!concat->mtd.read_oob != !subdev[i]->read_oob ||
!concat->mtd.write_oob != !subdev[i]->write_oob) {
- kfree(concat);
printk("Incompatible OOB or ECC data on \"%s\"\n",
subdev[i]->name);
- return NULL;
+ goto error;
}
concat->subdev[i] = subdev[i];
@@ -903,11 +904,10 @@ struct mtd_info *mtd_concat_create(struc
kmalloc(num_erase_region *
sizeof (struct mtd_erase_region_info), GFP_KERNEL);
if (!erase_region_p) {
- kfree(concat);
printk
("memory allocation error while creating erase region list"
" for device \"%s\"\n", name);
- return NULL;
+ goto error;
}
/*
@@ -968,6 +968,12 @@ struct mtd_info *mtd_concat_create(struc
}
return &concat->mtd;
+
+error:
+ if (concat->mtd.backing_dev_info == &concat->backing_dev_info)
+ bdi_destroy(&concat->backing_dev_info);
+ kfree(concat);
+ return NULL;
}
/*
@@ -977,6 +983,8 @@ struct mtd_info *mtd_concat_create(struc
void mtd_concat_destroy(struct mtd_info *mtd)
{
struct mtd_concat *concat = CONCAT(mtd);
+ if (concat->mtd.backing_dev_info == &concat->backing_dev_info)
+ bdi_destroy(&concat->backing_dev_info);
if (concat->mtd.numeraseregions)
kfree(concat->mtd.eraseregions);
kfree(concat);
* [PATCH 08/15] mm: scalable bdi statistics counters.
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
` (6 preceding siblings ...)
2007-05-10 10:08 ` [PATCH 07/15] mtd: give mtdconcat devices their own backing_dev_info Peter Zijlstra
@ 2007-05-10 10:08 ` Peter Zijlstra
2007-05-10 10:08 ` [PATCH 09/15] mm: count reclaimable pages per BDI Peter Zijlstra
` (7 subsequent siblings)
15 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-10 10:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita, trond.myklebust, yingchao.zhou
[-- Attachment #1: bdi_stat.patch --]
[-- Type: text/plain, Size: 4601 bytes --]
Provide scalable per backing_dev_info statistics counters.
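Typical use of the accessors (sketch; note this patch defines no bdi_stat_item
yet, BDI_RECLAIMABLE only appears later in the series):

static s64 foo_reclaimable(struct backing_dev_info *bdi)
{
	s64 nr = bdi_stat(bdi, BDI_RECLAIMABLE);	/* cheap, approximate */

	if (nr <= bdi_stat_error(bdi))			/* too small to trust? */
		nr = bdi_stat_sum(bdi, BDI_RECLAIMABLE);	/* slow but exact */
	return nr;
}

The inc/dec wrappers follow the zone stat convention: inc_bdi_stat() is
irq-safe, __inc_bdi_stat() assumes the caller already disabled interrupts.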
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/backing-dev.h | 96 +++++++++++++++++++++++++++++++++++++++++++-
mm/backing-dev.c | 21 +++++++++
2 files changed, 115 insertions(+), 2 deletions(-)
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h 2007-05-10 10:21:53.000000000 +0200
+++ linux-2.6/include/linux/backing-dev.h 2007-05-10 10:23:26.000000000 +0200
@@ -8,6 +8,8 @@
#ifndef _LINUX_BACKING_DEV_H
#define _LINUX_BACKING_DEV_H
+#include <linux/percpu_counter.h>
+#include <linux/log2.h>
#include <asm/atomic.h>
struct page;
@@ -24,6 +26,12 @@ enum bdi_state {
typedef int (congested_fn)(void *, int);
+enum bdi_stat_item {
+ NR_BDI_STAT_ITEMS
+};
+
+#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
+
struct backing_dev_info {
unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
unsigned long state; /* Always use atomic bitops on this */
@@ -32,14 +40,98 @@ struct backing_dev_info {
void *congested_data; /* Pointer to aux data for congested func */
void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
void *unplug_io_data;
+
+ struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
};
-static inline void bdi_init(struct backing_dev_info *bdi)
+void bdi_init(struct backing_dev_info *bdi);
+void bdi_destroy(struct backing_dev_info *bdi);
+
+static inline void __mod_bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item, s32 amount)
+{
+ __percpu_counter_mod(&bdi->bdi_stat[item], amount, BDI_STAT_BATCH);
+}
+
+static inline void __mod_bdi_stat64(struct backing_dev_info *bdi,
+ enum bdi_stat_item item, s64 amount)
+{
+ __percpu_counter_mod64(&bdi->bdi_stat[item], amount, BDI_STAT_BATCH);
+}
+
+static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item)
+{
+ __mod_bdi_stat(bdi, item, 1);
+}
+
+static inline void inc_bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __inc_bdi_stat(bdi, item);
+ local_irq_restore(flags);
+}
+
+static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item)
{
+ __mod_bdi_stat(bdi, item, -1);
}
-static inline void bdi_destroy(struct backing_dev_info *bdi)
+static inline void dec_bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item)
{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __dec_bdi_stat(bdi, item);
+ local_irq_restore(flags);
+}
+
+static inline u64 bdi_stat_unsigned(struct backing_dev_info *bdi,
+ enum bdi_stat_item item)
+{
+ return percpu_counter_read(&bdi->bdi_stat[item]);
+}
+
+static inline s64 bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item)
+{
+ return percpu_counter_read_positive(&bdi->bdi_stat[item]);
+}
+
+static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
+ enum bdi_stat_item item)
+{
+ return percpu_counter_sum(&bdi->bdi_stat[item]);
+}
+
+static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
+ enum bdi_stat_item item)
+{
+ s64 sum;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ sum = __bdi_stat_sum(bdi, item);
+ local_irq_restore(flags);
+
+ return sum;
+}
+
+/*
+ * maximal error of a stat counter.
+ */
+static inline unsigned long bdi_stat_error(struct backing_dev_info *bdi)
+{
+#ifdef CONFIG_SMP
+ return nr_cpu_ids * BDI_STAT_BATCH;
+#else
+ return 1;
+#endif
}
/*
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c 2007-05-10 10:21:46.000000000 +0200
+++ linux-2.6/mm/backing-dev.c 2007-05-10 10:23:08.000000000 +0200
@@ -5,6 +5,24 @@
#include <linux/sched.h>
#include <linux/module.h>
+void bdi_init(struct backing_dev_info *bdi)
+{
+ int i;
+
+ for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+ percpu_counter_init(&bdi->bdi_stat[i], 0);
+}
+EXPORT_SYMBOL(bdi_init);
+
+void bdi_destroy(struct backing_dev_info *bdi)
+{
+ int i;
+
+ for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+ percpu_counter_destroy(&bdi->bdi_stat[i]);
+}
+EXPORT_SYMBOL(bdi_destroy);
+
static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
* [PATCH 09/15] mm: count reclaimable pages per BDI
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
` (7 preceding siblings ...)
2007-05-10 10:08 ` [PATCH 08/15] mm: scalable bdi statistics counters Peter Zijlstra
@ 2007-05-10 10:08 ` Peter Zijlstra
2007-05-10 10:08 ` [PATCH 10/15] mm: count writeback " Peter Zijlstra
` (6 subsequent siblings)
15 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-10 10:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita, trond.myklebust, yingchao.zhou
[-- Attachment #1: bdi_stat_reclaimable.patch --]
[-- Type: text/plain, Size: 4681 bytes --]
Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.
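Condensed, the pattern this patch applies in each file (sketch): every place
that moves a page in or out of NR_FILE_DIRTY or NR_UNSTABLE_NFS now mirrors
that into the page's BDI:

	/* dirtying */
	__inc_zone_page_state(page, NR_FILE_DIRTY);
	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);

	/* cleaning / NFS commit done */
	dec_zone_page_state(page, NR_FILE_DIRTY);
	dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);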
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
fs/buffer.c | 2 ++
fs/nfs/write.c | 7 +++++++
include/linux/backing-dev.h | 1 +
mm/page-writeback.c | 4 ++++
mm/truncate.c | 2 ++
5 files changed, 16 insertions(+)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c 2007-04-24 21:18:29.000000000 +0200
+++ linux-2.6/fs/buffer.c 2007-04-25 08:20:10.000000000 +0200
@@ -734,6 +734,8 @@ int __set_page_dirty_buffers(struct page
if (page->mapping) { /* Race with truncate? */
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+ __inc_bdi_stat(mapping->backing_dev_info,
+ BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(&mapping->page_tree,
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c 2007-04-24 21:18:30.000000000 +0200
+++ linux-2.6/mm/page-writeback.c 2007-04-25 08:20:10.000000000 +0200
@@ -828,6 +828,8 @@ int __set_page_dirty_nobuffers(struct pa
BUG_ON(mapping2 != mapping);
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+ __inc_bdi_stat(mapping->backing_dev_info,
+ BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(&mapping->page_tree,
@@ -961,6 +963,8 @@ int clear_page_dirty_for_io(struct page
*/
if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+ dec_bdi_stat(mapping->backing_dev_info,
+ BDI_RECLAIMABLE);
return 1;
}
return 0;
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c 2007-04-24 21:18:30.000000000 +0200
+++ linux-2.6/mm/truncate.c 2007-04-25 08:20:10.000000000 +0200
@@ -72,6 +72,8 @@ void cancel_dirty_page(struct page *page
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+ dec_bdi_stat(mapping->backing_dev_info,
+ BDI_RECLAIMABLE);
if (account_size)
task_io_account_cancelled_write(account_size);
}
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c 2007-04-25 08:17:04.000000000 +0200
+++ linux-2.6/fs/nfs/write.c 2007-04-25 08:20:29.000000000 +0200
@@ -454,6 +454,7 @@ nfs_mark_request_commit(struct nfs_page
set_bit(PG_NEED_COMMIT, &(req)->wb_flags);
spin_unlock(&nfsi->req_lock);
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+ inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
}
@@ -552,6 +553,8 @@ static void nfs_cancel_commit_list(struc
while(!list_empty(head)) {
req = nfs_list_entry(head->next);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+ dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+ BDI_RECLAIMABLE);
nfs_list_remove_request(req);
clear_bit(PG_NEED_COMMIT, &(req)->wb_flags);
nfs_inode_remove_request(req);
@@ -1269,6 +1272,8 @@ nfs_commit_list(struct inode *inode, str
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+ dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+ BDI_RECLAIMABLE);
nfs_clear_page_writeback(req);
}
return -ENOMEM;
@@ -1294,6 +1299,8 @@ static void nfs_commit_done(struct rpc_t
nfs_list_remove_request(req);
clear_bit(PG_NEED_COMMIT, &(req)->wb_flags);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+ dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+ BDI_RECLAIMABLE);
dprintk("NFS: commit (%s/%Ld %d@%Ld)",
req->wb_context->dentry->d_inode->i_sb->s_id,
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h 2007-04-25 08:20:03.000000000 +0200
+++ linux-2.6/include/linux/backing-dev.h 2007-04-25 08:20:10.000000000 +0200
@@ -27,6 +27,7 @@ enum bdi_state {
typedef int (congested_fn)(void *, int);
enum bdi_stat_item {
+ BDI_RECLAIMABLE,
NR_BDI_STAT_ITEMS
};
* [PATCH 10/15] mm: count writeback pages per BDI
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
` (8 preceding siblings ...)
2007-05-10 10:08 ` [PATCH 09/15] mm: count reclaimable pages per BDI Peter Zijlstra
@ 2007-05-10 10:08 ` Peter Zijlstra
2007-05-10 10:08 ` [PATCH 11/15] mm: expose BDI statistics in sysfs Peter Zijlstra
` (5 subsequent siblings)
15 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-10 10:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita, trond.myklebust, yingchao.zhou
[-- Attachment #1: bdi_stat_writeback.patch --]
[-- Type: text/plain, Size: 2300 bytes --]
Count per BDI writeback pages.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/backing-dev.h | 1 +
mm/page-writeback.c | 12 ++++++++++--
2 files changed, 11 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c 2007-04-20 15:27:28.000000000 +0200
+++ linux-2.6/mm/page-writeback.c 2007-04-20 15:28:10.000000000 +0200
@@ -979,14 +979,18 @@ int test_clear_page_writeback(struct pag
int ret;
if (mapping) {
+ struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long flags;
write_lock_irqsave(&mapping->tree_lock, flags);
ret = TestClearPageWriteback(page);
- if (ret)
+ if (ret) {
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
+ if (bdi_cap_writeback_dirty(bdi))
+ __dec_bdi_stat(bdi, BDI_WRITEBACK);
+ }
write_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestClearPageWriteback(page);
@@ -1002,14 +1006,18 @@ int test_set_page_writeback(struct page
int ret;
if (mapping) {
+ struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long flags;
write_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
- if (!ret)
+ if (!ret) {
radix_tree_tag_set(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
+ if (bdi_cap_writeback_dirty(bdi))
+ __inc_bdi_stat(bdi, BDI_WRITEBACK);
+ }
if (!PageDirty(page))
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h 2007-04-20 15:25:47.000000000 +0200
+++ linux-2.6/include/linux/backing-dev.h 2007-04-20 15:28:17.000000000 +0200
@@ -27,6 +27,7 @@ typedef int (congested_fn)(void *, int);
enum bdi_stat_item {
BDI_RECLAIMABLE,
+ BDI_WRITEBACK,
NR_BDI_STAT_ITEMS
};
* [PATCH 11/15] mm: expose BDI statistics in sysfs.
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
` (9 preceding siblings ...)
2007-05-10 10:08 ` [PATCH 10/15] mm: count writeback " Peter Zijlstra
@ 2007-05-10 10:08 ` Peter Zijlstra
2007-05-10 10:08 ` [PATCH 12/15] mm: per device dirty threshold Peter Zijlstra
` (4 subsequent siblings)
15 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-10 10:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita, trond.myklebust, yingchao.zhou
[-- Attachment #1: bdi_stat_sysfs.patch --]
[-- Type: text/plain, Size: 2261 bytes --]
Expose the per BDI stats in /sys/block/<dev>/queue/*
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
block/ll_rw_blk.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c 2007-05-10 10:21:59.000000000 +0200
+++ linux-2.6/block/ll_rw_blk.c 2007-05-10 10:23:39.000000000 +0200
@@ -3978,6 +3978,23 @@ static ssize_t queue_max_hw_sectors_show
return queue_var_show(max_hw_sectors_kb, (page));
}
+static ssize_t queue_nr_reclaimable_show(struct request_queue *q, char *page)
+{
+ unsigned long long nr_reclaimable =
+ bdi_stat(&q->backing_dev_info, BDI_RECLAIMABLE);
+
+ return sprintf(page, "%llu\n",
+ nr_reclaimable >> (PAGE_CACHE_SHIFT - 10));
+}
+
+static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
+{
+ unsigned long long nr_writeback =
+ bdi_stat(&q->backing_dev_info, BDI_WRITEBACK);
+
+ return sprintf(page, "%llu\n",
+ nr_writeback >> (PAGE_CACHE_SHIFT - 10));
+}
static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4002,6 +4019,16 @@ static struct queue_sysfs_entry queue_ma
.show = queue_max_hw_sectors_show,
};
+static struct queue_sysfs_entry queue_reclaimable_entry = {
+ .attr = {.name = "reclaimable_kb", .mode = S_IRUGO },
+ .show = queue_nr_reclaimable_show,
+};
+
+static struct queue_sysfs_entry queue_writeback_entry = {
+ .attr = {.name = "writeback_kb", .mode = S_IRUGO },
+ .show = queue_nr_writeback_show,
+};
+
static struct queue_sysfs_entry queue_iosched_entry = {
.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
.show = elv_iosched_show,
@@ -4013,6 +4040,8 @@ static struct attribute *default_attrs[]
&queue_ra_entry.attr,
&queue_max_hw_sectors_entry.attr,
&queue_max_sectors_entry.attr,
+ &queue_reclaimable_entry.attr,
+ &queue_writeback_entry.attr,
&queue_iosched_entry.attr,
NULL,
};
* [PATCH 12/15] mm: per device dirty threshold
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
` (10 preceding siblings ...)
2007-05-10 10:08 ` [PATCH 11/15] mm: expose BDI statistics in sysfs Peter Zijlstra
@ 2007-05-10 10:08 ` Peter Zijlstra
2007-05-10 10:08 ` [PATCH 13/15] debug: sysfs files for the current ratio/size/total Peter Zijlstra
` (3 subsequent siblings)
15 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-10 10:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita, trond.myklebust, yingchao.zhou
[-- Attachment #1: writeback-balance-per-backing_dev.patch --]
[-- Type: text/plain, Size: 20717 bytes --]
Scale writeback cache per backing device, proportional to its writeout speed.
By decoupling the BDI dirty thresholds a number of problems we currently have
will go away, namely:
- mutual interference starvation (for any number of BDIs);
- deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).
It might be that all dirty pages are for a single BDI while other BDIs are
idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
dirty pages outstanding and make progress.
A global threshold also creates a deadlock for stacked BDIs; when A writes to
B, and A generates enough dirty pages to get throttled, B will never start
writeback until the dirty pages go away. Again, by giving each BDI its own
'independent' dirty limit, this problem is avoided.
So the problem is to determine how to distribute the total dirty limit across
the BDIs fairly and efficiently. A BDI that has a large dirty limit but does
not have any dirty pages outstanding is a waste.
What is done is to keep a floating proportion between the BDIs based on
writeback completions. This way faster/more active devices get a larger share
than slower/idle devices.
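As a toy illustration of the floating proportion described above (userspace
sketch, not part of the patch): device A completing three writebacks for every
one of device B converges to shares of roughly 3/4 and 1/4 of the dirty limit:

#include <stdio.h>

int main(void)
{
	unsigned long period = 1024, t = 0, x[2] = { 0, 0 };
	unsigned long i;

	for (i = 0; i < 100000; i++) {
		int j = (i % 4 == 3);		/* B gets 1 in 4 completions */

		x[j]++;				/* ++x_{j} */
		if (++t > period) {		/* a period elapsed: age everything */
			t /= 2;
			x[0] /= 2;
			x[1] /= 2;
		}
	}
	printf("p_A = %.2f  p_B = %.2f\n",
	       (double)x[0] / t, (double)x[1] / t);
	return 0;
}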
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/backing-dev.h | 4
kernel/sysctl.c | 5
mm/backing-dev.c | 5
mm/page-writeback.c | 467 ++++++++++++++++++++++++++++++++++++++++----
4 files changed, 447 insertions(+), 34 deletions(-)
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h 2007-05-10 10:41:23.000000000 +0200
+++ linux-2.6/include/linux/backing-dev.h 2007-05-10 10:49:39.000000000 +0200
@@ -10,6 +10,7 @@
#include <linux/percpu_counter.h>
#include <linux/log2.h>
+#include <linux/proportions.h>
#include <asm/atomic.h>
struct page;
@@ -44,6 +45,9 @@ struct backing_dev_info {
void *unplug_io_data;
struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
+
+ struct prop_local completions;
+ int dirty_exceeded;
};
void bdi_init(struct backing_dev_info *bdi);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c 2007-05-10 10:41:23.000000000 +0200
+++ linux-2.6/mm/page-writeback.c 2007-05-10 10:49:39.000000000 +0200
@@ -2,6 +2,7 @@
* mm/page-writeback.c
*
* Copyright (C) 2002, Linus Torvalds.
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
*
* Contains functions related to writing back dirty pages at the
* address_space level.
@@ -49,8 +50,6 @@
*/
static long ratelimit_pages = 32;
-static int dirty_exceeded __cacheline_aligned_in_smp; /* Dirty mem may be over limit */
-
/*
* When balance_dirty_pages decides that the caller needs to perform some
* non-background writeback, this is how many pages it will attempt to write.
@@ -103,6 +102,338 @@ EXPORT_SYMBOL(laptop_mode);
static void background_writeout(unsigned long _min_pages);
/*
+ * Scale the writeback cache size proportional to the relative writeout speeds.
+ *
+ * We do this by keeping a floating proportion between BDIs, based on page
+ * writeback completions [end_page_writeback()]. Those devices that write out
+ * pages fastest will get the larger share, while the slower will get a smaller
+ * share.
+ *
+ * We use page writeout completions because we are interested in getting rid of
+ * dirty pages. Having them written out is the primary goal.
+ *
+ * We introduce a concept of time, a period over which we measure these events,
+ * because demand can/will vary over time. The length of this period itself is
+ * measured in page writeback completions.
+ *
+ * DETAILS:
+ *
+ * The floating proportion is a time derivative with an exponentially decaying
+ * history:
+ *
+ * p_{j} = \Sum_{i=0} (dx_{j}/dt_{-i}) / 2^(1+i)
+ *
+ * Where j is an element from {BDIs}, x_{j} is j's number of completions, and i
+ * the time period over which the differential is taken. So d/dt_{-i} is the
+ * differential over the i-th last period.
+ *
+ * The decaying history gives smooth transitions. The time differential carries
+ * the notion of speed.
+ *
+ * The denominator is 2^(1+i) because we want the series to be normalised, ie.
+ *
+ * \Sum_{i=0} 1/2^(1+i) = 1
+ *
+ * Furthermore, if we measure time (t) in the same events as x, so that:
+ *
+ * t = \Sum_{j} x_{j}
+ *
+ * we get that:
+ *
+ * \Sum_{j} p_{j} = 1
+ *
+ * Writing this in an iterative fashion we get (dropping the 'd's):
+ *
+ * if (++x_{j}, ++t > period)
+ * t /= 2;
+ * for_each (j)
+ * x_{j} /= 2;
+ *
+ * so that:
+ *
+ * p_{j} = x_{j} / t;
+ *
+ * We optimize away the '/= 2' for the global time delta by noting that:
+ *
+ * if (++t > period) t /= 2:
+ *
+ * Can be approximated by:
+ *
+ * period/2 + (++t % period/2)
+ *
+ * [ Furthermore, when we choose period to be 2^n it can be written in terms of
+ * binary operations and wraparound artefacts disappear. ]
+ *
+ * Also note that this yields a natural counter of the elapsed periods:
+ *
+ * c = t / (period/2)
+ *
+ * [ Its monotonic increasing property can be applied to mitigate the wrap-
+ * around issue. ]
+ *
+ * This allows us to do away with the loop over all BDIs on each period
+ * expiration. By remembering the period count under which it was last
+ * accessed as c_{j}, we can obtain the number of 'missed' cycles from:
+ *
+ * c - c_{j}
+ *
+ * We can then lazily catch up to the global period count every time we are
+ * going to use x_{j}, by doing:
+ *
+ * x_{j} /= 2^(c - c_{j}), c_{j} = c
+ */
+
+struct vm_completions_data {
+ /*
+ * The period over which we differentiate (in pages)
+ *
+ * period = 2^shift
+ */
+ int shift;
+
+ /*
+ * The total page writeback completion counter aka 'time'.
+ *
+ * Treated as an unsigned long; the lower 'shift - 1' bits are the
+ * counter bits, the remaining upper bits the period counter.
+ */
+ struct percpu_counter completions;
+};
+
+static int vm_completions_index;
+static struct vm_completions_data vm_completions[2];
+static DEFINE_MUTEX(vm_completions_mutex);
+
+static unsigned long determine_dirtyable_memory(void);
+
+/*
+ * couple the period to the dirty_ratio:
+ *
+ * period/2 ~ roundup_pow_of_two(dirty limit)
+ */
+static int calc_period_shift(void)
+{
+ unsigned long dirty_total;
+
+ dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
+ return 2 + ilog2(dirty_total - 1);
+}
+
+static void vcd_init(void)
+{
+ vm_completions[0].shift = calc_period_shift();
+ percpu_counter_init(&vm_completions[0].completions, 0);
+ percpu_counter_init(&vm_completions[1].completions, 0);
+}
+
+/*
+ * We have two copies, and flip between them to make it seem like an atomic
+ * update. The update is not really atomic wrt the completions counter, but
+ * it is internally consistent with the bit layout depending on shift.
+ *
+ * We calculate the new shift, copy the completions count, move the bits around
+ * and flip the index.
+ */
+static void vcd_flip(void)
+{
+ int index;
+ int shift;
+ int offset;
+ u64 completions;
+ unsigned long flags;
+
+ mutex_lock(&vm_completions_mutex);
+
+ index = vm_completions_index ^ 1;
+ shift = calc_period_shift();
+ offset = vm_completions[vm_completions_index].shift - shift;
+ if (!offset)
+ goto out;
+
+ vm_completions[index].shift = shift;
+
+ local_irq_save(flags);
+ completions = percpu_counter_sum_signed(
+ &vm_completions[vm_completions_index].completions);
+
+ if (offset < 0)
+ completions <<= -offset;
+ else
+ completions >>= offset;
+
+ percpu_counter_set(&vm_completions[index].completions, completions);
+
+ /*
+ * ensure the new vcd is fully written before the switch
+ */
+ smp_wmb();
+ vm_completions_index = index;
+ local_irq_restore(flags);
+
+ synchronize_rcu();
+
+out:
+ mutex_unlock(&vm_completions_mutex);
+}
+
+/*
+ * wrap the access to the data in an rcu_read_lock() section;
+ * this is used to track the active references.
+ */
+static struct vm_completions_data *get_vcd(void)
+{
+ int index;
+
+ rcu_read_lock();
+ index = vm_completions_index;
+ /*
+ * match the wmb from vcd_flip()
+ */
+ smp_rmb();
+ return &vm_completions[index];
+}
+
+static void put_vcd(struct vm_completions_data *vcd)
+{
+ rcu_read_unlock();
+}
+
+/*
+ * update the period when the dirty ratio changes.
+ */
+int dirty_ratio_handler(ctl_table *table, int write,
+ struct file *filp, void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ int old_ratio = vm_dirty_ratio;
+ int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos);
+ if (ret == 0 && write && vm_dirty_ratio != old_ratio)
+ vcd_flip();
+ return ret;
+}
+
+/*
+ * adjust the bdi local data to changes in the bit layout.
+ */
+static void bdi_adjust_shift(struct backing_dev_info *bdi, int shift)
+{
+ int offset = bdi->shift - shift;
+
+ if (!offset)
+ return;
+
+ if (offset < 0)
+ bdi->period <<= -offset;
+ else
+ bdi->period >>= offset;
+
+ bdi->shift = shift;
+}
+
+/*
+ * Catch up with missed period expirations.
+ *
+ * until (c_{j} == c)
+ * x_{j} -= x_{j}/2;
+ * c_{j}++;
+ */
+static void bdi_writeout_norm(struct backing_dev_info *bdi,
+ struct vm_completions_data *vcd)
+{
+ unsigned long period = 1UL << (vcd->shift - 1);
+ unsigned long period_mask = ~(period - 1);
+ unsigned long global_period;
+ unsigned long flags;
+
+ global_period = percpu_counter_read(&vcd->completions);
+ global_period &= period_mask;
+
+ /*
+ * Fast path - check if the local and global period count still match
+ * outside of the lock.
+ */
+ if (bdi->period == global_period)
+ return;
+
+ spin_lock_irqsave(&bdi->lock, flags);
+ bdi_adjust_shift(bdi, vcd->shift);
+ /*
+ * For each missed period, we half the local counter.
+ * basically:
+ * bdi_stat(bdi, BDI_COMPLETION) >> (global_period - bdi->period);
+ *
+ * but since the distributed nature of percpu counters make division
+ * rather hard, use a regular subtraction loop. This is safe, because
+ * BDI_COMPLETION will only every be incremented, hence the subtraction
+ * can never result in a negative number.
+ */
+ while (bdi->period != global_period) {
+ unsigned long val = bdi_stat_unsigned(bdi, BDI_COMPLETION);
+ unsigned long half = (val + 1) >> 1;
+
+ /*
+ * Half of zero won't be much less, break out.
+ * This limits the loop to shift iterations, even
+ * if we missed a million.
+ */
+ if (!val)
+ break;
+
+ /*
+ * Iff shift >32 half might exceed the limits of
+ * the regular percpu_counter_mod.
+ */
+ __mod_bdi_stat64(bdi, BDI_COMPLETION, -half);
+ bdi->period += period;
+ }
+ bdi->period = global_period;
+ bdi->shift = vcd->shift;
+ spin_unlock_irqrestore(&bdi->lock, flags);
+}
+
+/*
+ * Increment the BDI's writeout completion count and the global writeout
+ * completion count. Called from test_clear_page_writeback().
+ *
+ * ++x_{j}, ++t
+ */
+static void __bdi_writeout_inc(struct backing_dev_info *bdi)
+{
+ struct vm_completions_data *vcd = get_vcd();
+ /* Catch up with missed period expirations before using the counter. */
+ bdi_writeout_norm(bdi, vcd);
+ __inc_bdi_stat(bdi, BDI_COMPLETION);
+
+ percpu_counter_mod(&vcd->completions, 1);
+ put_vcd(vcd);
+}
+
+/*
+ * Obtain an accurate fraction of the BDI's portion.
+ *
+ * p_{j} = x_{j} / (period/2 + t % period/2)
+ */
+static void bdi_writeout_fraction(struct backing_dev_info *bdi,
+ long *numerator, long *denominator)
+{
+ struct vm_completions_data *vcd = get_vcd();
+ unsigned long period_2 = 1UL << (vcd->shift - 1);
+ unsigned long counter_mask = period_2 - 1;
+ unsigned long global_count;
+
+ if (bdi_cap_writeback_dirty(bdi)) {
+ /* Catch up with the period expirations before use. */
+ bdi_writeout_norm(bdi, vcd);
+ *numerator = bdi_stat(bdi, BDI_COMPLETION);
+ } else
+ *numerator = 0;
+
+ global_count = percpu_counter_read(&vcd->completions);
+ *denominator = period_2 + (global_count & counter_mask);
+ put_vcd(vcd);
+}
+
+/*
* Work out the current dirty-memory clamping and background writeout
* thresholds.
*
@@ -158,8 +489,8 @@ static unsigned long determine_dirtyable
}
static void
-get_dirty_limits(long *pbackground, long *pdirty,
- struct address_space *mapping)
+get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
+ struct backing_dev_info *bdi)
{
int background_ratio; /* Percentages */
int dirty_ratio;
@@ -193,6 +524,45 @@ get_dirty_limits(long *pbackground, long
}
*pbackground = background;
*pdirty = dirty;
+
+ if (bdi) {
+ long long bdi_dirty = dirty;
+ long numerator, denominator;
+
+ /*
+ * Calculate this BDI's share of the dirty ratio.
+ */
+ bdi_writeout_fraction(bdi, &numerator, &denominator);
+
+ bdi_dirty *= numerator;
+ do_div(bdi_dirty, denominator);
+
+ *pbdi_dirty = bdi_dirty;
+ }
+}
+
+/*
+ * Clip the earned share of dirty pages to that which is actually available.
+ * This avoids exceeding the total dirty_limit when the floating averages
+ * fluctuate too quickly.
+ */
+static void
+clip_bdi_dirty_limit(struct backing_dev_info *bdi, long dirty, long *pbdi_dirty)
+{
+ long avail_dirty;
+
+ avail_dirty = dirty -
+ (global_page_state(NR_FILE_DIRTY) +
+ global_page_state(NR_WRITEBACK) +
+ global_page_state(NR_UNSTABLE_NFS));
+
+ if (avail_dirty < 0)
+ avail_dirty = 0;
+
+ avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
+ bdi_stat(bdi, BDI_WRITEBACK);
+
+ *pbdi_dirty = min(*pbdi_dirty, avail_dirty);
}
/*
@@ -204,9 +574,11 @@ get_dirty_limits(long *pbackground, long
*/
static void balance_dirty_pages(struct address_space *mapping)
{
- long nr_reclaimable;
+ long bdi_nr_reclaimable;
+ long bdi_nr_writeback;
long background_thresh;
long dirty_thresh;
+ long bdi_thresh;
unsigned long pages_written = 0;
unsigned long write_chunk = sync_writeback_pages();
@@ -221,15 +593,16 @@ static void balance_dirty_pages(struct a
.range_cyclic = 1,
};
- get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
- nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS);
- if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
- dirty_thresh)
+ get_dirty_limits(&background_thresh, &dirty_thresh,
+ &bdi_thresh, bdi);
+ clip_bdi_dirty_limit(bdi, dirty_thresh, &bdi_thresh);
+ bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+ bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+ if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
break;
- if (!dirty_exceeded)
- dirty_exceeded = 1;
+ if (!bdi->dirty_exceeded)
+ bdi->dirty_exceeded = 1;
/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
* Unstable writes are a feature of certain networked
@@ -237,16 +610,37 @@ static void balance_dirty_pages(struct a
* written to the server's write cache, but has not yet
* been flushed to permanent storage.
*/
- if (nr_reclaimable) {
+ if (bdi_nr_reclaimable) {
writeback_inodes(&wbc);
- get_dirty_limits(&background_thresh,
- &dirty_thresh, mapping);
- nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS);
- if (nr_reclaimable +
- global_page_state(NR_WRITEBACK)
- <= dirty_thresh)
- break;
+
+ get_dirty_limits(&background_thresh, &dirty_thresh,
+ &bdi_thresh, bdi);
+ clip_bdi_dirty_limit(bdi, dirty_thresh, &bdi_thresh);
+
+ /*
+ * In order to avoid the stacked BDI deadlock we need
+ * to ensure we accurately count the 'dirty' pages
+ * when the threshold is low.
+ *
+ * Otherwise it would be possible to get thresh+n pages
+ * reported dirty, even though there are thresh-m pages
+ * actually dirty; with m+n sitting in the percpu deltas.
+ */
+ if (bdi_thresh < 2*bdi_stat_error(bdi)) {
+ bdi_nr_reclaimable =
+ bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+ bdi_nr_writeback =
+ bdi_stat_sum(bdi, BDI_WRITEBACK);
+ } else {
+ bdi_nr_reclaimable =
+ bdi_stat(bdi, BDI_RECLAIMABLE);
+ bdi_nr_writeback =
+ bdi_stat(bdi, BDI_WRITEBACK);
+ }
+
+ if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+ break;
+
pages_written += write_chunk - wbc.nr_to_write;
if (pages_written >= write_chunk)
break; /* We've done our duty */
@@ -254,9 +648,9 @@ static void balance_dirty_pages(struct a
congestion_wait(WRITE, HZ/10);
}
- if (nr_reclaimable + global_page_state(NR_WRITEBACK)
- <= dirty_thresh && dirty_exceeded)
- dirty_exceeded = 0;
+ if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
+ bdi->dirty_exceeded)
+ bdi->dirty_exceeded = 0;
if (writeback_in_progress(bdi))
return; /* pdflush is already working this queue */
@@ -270,7 +664,9 @@ static void balance_dirty_pages(struct a
* background_thresh, to keep the amount of dirty memory low.
*/
if ((laptop_mode && pages_written) ||
- (!laptop_mode && (nr_reclaimable > background_thresh)))
+ (!laptop_mode && (global_page_state(NR_FILE_DIRTY)
+ + global_page_state(NR_UNSTABLE_NFS)
+ > background_thresh)))
pdflush_operation(background_writeout, 0);
}
@@ -306,7 +702,7 @@ void balance_dirty_pages_ratelimited_nr(
unsigned long *p;
ratelimit = ratelimit_pages;
- if (dirty_exceeded)
+ if (mapping->backing_dev_info->dirty_exceeded)
ratelimit = 8;
/*
@@ -342,7 +738,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
}
for ( ; ; ) {
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+ get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
/*
* Boost the allowable dirty threshold a bit for page
@@ -377,7 +773,7 @@ static void background_writeout(unsigned
long background_thresh;
long dirty_thresh;
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+ get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
if (global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS) < background_thresh
&& min_pages <= 0)
@@ -479,11 +875,13 @@ int dirty_writeback_centisecs_handler(ct
struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
{
proc_dointvec_userhz_jiffies(table, write, file, buffer, length, ppos);
- if (dirty_writeback_interval) {
- mod_timer(&wb_timer,
- jiffies + dirty_writeback_interval);
+ if (write) {
+ if (dirty_writeback_interval) {
+ mod_timer(&wb_timer,
+ jiffies + dirty_writeback_interval);
} else {
- del_timer(&wb_timer);
+ del_timer(&wb_timer);
+ }
}
return 0;
}
@@ -585,6 +983,7 @@ void __init page_writeback_init(void)
mod_timer(&wb_timer, jiffies + dirty_writeback_interval);
writeback_set_ratelimit();
register_cpu_notifier(&ratelimit_nb);
+ vcd_init();
}
/**
@@ -988,8 +1387,10 @@ int test_clear_page_writeback(struct pag
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
- if (bdi_cap_writeback_dirty(bdi))
+ if (bdi_cap_writeback_dirty(bdi)) {
__dec_bdi_stat(bdi, BDI_WRITEBACK);
+ __bdi_writeout_inc(bdi);
+ }
}
write_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c 2007-05-10 10:41:23.000000000 +0200
+++ linux-2.6/mm/backing-dev.c 2007-05-10 10:49:39.000000000 +0200
@@ -11,6 +11,11 @@ void bdi_init(struct backing_dev_info *b
for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
percpu_counter_init(&bdi->bdi_stat[i], 0);
+
+ spin_lock_init(&bdi->lock);
+ bdi->period = 0;
+ bdi->dirty_exceeded = 0;
+
}
EXPORT_SYMBOL(bdi_init);
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c 2007-05-10 10:41:23.000000000 +0200
+++ linux-2.6/kernel/sysctl.c 2007-05-10 10:49:39.000000000 +0200
@@ -162,6 +162,9 @@ extern ctl_table inotify_table[];
int sysctl_legacy_va_layout;
#endif
+extern int dirty_ratio_handler(ctl_table *table, int write,
+ struct file *filp, void __user *buffer, size_t *lenp,
+ loff_t *ppos);
/* The default sysctl tables: */
@@ -685,7 +688,7 @@ static ctl_table vm_table[] = {
.data = &vm_dirty_ratio,
.maxlen = sizeof(vm_dirty_ratio),
.mode = 0644,
- .proc_handler = &proc_dointvec_minmax,
+ .proc_handler = &dirty_ratio_handler,
.strategy = &sysctl_intvec,
.extra1 = &zero,
.extra2 = &one_hundred,
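
For readers following the proportion arithmetic in the mm/page-writeback.c comment above, a small user-space sketch of the same recurrence may help. Plain unsigned longs stand in for the percpu counters, the shift, device names and the page counts are made-up example values, and none of this is part of the patch itself:

/* prop_sim.c -- user-space sketch of the floating proportion above */
#include <stdio.h>

#define SHIFT		10			/* period = 2^SHIFT completions */
#define PERIOD_2	(1UL << (SHIFT - 1))	/* the "period/2" step */

struct bdi_sim {
	const char *name;
	unsigned long completions;	/* x_{j}, lazily normalised */
	unsigned long period;		/* last seen global period (c_{j} * period/2) */
	int speed;			/* completions per simulation step */
};

static unsigned long events;		/* t: counter bits + period count */

/* catch up with missed period expirations: x_{j} /= 2^(c - c_{j}) */
static void bdi_norm(struct bdi_sim *bdi)
{
	unsigned long global_period = events & ~(PERIOD_2 - 1);

	while (bdi->period != global_period && bdi->completions) {
		bdi->completions -= (bdi->completions + 1) >> 1;
		bdi->period += PERIOD_2;
	}
	bdi->period = global_period;
}

/* ++x_{j}, ++t */
static void bdi_writeout_inc(struct bdi_sim *bdi)
{
	bdi_norm(bdi);
	bdi->completions++;
	events++;
}

/* p_{j} = x_{j} / (period/2 + t % period/2) */
static double bdi_fraction(struct bdi_sim *bdi)
{
	bdi_norm(bdi);
	return (double)bdi->completions /
		(PERIOD_2 + (events & (PERIOD_2 - 1)));
}

int main(void)
{
	struct bdi_sim bdi[2] = {
		{ "fast-disk", 0, 0, 2 },	/* writes out twice as fast */
		{ "slow-disk", 0, 0, 1 },
	};
	long dirty_thresh = 1200;		/* made-up global limit, in pages */
	int step, i, n;

	for (step = 0; step < 10000; step++)
		for (i = 0; i < 2; i++)
			for (n = 0; n < bdi[i].speed; n++)
				bdi_writeout_inc(&bdi[i]);

	for (i = 0; i < 2; i++)
		printf("%s: p = %.3f, bdi_thresh ~ %.0f pages\n",
		       bdi[i].name, bdi_fraction(&bdi[i]),
		       bdi_fraction(&bdi[i]) * dirty_thresh);
	return 0;
}

In the steady state the fast device settles around two thirds of the dirty
threshold and the slow one around one third, which is the split that
get_dirty_limits() relies on when it scales 'dirty' by numerator/denominator.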
--
* [PATCH 13/15] debug: sysfs files for the current ratio/size/total
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
` (11 preceding siblings ...)
2007-05-10 10:08 ` [PATCH 12/15] mm: per device dirty threshold Peter Zijlstra
@ 2007-05-10 10:08 ` Peter Zijlstra
2007-05-10 10:08 ` [PATCH 14/15] lib: abstract the floating proportion Peter Zijlstra
` (2 subsequent siblings)
15 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-10 10:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita, trond.myklebust, yingchao.zhou
[-- Attachment #1: bdi_stat_debug.patch --]
[-- Type: text/plain, Size: 3812 bytes --]
Expose the per bdi dirty limits in sysfs
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
block/ll_rw_blk.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
mm/page-writeback.c | 4 ++--
2 files changed, 52 insertions(+), 2 deletions(-)
Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c 2007-05-04 09:57:52.000000000 +0200
+++ linux-2.6/block/ll_rw_blk.c 2007-05-04 10:00:23.000000000 +0200
@@ -3998,6 +3998,38 @@ static ssize_t queue_nr_writeback_show(s
nr_writeback >> (PAGE_CACHE_SHIFT - 10));
}
+extern void bdi_writeout_fraction(struct backing_dev_info *bdi,
+ long *numerator, long *denominator);
+
+static ssize_t queue_nr_cache_ratio_show(struct request_queue *q, char *page)
+{
+ long scale, div;
+
+ bdi_writeout_fraction(&q->backing_dev_info, &scale, &div);
+ scale *= 1024;
+ scale /= div;
+
+ return sprintf(page, "%ld\n", scale);
+}
+
+extern void
+get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
+ struct backing_dev_info *bdi);
+
+static ssize_t queue_nr_cache_size_show(struct request_queue *q, char *page)
+{
+ long background, dirty, bdi_dirty;
+ get_dirty_limits(&background, &dirty, &bdi_dirty, &q->backing_dev_info);
+ return sprintf(page, "%ld\n", bdi_dirty);
+}
+
+static ssize_t queue_nr_cache_total_show(struct request_queue *q, char *page)
+{
+ long background, dirty, bdi_dirty;
+ get_dirty_limits(&background, &dirty, &bdi_dirty, &q->backing_dev_info);
+ return sprintf(page, "%ld\n", dirty);
+}
+
static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
.show = queue_requests_show,
@@ -4037,6 +4069,21 @@ static struct queue_sysfs_entry queue_wr
.show = queue_nr_writeback_show,
};
+static struct queue_sysfs_entry queue_cache_ratio_entry = {
+ .attr = {.name = "cache_ratio", .mode = S_IRUGO },
+ .show = queue_nr_cache_ratio_show,
+};
+
+static struct queue_sysfs_entry queue_cache_size_entry = {
+ .attr = {.name = "cache_size", .mode = S_IRUGO },
+ .show = queue_nr_cache_size_show,
+};
+
+static struct queue_sysfs_entry queue_cache_total_entry = {
+ .attr = {.name = "cache_total", .mode = S_IRUGO },
+ .show = queue_nr_cache_total_show,
+};
+
static struct queue_sysfs_entry queue_iosched_entry = {
.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
.show = elv_iosched_show,
@@ -4051,6 +4098,9 @@ static struct attribute *default_attrs[]
&queue_max_sectors_entry.attr,
&queue_reclaimable_entry.attr,
&queue_writeback_entry.attr,
+ &queue_cache_ratio_entry.attr,
+ &queue_cache_size_entry.attr,
+ &queue_cache_total_entry.attr,
&queue_iosched_entry.attr,
NULL,
};
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c 2007-05-04 09:58:04.000000000 +0200
+++ linux-2.6/mm/page-writeback.c 2007-05-04 10:00:37.000000000 +0200
@@ -402,7 +402,7 @@ static void __bdi_writeout_inc(struct ba
*
* p_{j} = x_{j} / (period/2 + t % period/2)
*/
-static void bdi_writeout_fraction(struct backing_dev_info *bdi,
+void bdi_writeout_fraction(struct backing_dev_info *bdi,
long *numerator, long *denominator)
{
struct vm_completions_data *vcd = get_vcd();
@@ -477,7 +477,7 @@ static unsigned long determine_dirtyable
return x + 1; /* Ensure that we never return 0 */
}
-static void
+void
get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
struct backing_dev_info *bdi)
{
--
* [PATCH 14/15] lib: abstract the floating proportion
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
` (12 preceding siblings ...)
2007-05-10 10:08 ` [PATCH 13/15] debug: sysfs files for the current ratio/size/total Peter Zijlstra
@ 2007-05-10 10:08 ` Peter Zijlstra
2007-05-10 10:08 ` [PATCH 15/15] mm: dirty balancing for tasks Peter Zijlstra
2007-05-15 4:48 ` [PATCH 00/15] per device dirty throttling -v6 Neil Brown
15 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-10 10:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita, trond.myklebust, yingchao.zhou
[-- Attachment #1: proportions.patch --]
[-- Type: text/plain, Size: 20283 bytes --]
pull out the floating proportion stuff and make it a lib
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/proportions.h | 82 ++++++++++++
lib/Makefile | 2
lib/proportions.c | 259 +++++++++++++++++++++++++++++++++++++++
mm/backing-dev.c | 5
mm/page-writeback.c | 290 ++------------------------------------------
5 files changed, 364 insertions(+), 274 deletions(-)
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c 2007-05-10 10:52:24.000000000 +0200
+++ linux-2.6/mm/page-writeback.c 2007-05-10 11:06:12.000000000 +0200
@@ -116,93 +116,8 @@ static void background_writeout(unsigned
* because demand can/will vary over time. The length of this period itself is
* measured in page writeback completions.
*
- * DETAILS:
- *
- * The floating proportion is a time derivative with an exponentially decaying
- * history:
- *
- * p_{j} = \Sum_{i=0} (dx_{j}/dt_{-i}) / 2^(1+i)
- *
- * Where j is an element from {BDIs}, x_{j} is j's number of completions, and i
- * the time period over which the differential is taken. So d/dt_{-i} is the
- * differential over the i-th last period.
- *
- * The decaying history gives smooth transitions. The time differential carries
- * the notion of speed.
- *
- * The denominator is 2^(1+i) because we want the series to be normalised, ie.
- *
- * \Sum_{i=0} 1/2^(1+i) = 1
- *
- * Further more, if we measure time (t) in the same events as x; so that:
- *
- * t = \Sum_{j} x_{j}
- *
- * we get that:
- *
- * \Sum_{j} p_{j} = 1
- *
- * Writing this in an iterative fashion we get (dropping the 'd's):
- *
- * if (++x_{j}, ++t > period)
- * t /= 2;
- * for_each (j)
- * x_{j} /= 2;
- *
- * so that:
- *
- * p_{j} = x_{j} / t;
- *
- * We optimize away the '/= 2' for the global time delta by noting that:
- *
- * if (++t > period) t /= 2:
- *
- * Can be approximated by:
- *
- * period/2 + (++t % period/2)
- *
- * [ Furthermore, when we choose period to be 2^n it can be written in terms of
- * binary operations and wraparound artefacts disappear. ]
- *
- * Also note that this yields a natural counter of the elapsed periods:
- *
- * c = t / (period/2)
- *
- * [ Its monotonic increasing property can be applied to mitigate the wrap-
- * around issue. ]
- *
- * This allows us to do away with the loop over all BDIs on each period
- * expiration. By remembering the period count under which it was last
- * accessed as c_{j}, we can obtain the number of 'missed' cycles from:
- *
- * c - c_{j}
- *
- * We can then lazily catch up to the global period count every time we are
- * going to use x_{j}, by doing:
- *
- * x_{j} /= 2^(c - c_{j}), c_{j} = c
*/
-
-struct vm_completions_data {
- /*
- * The period over which we differentiate (in pages)
- *
- * period = 2^shift
- */
- int shift;
-
- /*
- * The total page writeback completion counter aka 'time'.
- *
- * Treated as an unsigned long; the lower 'shift - 1' bits are the
- * counter bits, the remaining upper bits the period counter.
- */
- struct percpu_counter completions;
-};
-
-static int vm_completions_index;
-static struct vm_completions_data vm_completions[2];
-static DEFINE_MUTEX(vm_completions_mutex);
+struct prop_descriptor vm_completions;
static unsigned long determine_dirtyable_memory(void);
@@ -219,85 +134,6 @@ static int calc_period_shift(void)
return 2 + ilog2(dirty_total - 1);
}
-static void vcd_init(void)
-{
- vm_completions[0].shift = calc_period_shift();
- percpu_counter_init(&vm_completions[0].completions, 0);
- percpu_counter_init(&vm_completions[1].completions, 0);
-}
-
-/*
- * We have two copies, and flip between them to make it seem like an atomic
- * update. The update is not really atomic wrt the completions counter, but
- * it is internally consistent with the bit layout depending on shift.
- *
- * We calculate the new shift, copy the completions count, move the bits around
- * and flip the index.
- */
-static void vcd_flip(void)
-{
- int index;
- int shift;
- int offset;
- u64 completions;
- unsigned long flags;
-
- mutex_lock(&vm_completions_mutex);
-
- index = vm_completions_index ^ 1;
- shift = calc_period_shift();
- offset = vm_completions[vm_completions_index].shift - shift;
- if (!offset)
- goto out;
-
- vm_completions[index].shift = shift;
-
- local_irq_save(flags);
- completions = percpu_counter_sum_signed(
- &vm_completions[vm_completions_index].completions);
-
- if (offset < 0)
- completions <<= -offset;
- else
- completions >>= offset;
-
- percpu_counter_set(&vm_completions[index].completions, completions);
-
- /*
- * ensure the new vcd is fully written before the switch
- */
- smp_wmb();
- vm_completions_index = index;
- local_irq_restore(flags);
-
- synchronize_rcu();
-
-out:
- mutex_unlock(&vm_completions_mutex);
-}
-
-/*
- * wrap the access to the data in an rcu_read_lock() section;
- * this is used to track the active references.
- */
-static struct vm_completions_data *get_vcd(void)
-{
- int index;
-
- rcu_read_lock();
- index = vm_completions_index;
- /*
- * match the wmb from vcd_flip()
- */
- smp_rmb();
- return &vm_completions[index];
-}
-
-static void put_vcd(struct vm_completions_data *vcd)
-{
- rcu_read_unlock();
-}
-
/*
* update the period when the dirty ratio changes.
*/
@@ -307,130 +143,38 @@ int dirty_ratio_handler(ctl_table *table
{
int old_ratio = vm_dirty_ratio;
int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos);
- if (ret == 0 && write && vm_dirty_ratio != old_ratio)
- vcd_flip();
- return ret;
-}
-
-/*
- * adjust the bdi local data to changes in the bit layout.
- */
-static void bdi_adjust_shift(struct backing_dev_info *bdi, int shift)
-{
- int offset = bdi->shift - shift;
-
- if (!offset)
- return;
-
- if (offset < 0)
- bdi->period <<= -offset;
- else
- bdi->period >>= offset;
-
- bdi->shift = shift;
-}
-
-/*
- * Catch up with missed period expirations.
- *
- * until (c_{j} == c)
- * x_{j} -= x_{j}/2;
- * c_{j}++;
- */
-static void bdi_writeout_norm(struct backing_dev_info *bdi,
- struct vm_completions_data *vcd)
-{
- unsigned long period = 1UL << (vcd->shift - 1);
- unsigned long period_mask = ~(period - 1);
- unsigned long global_period;
- unsigned long flags;
-
- global_period = percpu_counter_read(&vcd->completions);
- global_period &= period_mask;
-
- /*
- * Fast path - check if the local and global period count still match
- * outside of the lock.
- */
- if (bdi->period == global_period)
- return;
-
- spin_lock_irqsave(&bdi->lock, flags);
- bdi_adjust_shift(bdi, vcd->shift);
- /*
- * For each missed period, we half the local counter.
- * basically:
- * bdi_stat(bdi, BDI_COMPLETION) >> (global_period - bdi->period);
- *
- * but since the distributed nature of percpu counters make division
- * rather hard, use a regular subtraction loop. This is safe, because
- * BDI_COMPLETION will only every be incremented, hence the subtraction
- * can never result in a negative number.
- */
- while (bdi->period != global_period) {
- unsigned long val = bdi_stat_unsigned(bdi, BDI_COMPLETION);
- unsigned long half = (val + 1) >> 1;
-
- /*
- * Half of zero won't be much less, break out.
- * This limits the loop to shift iterations, even
- * if we missed a million.
- */
- if (!val)
- break;
-
- /*
- * Iff shift >32 half might exceed the limits of
- * the regular percpu_counter_mod.
- */
- __mod_bdi_stat64(bdi, BDI_COMPLETION, -half);
- bdi->period += period;
+ if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
+ int shift = calc_period_shift();
+ prop_change_shift(&vm_completions, shift);
}
- bdi->period = global_period;
- bdi->shift = vcd->shift;
- spin_unlock_irqrestore(&bdi->lock, flags);
+ return ret;
}
/*
* Increment the BDI's writeout completion count and the global writeout
* completion count. Called from test_clear_page_writeback().
- *
- * ++x_{j}, ++t
*/
static void __bdi_writeout_inc(struct backing_dev_info *bdi)
{
- struct vm_completions_data *vcd = get_vcd();
- /* Catch up with missed period expirations before using the counter. */
- bdi_writeout_norm(bdi, vcd);
- __inc_bdi_stat(bdi, BDI_COMPLETION);
-
- percpu_counter_mod(&vcd->completions, 1);
- put_vcd(vcd);
+ struct prop_global *pg = prop_get_global(&vm_completions);
+ __prop_inc(pg, &bdi->completions);
+ prop_put_global(&vm_completions, pg);
}
/*
* Obtain an accurate fraction of the BDI's portion.
- *
- * p_{j} = x_{j} / (period/2 + t % period/2)
*/
void bdi_writeout_fraction(struct backing_dev_info *bdi,
long *numerator, long *denominator)
{
- struct vm_completions_data *vcd = get_vcd();
- unsigned long period_2 = 1UL << (vcd->shift - 1);
- unsigned long counter_mask = period_2 - 1;
- unsigned long global_count;
-
if (bdi_cap_writeback_dirty(bdi)) {
- /* Catch up with the period expirations before use. */
- bdi_writeout_norm(bdi, vcd);
- *numerator = bdi_stat(bdi, BDI_COMPLETION);
- } else
+ struct prop_global *pg = prop_get_global(&vm_completions);
+ prop_fraction(pg, &bdi->completions, numerator, denominator);
+ prop_put_global(&vm_completions, pg);
+ } else {
*numerator = 0;
-
- global_count = percpu_counter_read(&vcd->completions);
- *denominator = period_2 + (global_count & counter_mask);
- put_vcd(vcd);
+ *denominator = 1;
+ }
}
/*
@@ -980,10 +724,14 @@ static struct notifier_block __cpuinitda
*/
void __init page_writeback_init(void)
{
+ int shift;
+
mod_timer(&wb_timer, jiffies + dirty_writeback_interval);
writeback_set_ratelimit();
register_cpu_notifier(&ratelimit_nb);
- vcd_init();
+
+ shift = calc_period_shift();
+ prop_descriptor_init(&vm_completions, shift);
}
/**
Index: linux-2.6/lib/proportions.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/lib/proportions.c 2007-05-10 11:03:01.000000000 +0200
@@ -0,0 +1,259 @@
+/*
+ * Floating proportions
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * DETAILS:
+ *
+ * The floating proportion is a time derivative with an exponentially decaying
+ * history:
+ *
+ * p_{j} = \Sum_{i=0} (dx_{j}/dt_{-i}) / 2^(1+i)
+ *
+ * Where j is an element from {BDIs}, x_{j} is j's number of completions, and i
+ * the time period over which the differential is taken. So d/dt_{-i} is the
+ * differential over the i-th last period.
+ *
+ * The decaying history gives smooth transitions. The time differential carries
+ * the notion of speed.
+ *
+ * The denominator is 2^(1+i) because we want the series to be normalised, ie.
+ *
+ * \Sum_{i=0} 1/2^(1+i) = 1
+ *
+ * Further more, if we measure time (t) in the same events as x; so that:
+ *
+ * t = \Sum_{j} x_{j}
+ *
+ * we get that:
+ *
+ * \Sum_{j} p_{j} = 1
+ *
+ * Writing this in an iterative fashion we get (dropping the 'd's):
+ *
+ * if (++x_{j}, ++t > period)
+ * t /= 2;
+ * for_each (j)
+ * x_{j} /= 2;
+ *
+ * so that:
+ *
+ * p_{j} = x_{j} / t;
+ *
+ * We optimize away the '/= 2' for the global time delta by noting that:
+ *
+ * if (++t > period) t /= 2:
+ *
+ * Can be approximated by:
+ *
+ * period/2 + (++t % period/2)
+ *
+ * [ Furthermore, when we choose period to be 2^n it can be written in terms of
+ * binary operations and wraparound artefacts disappear. ]
+ *
+ * Also note that this yields a natural counter of the elapsed periods:
+ *
+ * c = t / (period/2)
+ *
+ * [ Its monotonic increasing property can be applied to mitigate the wrap-
+ * around issue. ]
+ *
+ * This allows us to do away with the loop over all BDIs on each period
+ * expiration. By remembering the period count under which it was last
+ * accessed as c_{j}, we can obtain the number of 'missed' cycles from:
+ *
+ * c - c_{j}
+ *
+ * We can then lazily catch up to the global period count every time we are
+ * going to use x_{j}, by doing:
+ *
+ * x_{j} /= 2^(c - c_{j}), c_{j} = c
+ */
+
+#include <linux/proportions.h>
+#include <linux/rcupdate.h>
+
+void prop_descriptor_init(struct prop_descriptor *pd, int shift)
+{
+ pd->index = 0;
+ pd->pg[0].shift = shift;
+ percpu_counter_init(&pd->pg[0].events, 0);
+ percpu_counter_init(&pd->pg[1].events, 0);
+ mutex_init(&pd->mutex);
+}
+
+/*
+ * We have two copies, and flip between them to make it seem like an atomic
+ * update. The update is not really atomic wrt the events counter, but
+ * it is internally consistent with the bit layout depending on shift.
+ *
+ * We copy the events count, move the bits around and flip the index.
+ */
+void prop_change_shift(struct prop_descriptor *pd, int shift)
+{
+ int index;
+ int offset;
+ u64 events;
+ unsigned long flags;
+
+ mutex_lock(&pd->mutex);
+
+ index = pd->index ^ 1;
+ offset = pd->pg[pd->index].shift - shift;
+ if (!offset)
+ goto out;
+
+ pd->pg[index].shift = shift;
+
+ local_irq_save(flags);
+ events = percpu_counter_sum_signed(
+ &pd->pg[pd->index].events);
+ if (offset < 0)
+ events <<= -offset;
+ else
+ events >>= offset;
+ percpu_counter_set(&pd->pg[index].events, events);
+
+ /*
+ * ensure the new pg is fully written before the switch
+ */
+ smp_wmb();
+ pd->index = index;
+ local_irq_restore(flags);
+
+ synchronize_rcu();
+
+out:
+ mutex_unlock(&pd->mutex);
+}
+
+/*
+ * wrap the access to the data in an rcu_read_lock() section;
+ * this is used to track the active references.
+ */
+struct prop_global *prop_get_global(struct prop_descriptor *pd)
+{
+ int index;
+
+ rcu_read_lock();
+ index = pd->index;
+ /*
+ * match the wmb from vcd_flip()
+ */
+ smp_rmb();
+ return &pd->pg[index];
+}
+
+void prop_put_global(struct prop_descriptor *pd, struct prop_global *pg)
+{
+ rcu_read_unlock();
+}
+
+static void prop_adjust_shift(struct prop_local *pl, int new_shift)
+{
+ int offset = pl->shift - new_shift;
+
+ if (!offset)
+ return;
+
+ if (offset < 0)
+ pl->period <<= -offset;
+ else
+ pl->period >>= offset;
+
+ pl->shift = new_shift;
+}
+
+void prop_local_init(struct prop_local *pl)
+{
+ spin_lock_init(&pl->lock);
+ pl->shift = 0;
+ pl->period = 0;
+ percpu_counter_init(&pl->events, 0);
+}
+
+void prop_local_destroy(struct prop_local *pl)
+{
+ percpu_counter_destroy(&pl->events);
+}
+
+/*
+ * Catch up with missed period expirations.
+ *
+ * until (c_{j} == c)
+ * x_{j} -= x_{j}/2;
+ * c_{j}++;
+ */
+void prop_norm(struct prop_global *pg,
+ struct prop_local *pl)
+{
+ unsigned long period = 1UL << (pg->shift - 1);
+ unsigned long period_mask = ~(period - 1);
+ unsigned long global_period;
+ unsigned long flags;
+
+ global_period = percpu_counter_read(&pg->events);
+ global_period &= period_mask;
+
+ /*
+ * Fast path - check if the local and global period count still match
+ * outside of the lock.
+ */
+ if (pl->period == global_period)
+ return;
+
+ spin_lock_irqsave(&pl->lock, flags);
+ prop_adjust_shift(pl, pg->shift);
+ /*
+ * For each missed period, we half the local counter.
+ * basically:
+ * pl->events >> (global_period - pl->period);
+ *
+ * but since the distributed nature of percpu counters make division
+ * rather hard, use a regular subtraction loop. This is safe, because
+ * the events will only ever be incremented, hence the subtraction
+ * can never result in a negative number.
+ */
+ while (pl->period != global_period) {
+ unsigned long val = percpu_counter_read(&pl->events);
+ unsigned long half = (val + 1) >> 1;
+
+ /*
+ * Half of zero won't be much less, break out.
+ * This limits the loop to shift iterations, even
+ * if we missed a million.
+ */
+ if (!val)
+ break;
+
+ /*
+ * Iff shift >32 half might exceed the limits of
+ * the regular percpu_counter_mod.
+ */
+ percpu_counter_mod64(&pl->events, -half);
+ pl->period += period;
+ }
+ pl->period = global_period;
+ spin_unlock_irqrestore(&pl->lock, flags);
+}
+
+/*
+ * Obtain a fraction of this proportion
+ *
+ * p_{j} = x_{j} / (period/2 + t % period/2)
+ */
+void prop_fraction(struct prop_global *pg, struct prop_local *pl,
+ long *numerator, long *denominator)
+{
+ unsigned long period_2 = 1UL << (pg->shift - 1);
+ unsigned long counter_mask = period_2 - 1;
+ unsigned long global_count;
+
+ prop_norm(pg, pl);
+ *numerator = percpu_counter_read(&pl->events);
+
+ global_count = percpu_counter_read(&pg->events);
+ *denominator = period_2 + (global_count & counter_mask);
+}
+
+
Index: linux-2.6/include/linux/proportions.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/proportions.h 2007-05-10 11:03:01.000000000 +0200
@@ -0,0 +1,82 @@
+/*
+ * Floating proportions
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+
+#ifndef _LINUX_PROPORTIONS_H
+#define _LINUX_PROPORTIONS_H
+
+#include <linux/percpu_counter.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+
+struct prop_global {
+ /*
+ * The period over which we differentiate (in pages)
+ *
+ * period = 2^shift
+ */
+ int shift;
+ /*
+ * The total page writeback completion counter aka 'time'.
+ *
+ * Treated as an unsigned long; the lower 'shift - 1' bits are the
+ * counter bits, the remaining upper bits the period counter.
+ */
+ struct percpu_counter events;
+};
+
+/*
+ * global property descriptor
+ *
+ * this is needed to consistently flip prop_global structures.
+ */
+struct prop_descriptor {
+ int index;
+ struct prop_global pg[2];
+ struct mutex mutex;
+};
+
+void prop_descriptor_init(struct prop_descriptor *pd, int shift);
+void prop_change_shift(struct prop_descriptor *pd, int new_shift);
+struct prop_global *prop_get_global(struct prop_descriptor *pd);
+void prop_put_global(struct prop_descriptor *pd, struct prop_global *pg);
+
+struct prop_local {
+ /*
+ * the local events counter
+ */
+ struct percpu_counter events;
+
+ /*
+ * snapshot of the last seen global state
+ * and a lock protecting this state
+ */
+ int shift;
+ unsigned long period;
+ spinlock_t lock;
+};
+
+void prop_local_init(struct prop_local *pl);
+void prop_local_destroy(struct prop_local *pl);
+
+void prop_norm(struct prop_global *pg, struct prop_local *pl);
+
+/*
+ * ++x_{j}, ++t
+ */
+static inline
+void __prop_inc(struct prop_global *pg, struct prop_local *pl)
+{
+ prop_norm(pg, pl);
+ percpu_counter_mod(&pl->events, 1);
+ percpu_counter_mod(&pg->events, 1);
+}
+
+void prop_fraction(struct prop_global *pg, struct prop_local *pl,
+ long *numerator, long *denominator);
+
+#endif /* _LINUX_PROPORTIONS_H */
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c 2007-05-10 10:52:24.000000000 +0200
+++ linux-2.6/mm/backing-dev.c 2007-05-10 11:03:01.000000000 +0200
@@ -12,9 +12,8 @@ void bdi_init(struct backing_dev_info *b
for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
percpu_counter_init(&bdi->bdi_stat[i], 0);
- spin_lock_init(&bdi->lock);
- bdi->period = 0;
bdi->dirty_exceeded = 0;
+ prop_local_init(&bdi->completions);
}
EXPORT_SYMBOL(bdi_init);
@@ -25,6 +24,8 @@ void bdi_destroy(struct backing_dev_info
for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
percpu_counter_destroy(&bdi->bdi_stat[i]);
+
+ prop_local_destroy(&bdi->completions);
}
EXPORT_SYMBOL(bdi_destroy);
Index: linux-2.6/lib/Makefile
===================================================================
--- linux-2.6.orig/lib/Makefile 2007-05-10 10:51:35.000000000 +0200
+++ linux-2.6/lib/Makefile 2007-05-10 11:03:01.000000000 +0200
@@ -5,7 +5,7 @@
lib-y := ctype.o string.o vsprintf.o cmdline.o \
rbtree.o radix-tree.o dump_stack.o \
idr.o int_sqrt.o bitmap.o extable.o prio_tree.o \
- sha1.o irq_regs.o reciprocal_div.o
+ sha1.o irq_regs.o reciprocal_div.o proportions.o
lib-$(CONFIG_MMU) += ioremap.o pagewalk.o
lib-$(CONFIG_SMP) += cpumask.o
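
To see how a new user would plug into this library, here is an outline of the call pattern, modelled on the bdi usage above. The my_* names are invented for the example, and it only builds inside a tree with this series applied:

#include <linux/proportions.h>

static struct prop_descriptor my_events;	/* one global descriptor */

struct my_object {
	struct prop_local local;		/* one local per competitor */
};

/* once, at subsystem init: period = 2^shift events */
static void my_global_init(int shift)
{
	prop_descriptor_init(&my_events, shift);
}

static void my_object_init(struct my_object *obj)
{
	prop_local_init(&obj->local);
}

/* account one event to @obj: ++x_{j}, ++t */
static void my_event(struct my_object *obj)
{
	struct prop_global *pg = prop_get_global(&my_events);

	/* the in-tree callers run this with interrupts disabled */
	__prop_inc(pg, &obj->local);
	prop_put_global(&my_events, pg);
}

/* what share of the recent events does @obj own? */
static void my_share(struct my_object *obj, long *num, long *den)
{
	struct prop_global *pg = prop_get_global(&my_events);

	prop_fraction(pg, &obj->local, num, den);
	prop_put_global(&my_events, pg);
}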
--
* [PATCH 15/15] mm: dirty balancing for tasks
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
` (13 preceding siblings ...)
2007-05-10 10:08 ` [PATCH 14/15] lib: abstract the floating proportion Peter Zijlstra
@ 2007-05-10 10:08 ` Peter Zijlstra
2007-05-15 4:48 ` [PATCH 00/15] per device dirty throttling -v6 Neil Brown
15 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-10 10:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
nikita, trond.myklebust, yingchao.zhou
[-- Attachment #1: dirty_pages2.patch --]
[-- Type: text/plain, Size: 6206 bytes --]
Based on ideas of Andrew:
http://marc.info/?l=linux-kernel&m=102912915020543&w=2
Scale the bdi dirty limit inversely with the task's dirty rate.
This makes heavy writers have a lower dirty limit than the occasional writer.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/sched.h | 2 +
kernel/exit.c | 1
kernel/fork.c | 1
mm/page-writeback.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 63 insertions(+), 1 deletion(-)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h 2007-05-10 11:06:12.000000000 +0200
+++ linux-2.6/include/linux/sched.h 2007-05-10 11:48:20.000000000 +0200
@@ -83,6 +83,7 @@ struct sched_param {
#include <linux/timer.h>
#include <linux/hrtimer.h>
#include <linux/task_io_accounting.h>
+#include <linux/proportions.h>
#include <asm/processor.h>
@@ -1092,6 +1093,7 @@ struct task_struct {
#ifdef CONFIG_FAULT_INJECTION
int make_it_fail;
#endif
+ struct prop_local dirties;
};
static inline pid_t process_group(struct task_struct *tsk)
Index: linux-2.6/kernel/exit.c
===================================================================
--- linux-2.6.orig/kernel/exit.c 2007-05-10 11:06:12.000000000 +0200
+++ linux-2.6/kernel/exit.c 2007-05-10 11:48:20.000000000 +0200
@@ -160,6 +160,7 @@ repeat:
ptrace_unlink(p);
BUG_ON(!list_empty(&p->ptrace_list) || !list_empty(&p->ptrace_children));
__exit_signal(p);
+ prop_local_destroy(&p->dirties);
/*
* If we are the last non-leader member of the thread
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2007-05-10 11:06:12.000000000 +0200
+++ linux-2.6/kernel/fork.c 2007-05-10 11:48:20.000000000 +0200
@@ -191,6 +191,7 @@ static struct task_struct *dup_task_stru
tsk->btrace_seq = 0;
#endif
tsk->splice_pipe = NULL;
+ prop_local_init(&tsk->dirties);
return tsk;
}
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c 2007-05-10 11:06:12.000000000 +0200
+++ linux-2.6/mm/page-writeback.c 2007-05-10 11:54:56.000000000 +0200
@@ -118,6 +118,7 @@ static void background_writeout(unsigned
*
*/
struct prop_descriptor vm_completions;
+struct prop_descriptor vm_dirties;
static unsigned long determine_dirtyable_memory(void);
@@ -146,6 +147,7 @@ int dirty_ratio_handler(ctl_table *table
if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
int shift = calc_period_shift();
prop_change_shift(&vm_completions, shift);
+ prop_change_shift(&vm_dirties, shift);
}
return ret;
}
@@ -161,6 +163,16 @@ static void __bdi_writeout_inc(struct ba
prop_put_global(&vm_completions, pg);
}
+static void task_dirty_inc(struct task_struct *tsk)
+{
+ unsigned long flags;
+ struct prop_global *pg = prop_get_global(&vm_dirties);
+ local_irq_save(flags);
+ __prop_inc(pg, &tsk->dirties);
+ local_irq_restore(flags);
+ prop_put_global(&vm_dirties, pg);
+}
+
/*
* Obtain an accurate fraction of the BDI's portion.
*/
@@ -177,6 +189,14 @@ void bdi_writeout_fraction(struct backin
}
}
+void task_dirties_fraction(struct task_struct *tsk,
+ long *numerator, long *denominator)
+{
+ struct prop_global *pg = prop_get_global(&vm_dirties);
+ prop_fraction(pg, &tsk->dirties, numerator, denominator);
+ prop_put_global(&vm_dirties, pg);
+}
+
/*
* Work out the current dirty-memory clamping and background writeout
* thresholds.
@@ -310,6 +330,33 @@ clip_bdi_dirty_limit(struct backing_dev_
}
/*
+ * scale the dirty limit
+ *
+ * task specific dirty limit:
+ *
+ * dirty -= (dirty/2) * p_{t}
+ */
+void task_dirty_limit(struct task_struct *tsk, long *pdirty)
+{
+ long numerator, denominator;
+ long dirty = *pdirty;
+ long long inv = dirty >> 1;
+
+ task_dirties_fraction(tsk, &numerator, &denominator);
+ inv *= numerator;
+ do_div(inv, denominator);
+
+ dirty -= inv;
+ if (dirty < *pdirty/2) {
+ printk("odd limit: %ld %ld %ld\n",
+ *pdirty, dirty, (long)inv);
+ dirty = *pdirty/2;
+ }
+
+ *pdirty = dirty;
+}
+
+/*
* balance_dirty_pages() must be called by processes which are generating dirty
* data. It looks at the number of dirty pages in the machine and will force
* the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -340,6 +387,7 @@ static void balance_dirty_pages(struct a
get_dirty_limits(&background_thresh, &dirty_thresh,
&bdi_thresh, bdi);
clip_bdi_dirty_limit(bdi, dirty_thresh, &bdi_thresh);
+ task_dirty_limit(current, &bdi_thresh);
bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
@@ -360,6 +408,7 @@ static void balance_dirty_pages(struct a
get_dirty_limits(&background_thresh, &dirty_thresh,
&bdi_thresh, bdi);
clip_bdi_dirty_limit(bdi, dirty_thresh, &bdi_thresh);
+ task_dirty_limit(current, &bdi_thresh);
/*
* In order to avoid the stacked BDI deadlock we need
@@ -732,6 +781,7 @@ void __init page_writeback_init(void)
shift = calc_period_shift();
prop_descriptor_init(&vm_completions, shift);
+ prop_descriptor_init(&vm_dirties, shift);
}
/**
@@ -1009,7 +1059,7 @@ EXPORT_SYMBOL(redirty_page_for_writepage
* If the mapping doesn't provide a set_page_dirty a_op, then
* just fall through and assume that it wants buffer_heads.
*/
-int fastcall set_page_dirty(struct page *page)
+static int __set_page_dirty(struct page *page)
{
struct address_space *mapping = page_mapping(page);
@@ -1027,6 +1077,14 @@ int fastcall set_page_dirty(struct page
}
return 0;
}
+
+int fastcall set_page_dirty(struct page *page)
+{
+ int ret = __set_page_dirty(page);
+ if (ret)
+ task_dirty_inc(current);
+ return ret;
+}
EXPORT_SYMBOL(set_page_dirty);
/*
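
The task_dirty_limit() arithmetic is easy to play with outside the kernel. A user-space sketch with made-up numbers (a 900-page bdi threshold, one task owning 90% of the recent dirtyings and one owning 10%), mirroring the dirty -= (dirty/2) * p_{t} step and its dirty/2 floor; it is an illustration, not part of the patch:

/* task_limit.c -- sketch of the per-task limit scaling above */
#include <stdio.h>

static long task_dirty_limit(long dirty, long numerator, long denominator)
{
	long long inv = dirty >> 1;		/* dirty/2 */

	inv = inv * numerator / denominator;	/* do_div() in the kernel */
	if (dirty - inv < dirty / 2)		/* the patch clamps (and warns) here */
		return dirty / 2;
	return dirty - (long)inv;
}

int main(void)
{
	long bdi_thresh = 900;	/* example per-bdi threshold, in pages */

	printf("heavy writer (p=9/10): throttles at %ld pages\n",
	       task_dirty_limit(bdi_thresh, 9, 10));
	printf("light writer (p=1/10): throttles at %ld pages\n",
	       task_dirty_limit(bdi_thresh, 1, 10));
	return 0;
}

With these numbers the aggressive dirtier hits the wall at 495 of the 900
pages while the occasional one keeps 855, which is the differentiation the
patch description aims for.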
--
* Re: [PATCH 00/15] per device dirty throttling -v6
2007-05-10 10:08 [PATCH 00/15] per device dirty throttling -v6 Peter Zijlstra
` (14 preceding siblings ...)
2007-05-10 10:08 ` [PATCH 15/15] mm: dirty balancing for tasks Peter Zijlstra
@ 2007-05-15 4:48 ` Neil Brown
2007-05-15 7:44 ` Peter Zijlstra
15 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2007-05-15 4:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-mm, linux-kernel, miklos, akpm, dgc, tomoki.sekiyama.qu,
nikita, trond.myklebust, yingchao.zhou
On Thursday May 10, a.p.zijlstra@chello.nl wrote:
> The latest version of the per device dirty throttling patches.
>
> I put in quite a few comments, and added an patch to do per task dirty
> throttling as well, for RFCs sake :-)
>
> I haven't yet come around to do anything but integrety testing on this code
> base, ie. it built a kernel. I hope to do more tests shorty if time permits...
>
> Perhaps the people on bugzilla.kernel.org #7372 might be willing to help out
> there.
>
> Oh, patches are against 2.6.21-mm2
>
> --
Patch 12 has:
+#include <linux/proportions.h>
But that file isn't added until patch 14.
Splitting the "proportions" stuff out into lib/ is a good idea.
You have left some remnants of its origin though, with mentions of
BDI
pages
total page writeback
The "proportions" library always uses a percpu counter, which is
perfect for the per-bdi counter, but seems wrong when you use the same
code for per-task throttling. Having a percpu counter in struct task
seems very wasteful. You don't need to lock the access to this
counter as it is only ever accessed via current-> so a simple "long"
(or "long long") would do. The global "vm_dirties" still needs to be
percpu.... I'm not sure what is best to do about this.
The per-task throttling is interesting.
You reduce the point where a task has to throttle by up to half, based
on the fraction of recently dirtied pages that the task is responsible
for.
So if there is one writer, it now gets only half the space that it
used to. That is probably OK, we can just increase the space
available...
If there are two equally eager writers, they can both use up to the
75% mark, so they probably each get 37%, which is reasonable.
If there is one fast and one slow writer where the slow writer is
generating dirty pages well below the writeout rate of the device, the
fast writer will throttle at around 50% and the slow writer will never
block. That is nice.
If you have two writers A and B writing aggressively to two devices X
and Y with different speeds, say X twice the speed of Y, then in the
steady state, X gets 2/3 of the space and Y gets 1/3.
A will dirty twice the pages that B dirties so A will get to use
1 - (2/3)/2 == 2/3 of that space or 4/9, and B will get to use 1 - (1/3)/2 ==
5/6 of that space or 5/18. Did I get that right?
So they will each reduce the space available to the other, even though
they aren't really competing. That might not be a problem, but it is
interesting...
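A trivial standalone check of those numbers, using the 1 - p/2 factor
from patch 15 on top of the bdi shares from patch 12 (the shares and
fractions are the ones assumed above):

#include <stdio.h>

int main(void)
{
	double x = 2.0/3, y = 1.0/3;		/* bdi shares: X twice as fast as Y */
	double p_a = 2.0/3, p_b = 1.0/3;	/* task dirty fractions of A and B */

	printf("A on X: %.4f (4/9  = %.4f)\n", x * (1 - p_a/2), 4.0/9);
	printf("B on Y: %.4f (5/18 = %.4f)\n", y * (1 - p_b/2), 5.0/18);
	return 0;
}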
It seems that the 'one half' is fairly arbitrary. It could equally
well be 3/4. That would simply mean there is less differentiation
between the more and less aggressive writer. I would probably lean
towards a higher number like 3/4. It should still give reasonable
differentiation without cutting max amount of dirty memory in half for
the common 1-writer case.
A couple of years ago Andrea Arcangeli wrote a patch that did per-task
throttling, which is worth comparing with.
http://lwn.net/Articles/152277/
It takes each task separately, measures the rate of dirtying over a fixed
time period, and throttles when that rate would put the system over the
limit soon. Thus slower dirtiers throttle later.
Having to configure the fixed number (the period) is always awkward,
and I think your floating average is better suited for the task.
I doubt if Andrea's patch still applies so a direct comparison might
be awkward, but it might not hurt to read through it if you haven't
already.
NeilBrown
* Re: [PATCH 00/15] per device dirty throttling -v6
2007-05-15 4:48 ` [PATCH 00/15] per device dirty throttling -v6 Neil Brown
@ 2007-05-15 7:44 ` Peter Zijlstra
0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2007-05-15 7:44 UTC (permalink / raw)
To: Neil Brown
Cc: linux-mm, linux-kernel, miklos, akpm, dgc, tomoki.sekiyama.qu,
nikita, trond.myklebust, yingchao.zhou
On Tue, 2007-05-15 at 14:48 +1000, Neil Brown wrote:
> On Thursday May 10, a.p.zijlstra@chello.nl wrote:
> > The latest version of the per device dirty throttling patches.
> >
> > I put in quite a few comments, and added an patch to do per task dirty
> > throttling as well, for RFCs sake :-)
> >
> > I haven't yet come around to do anything but integrety testing on this code
> > base, ie. it built a kernel. I hope to do more tests shorty if time permits...
> >
> > Perhaps the people on bugzilla.kernel.org #7372 might be willing to help out
> > there.
> >
> > Oh, patches are against 2.6.21-mm2
> >
> > --
>
> Patch 12 has:
> +#include <linux/proportions.h>
>
> But that file isn't added until patch 14.
Oops :-)
> Splitting the "proportions" stuff out into lib/ is a good idea.
> You have left some remnants of it's origin though, which mentions of
> BDI
> pages
> total page writeback
>
> The "proportions" library always uses a percpu counter, which is
> perfect of the per-bdi counter, but seems wrong when you use the same
> code for per-task throttling. Have a percpu counter in struct task
> seems very wasteful. You don't need to lock the access to this
> counter as it is only ever access as current-> so a simple "long"
> (or "long long") would do. The global "vm_dirties" still needs to be
> percpu.... I'm not sure what best to do about this.
Right, I did it just to quickly reuse the concept. The task throttling
is very much an RFC (I even had that in the subject, but it magically
got lost in sending).
But if you like it, I could put this patch before the per bdi affair and
clean it up. I thought having only one user would not warrant its own lib/
file.
> The per-task throttling is interesting.
> You reduce the point where a task has to throttle by up to half, based
> on the fraction of recently dirtied pages that the task is responsible
> for.
> So if there is one writer, it now gets only half the space that it
> used to. That is probably OK, we can just increase the space
> available...
> If there are two equally eager writers, they can both use up to the
> 75% mark, so they probably each get 37%, which is reasonable.
> If there is one fast an one slow writer where the slow writer is
> generating dirty pages well below the writeout rate of the device, the
> fast writer will throttle at around 50% and the slow writer will never
> block. That is nice.
Yes, if only ext3's fsync would not be global... :-/
> If you have two writers A and B writing aggressively to two devices X
> and Y with different speeds, say X twice the speed of Y, then in the
> steady state, X gets 2/3 of the space and Y gets 1/3.
> A will dirty twice the pages that B dirties so A will get to use
> 1 - (2/3)/2 == 2/3 of that space or 4/9, and B will get to use 1 - (1/3)/2 ==
> 5/6 of that space or 5/18. Did I get that right?
> So they will each reduce the space available to the other, even though
> they aren't really competing. That might not be a problem, but it is
> interesting...
Indeed, quite an interesting scenario. /me must ponder this; per bdi
task proportions are way overkill...
> It seems that the 'one half' is fairly arbitrary. It could equally
> well be 3/4. That would simply mean there is less differentiation
> between the more and less aggressive writer. I would probably lean
> towards a higher number like 3/4. It should still give reasonable
> differentiation without cutting max amount of dirty memory in half for
> the common 1-writer case.
Yes, it's pulled from a dark place... pretty much anything would do. I
even ran into someone who wanted it to be almost 1 - so that heavy
writers would act almost synchronously.
> A couple of years ago Andrea Arcangeli wrote a patch that did per-task
> throttling, which it is worth comparing with.
> http://lwn.net/Articles/152277/
>
> It takes each task separately, measure rate-of-dirtying over a fixed
> time period, and throttle when that rate would put the system over the
> limit soon. Thus slower dirtiers throttle later.
>
> Having to configure the fixed number (the period) is always awkward,
> and I think your floating average is better suited for the task.
> I doubt if Andrea's patch still applies so a direct comparison might
> be awkward, but it might not hurt to read through it if you haven't
> already.
I remember now, thanks for the pointer, I'll see if I can come up with
something hybrid here.