* [PATCH v7 0/6] rwsem: performance optimizations
From: Tim Chen @ 2013-09-26 22:20 UTC
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Tim Chen, linux-kernel, linux-mm

We have fixed various style issues for version 7 of this patchset
and added a brief explanation of the MCS lock code, which we have put
in a separate file. We would like to have it merged if
there are no objections.

In this patchset, we introduce two categories of optimizations to the
read-write semaphore (rwsem).  The first four patches from Alex Shi reduce
cache bouncing of the sem->count field by pre-reading sem->count and
avoiding the cmpxchg when possible.
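
As a rough sketch of the idea (illustration only; the actual change is in
patch 1 of this series):

    static inline int __down_write_trylock(struct rw_semaphore *sem)
    {
            /*
             * Pre-read: if the lock is visibly not free, skip the cmpxchg
             * and avoid pulling the cache line in exclusive state.
             */
            if (sem->count != RWSEM_UNLOCKED_VALUE)
                    return 0;

            return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
                           RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
    }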

The last two patches introduce optimistic spinning logic, similar to that
in the mutex code, for writer lock acquisition of the rwsem. This addresses
the general 'mutexes outperform writer-rwsems' situation that has been
seen in more than one case.  Users now need not worry about performance
issues when choosing between these two locking mechanisms.
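
In rough terms, the writer slow path now tries to spin before sleeping.
The following is a simplified sketch only: owner_is_running() and
queue_and_sleep() are made-up names standing in for the real helpers, and
the MCS lock that serializes the spinners is omitted; see patch 6 for the
actual implementation.

    static void rwsem_write_slowpath_sketch(struct rw_semaphore *sem)
    {
            /* Spin as long as the owning writer is running on a CPU. */
            while (owner_is_running(sem)) {
                    /* Try to steal the write lock without queueing. */
                    if (rwsem_try_write_lock_unqueued(sem))
                            return;         /* acquired without sleeping */
                    arch_mutex_cpu_relax();
            }
            queue_and_sleep(sem);           /* fall back to the wait queue */
    }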

Without these optimizations, Davidlohr Bueso saw an 8% regression in
aim7's shared and high_systime workloads when he switched i_mmap_mutex
to an rwsem.  Tests were run on an 8-socket, 80-core system.  With the
patchset, he got significant improvements in the aim7 suite instead of
regressions:

alltests (+16.3%), custom (+20%), disk (+19.5%), high_systime (+7%),
shared (+18.4%) and short (+6.3%).

Tim Chen also got a +5% improvement in the exim mail server workload on a
40-core system.

Thanks to Ingo Molnar, Peter Hurley and Peter Zijlstra for reviewing
this patchset.

Regards,
Tim Chen

Changelog:

v7:
1. Rename mcslock.h to mcs_spinlock.h and also rename MCS-related fields
with an mcs prefix.
2. Properly define the type of the *mcs_lock field instead of leaving it as
void *.
3. Add a brief explanation of the MCS lock.

v6:
1. Fix missing mcslock.h file.
2. Fix various code style issues.

v5:
1. Try optimistic spinning before we put the writer on the wait queue,
to avoid bottlenecking at the wait queue.  This provides a 5% boost to the
exim workload and a 2% to 8% boost to aim7.
2. Put the MCS locking code into its own mcslock.h file for better reuse
between mutex.c and rwsem.c.
3. Remove the RWSEM_SPIN_ON_WRITE_OWNER configuration option and make the
behavior the default, per Ingo's suggestion.

v4:
1. Fixed a bug in the task_struct definition in rwsem_can_spin_on_owner.
2. Fixed another typo in the RWSEM_SPIN_ON_WRITE_OWNER config option.

v3:
1. Added ACCESS_ONCE to the sem->count access in rwsem_can_spin_on_owner.
2. Fixed a typo in the RWSEM_SPIN_ON_WRITE_OWNER option in init/Kconfig.

v2:
1. Reorganized the changes to down_write_trylock and do_wake into 4 patches
   and fixed a bug referencing &sem->count where sem->count was intended.
2. Fixed an unsafe sem->owner dereference in rwsem_can_spin_on_owner.
   Defaulted the RWSEM_SPIN_ON_WRITE_OWNER option to on for more seasoning,
   but it can be turned off should it prove detrimental.
3. Various patch comment updates.

Alex Shi (4):
  rwsem: check the lock before cmpxchg in down_write_trylock
  rwsem: remove 'out' label in do_wake
  rwsem: remove try_reader_grant label in do_wake
  rwsem/wake: check lock before doing atomic update

Tim Chen (2):
  MCS Lock: Restructure the MCS lock defines and locking code into its
    own file
  rwsem: do optimistic spinning for writer lock acquisition

 include/asm-generic/rwsem.h  |    8 +-
 include/linux/mcs_spinlock.h |   64 ++++++++++++
 include/linux/mutex.h        |    5 +-
 include/linux/rwsem.h        |    7 +-
 kernel/mutex.c               |   60 ++----------
 kernel/rwsem.c               |   19 ++++-
 lib/rwsem.c                  |  226 ++++++++++++++++++++++++++++++++++++-----
 7 files changed, 300 insertions(+), 89 deletions(-)
 create mode 100644 include/linux/mcs_spinlock.h

-- 
1.7.4.4



* [PATCH v7 1/6] rwsem: check the lock before cmpxchg in down_write_trylock
From: Tim Chen @ 2013-09-26 22:21 UTC
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Tim Chen, linux-kernel, linux-mm

Cmpxchg causes cacheline bouncing while doing the value check, which
causes a scalability issue on a large machine (like an 80-core box).

So a pre-read of the lock can relieve this contention.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/asm-generic/rwsem.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
index bb1e2cd..5ba80e7 100644
--- a/include/asm-generic/rwsem.h
+++ b/include/asm-generic/rwsem.h
@@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)
 
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
+	if (unlikely(sem->count != RWSEM_UNLOCKED_VALUE))
+		return 0;
 
-	tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
-		      RWSEM_ACTIVE_WRITE_BIAS);
-	return tmp == RWSEM_UNLOCKED_VALUE;
+	return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
+		      RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
 }
 
 /*
-- 
1.7.4.4




* [PATCH v7 2/6] rwsem: remove 'out' label in do_wake
From: Tim Chen @ 2013-09-26 22:21 UTC
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Tim Chen, linux-kernel, linux-mm

This makes the code simpler and more readable.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 lib/rwsem.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 19c5fa9..42f1b1a 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 			 * will block as they will notice the queued writer.
 			 */
 			wake_up_process(waiter->task);
-		goto out;
+		return sem;
 	}
 
 	/* Writers might steal the lock before we grant it to the next reader.
@@ -91,7 +91,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 			/* A writer stole the lock. Undo our reader grant. */
 			if (rwsem_atomic_update(-adjustment, sem) &
 						RWSEM_ACTIVE_MASK)
-				goto out;
+				return sem;
 			/* Last active locker left. Retry waking readers. */
 			goto try_reader_grant;
 		}
@@ -136,7 +136,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	sem->wait_list.next = next;
 	next->prev = &sem->wait_list;
 
- out:
 	return sem;
 }
 
-- 
1.7.4.4




* [PATCH v7 3/6] rwsem: remove try_reader_grant label in do_wake
From: Tim Chen @ 2013-09-26 22:21 UTC
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Tim Chen, linux-kernel, linux-mm

This makes the code simpler and more readable.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 lib/rwsem.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 42f1b1a..a8055cf 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -85,15 +85,17 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	adjustment = 0;
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
 		adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
-		oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
-		if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
-			/* A writer stole the lock. Undo our reader grant. */
+		while (1) {
+			oldcount = rwsem_atomic_update(adjustment, sem)
+								- adjustment;
+			if (likely(oldcount >= RWSEM_WAITING_BIAS))
+				break;
+
+			 /* A writer stole the lock.  Undo our reader grant. */
 			if (rwsem_atomic_update(-adjustment, sem) &
 						RWSEM_ACTIVE_MASK)
 				return sem;
 			/* Last active locker left. Retry waking readers. */
-			goto try_reader_grant;
 		}
 	}
 
-- 
1.7.4.4




* [PATCH v7 4/6] rwsem/wake: check lock before doing atomic update
From: Tim Chen @ 2013-09-26 22:21 UTC
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Tim Chen, linux-kernel, linux-mm

Atomically updating the lock and rolling back causes cache bouncing on a
large machine.  A pre-read of the lock status can relieve this problem.

Suggested-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Suggested-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 lib/rwsem.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index a8055cf..1d6e6e8 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;
 	struct list_head *next;
-	long oldcount, woken, loop, adjustment;
+	long woken, loop, adjustment;
 
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
 	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
@@ -86,6 +86,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
 		adjustment = RWSEM_ACTIVE_READ_BIAS;
 		while (1) {
+			long oldcount;
+
+			/* A writer stole the lock. */
+			if (unlikely(sem->count < RWSEM_WAITING_BIAS))
+				return sem;
+
 			oldcount = rwsem_atomic_update(adjustment, sem)
 								- adjustment;
 			if (likely(oldcount >= RWSEM_WAITING_BIAS))
-- 
1.7.4.4




* [PATCH v7 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
From: Tim Chen @ 2013-09-26 22:21 UTC
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Tim Chen, linux-kernel, linux-mm

We will need the MCS lock code for doing optimistic spinning for the rwsem.
Extracting the MCS code from mutex.c and putting it into its own file allows
us to reuse this code easily for the rwsem.

Reviewed-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
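[Usage note, illustration only and not part of the patch: the lock word is
a bare struct mcs_spinlock pointer (NULL when free) and each acquirer
supplies its own node, typically on its stack, so every waiter spins on a
cache line local to it.  The names my_lock and my_critical_section below
are made up for the example.]

    #include <linux/mcs_spinlock.h>

    static struct mcs_spinlock *my_lock;        /* shared lock word */

    static void my_critical_section(void)
    {
            struct mcs_spinlock node;           /* per-acquirer queue node */

            mcs_spin_lock(&my_lock, &node);
            /* ... critical section ... */
            mcs_spin_unlock(&my_lock, &node);
    }
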
 include/linux/mcs_spinlock.h |   64 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/mutex.h        |    5 ++-
 kernel/mutex.c               |   60 ++++----------------------------------
 3 files changed, 74 insertions(+), 55 deletions(-)
 create mode 100644 include/linux/mcs_spinlock.h

diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
new file mode 100644
index 0000000..b5de3b0
--- /dev/null
+++ b/include/linux/mcs_spinlock.h
@@ -0,0 +1,64 @@
+/*
+ * MCS lock defines
+ *
+ * This file contains the main data structure and API definitions of MCS lock.
+ *
+ * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock
+ * with the desirable properties of being fair, and with each cpu trying
+ * to acquire the lock spinning on a local variable.
+ * It avoids expensive cache bouncings that common test-and-set spin-lock
+ * implementations incur.
+ */
+#ifndef __LINUX_MCS_SPINLOCK_H
+#define __LINUX_MCS_SPINLOCK_H
+
+struct mcs_spinlock {
+	struct mcs_spinlock *next;
+	int locked; /* 1 if lock acquired */
+};
+
+/*
+ * We don't inline mcs_spin_lock() so that perf can correctly account for the
+ * time spent in this lock function.
+ */
+static noinline
+void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
+{
+	struct mcs_spinlock *prev;
+
+	/* Init node */
+	node->locked = 0;
+	node->next   = NULL;
+
+	prev = xchg(lock, node);
+	if (likely(prev == NULL)) {
+		/* Lock acquired */
+		node->locked = 1;
+		return;
+	}
+	ACCESS_ONCE(prev->next) = node;
+	smp_wmb();
+	/* Wait until the lock holder passes the lock down */
+	while (!ACCESS_ONCE(node->locked))
+		arch_mutex_cpu_relax();
+}
+
+static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
+{
+	struct mcs_spinlock *next = ACCESS_ONCE(node->next);
+
+	if (likely(!next)) {
+		/*
+		 * Release the lock by setting it to NULL
+		 */
+		if (cmpxchg(lock, node, NULL) == node)
+			return;
+		/* Wait until the next pointer is set */
+		while (!(next = ACCESS_ONCE(node->next)))
+			arch_mutex_cpu_relax();
+	}
+	ACCESS_ONCE(next->locked) = 1;
+	smp_wmb();
+}
+
+#endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index ccd4260..e6eaeea 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -46,6 +46,7 @@
  * - detects multi-task circular deadlocks and prints out all affected
  *   locks and tasks (and only those tasks)
  */
+struct mcs_spinlock;
 struct mutex {
 	/* 1: unlocked, 0: locked, negative: locked, possible waiters */
 	atomic_t		count;
@@ -55,7 +56,7 @@ struct mutex {
 	struct task_struct	*owner;
 #endif
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-	void			*spin_mlock;	/* Spinner MCS lock */
+	struct mcs_spinlock	*mcs_lock;	/* Spinner MCS lock */
 #endif
 #ifdef CONFIG_DEBUG_MUTEXES
 	const char 		*name;
@@ -179,4 +180,4 @@ extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock);
 #define arch_mutex_cpu_relax()	cpu_relax()
 #endif
 
-#endif
+#endif /* __LINUX_MUTEX_H */
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 6d647ae..4640731 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -25,6 +25,7 @@
 #include <linux/spinlock.h>
 #include <linux/interrupt.h>
 #include <linux/debug_locks.h>
+#include <linux/mcs_spinlock.h>
 
 /*
  * In the DEBUG case we are using the "NULL fastpath" for mutexes,
@@ -52,7 +53,7 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
 	INIT_LIST_HEAD(&lock->wait_list);
 	mutex_clear_owner(lock);
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-	lock->spin_mlock = NULL;
+	lock->mcs_lock = NULL;
 #endif
 
 	debug_mutex_init(lock, name, key);
@@ -111,54 +112,7 @@ EXPORT_SYMBOL(mutex_lock);
  * more or less simultaneously, the spinners need to acquire a MCS lock
  * first before spinning on the owner field.
  *
- * We don't inline mspin_lock() so that perf can correctly account for the
- * time spent in this lock function.
  */
-struct mspin_node {
-	struct mspin_node *next ;
-	int		  locked;	/* 1 if lock acquired */
-};
-#define	MLOCK(mutex)	((struct mspin_node **)&((mutex)->spin_mlock))
-
-static noinline
-void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
-{
-	struct mspin_node *prev;
-
-	/* Init node */
-	node->locked = 0;
-	node->next   = NULL;
-
-	prev = xchg(lock, node);
-	if (likely(prev == NULL)) {
-		/* Lock acquired */
-		node->locked = 1;
-		return;
-	}
-	ACCESS_ONCE(prev->next) = node;
-	smp_wmb();
-	/* Wait until the lock holder passes the lock down */
-	while (!ACCESS_ONCE(node->locked))
-		arch_mutex_cpu_relax();
-}
-
-static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
-{
-	struct mspin_node *next = ACCESS_ONCE(node->next);
-
-	if (likely(!next)) {
-		/*
-		 * Release the lock by setting it to NULL
-		 */
-		if (cmpxchg(lock, node, NULL) == node)
-			return;
-		/* Wait until the next pointer is set */
-		while (!(next = ACCESS_ONCE(node->next)))
-			arch_mutex_cpu_relax();
-	}
-	ACCESS_ONCE(next->locked) = 1;
-	smp_wmb();
-}
 
 /*
  * Mutex spinning code migrated from kernel/sched/core.c
@@ -448,7 +402,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 
 	for (;;) {
 		struct task_struct *owner;
-		struct mspin_node  node;
+		struct mcs_spinlock  node;
 
 		if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 0) {
 			struct ww_mutex *ww;
@@ -470,10 +424,10 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		 * If there's an owner, wait for it to either
 		 * release the lock or go to sleep.
 		 */
-		mspin_lock(MLOCK(lock), &node);
+		mcs_spin_lock(&lock->mcs_lock, &node);
 		owner = ACCESS_ONCE(lock->owner);
 		if (owner && !mutex_spin_on_owner(lock, owner)) {
-			mspin_unlock(MLOCK(lock), &node);
+			mcs_spin_unlock(&lock->mcs_lock, &node);
 			goto slowpath;
 		}
 
@@ -488,11 +442,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 			}
 
 			mutex_set_owner(lock);
-			mspin_unlock(MLOCK(lock), &node);
+			mcs_spin_unlock(&lock->mcs_lock, &node);
 			preempt_enable();
 			return 0;
 		}
-		mspin_unlock(MLOCK(lock), &node);
+		mcs_spin_unlock(&lock->mcs_lock, &node);
 
 		/*
 		 * When there's no owner, we might have preempted between the
-- 
1.7.4.4




* [PATCH v7 6/6] rwsem: do optimistic spinning for writer lock acquisition
From: Tim Chen @ 2013-09-26 22:21 UTC
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Tim Chen, linux-kernel, linux-mm

We want to add optimistic spinning to rwsems because the writer rwsem does
not perform as well as mutexes. Tim noticed this for exim (mail server)
workloads when reverting commit 4fc3f1d6, and Davidlohr noticed it when
converting the i_mmap_mutex to an rwsem in some aim7 workloads. We've
noticed that the biggest difference is that, when we fail to acquire a
mutex in the fastpath, optimistic spinning comes into play and we can avoid
a large amount of unnecessary sleeping and the overhead of moving tasks in
and out of the wait queue.

Allowing optimistic spinning before putting the writer on the wait queue
reduces wait queue contention and provides a greater chance for the rwsem
to be acquired. With these changes, rwsem is on par with mutex.

Reviewed-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
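[Note, illustration only and not part of the patch: a simplified view of
how the pieces below fit together.  rwsem_queue_and_wait() is a made-up
name standing in for the existing wait-queue path; the real code is
rwsem_down_write_failed() and rwsem_optimistic_spin() in the diff.]

    static struct rw_semaphore *down_write_slowpath_sketch(struct rw_semaphore *sem)
    {
            /* Spin while the owning writer runs, trying to steal the lock. */
            if (rwsem_optimistic_spin(sem))
                    return sem;             /* acquired without sleeping */

            /* Otherwise queue on sem->wait_list and sleep until we hold it. */
            return rwsem_queue_and_wait(sem);
    }
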
 include/linux/rwsem.h |    7 ++-
 kernel/rwsem.c        |   19 +++++-
 lib/rwsem.c           |  201 ++++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 206 insertions(+), 21 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..aba7920 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -22,10 +22,13 @@ struct rw_semaphore;
 #include <linux/rwsem-spinlock.h> /* use a generic implementation */
 #else
 /* All arch specific implementations share the same struct */
+struct mcs_spinlock;
 struct rw_semaphore {
 	long			count;
 	raw_spinlock_t		wait_lock;
 	struct list_head	wait_list;
+	struct task_struct	*owner; /* write owner */
+	struct mcs_spinlock	*mcs_lock;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map	dep_map;
 #endif
@@ -58,7 +61,9 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 #define __RWSEM_INITIALIZER(name)			\
 	{ RWSEM_UNLOCKED_VALUE,				\
 	  __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock),	\
-	  LIST_HEAD_INIT((name).wait_list)		\
+	  LIST_HEAD_INIT((name).wait_list),		\
+	  NULL,						\
+	  NULL						\
 	  __RWSEM_DEP_MAP_INIT(name) }
 
 #define DECLARE_RWSEM(name) \
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cfff143..d74d1c9 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -12,6 +12,16 @@
 
 #include <linux/atomic.h>
 
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+	sem->owner = current;
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+	sem->owner = NULL;
+}
+
 /*
  * lock for reading
  */
@@ -48,6 +58,7 @@ void __sched down_write(struct rw_semaphore *sem)
 	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+	rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(down_write);
@@ -59,8 +70,10 @@ int down_write_trylock(struct rw_semaphore *sem)
 {
 	int ret = __down_write_trylock(sem);
 
-	if (ret == 1)
+	if (ret == 1) {
 		rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
+		rwsem_set_owner(sem);
+	}
 	return ret;
 }
 
@@ -86,6 +99,7 @@ void up_write(struct rw_semaphore *sem)
 	rwsem_release(&sem->dep_map, 1, _RET_IP_);
 
 	__up_write(sem);
+	rwsem_clear_owner(sem);
 }
 
 EXPORT_SYMBOL(up_write);
@@ -100,6 +114,7 @@ void downgrade_write(struct rw_semaphore *sem)
 	 * dependency.
 	 */
 	__downgrade_write(sem);
+	rwsem_clear_owner(sem);
 }
 
 EXPORT_SYMBOL(downgrade_write);
@@ -122,6 +137,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
 	rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
 
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+	rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(_down_write_nest_lock);
@@ -141,6 +157,7 @@ void down_write_nested(struct rw_semaphore *sem, int subclass)
 	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
 
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+	rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(down_write_nested);
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 1d6e6e8..cc3b33e 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -10,6 +10,8 @@
 #include <linux/sched.h>
 #include <linux/init.h>
 #include <linux/export.h>
+#include <linux/sched/rt.h>
+#include <linux/mcs_spinlock.h>
 
 /*
  * Initialize an rwsem:
@@ -27,6 +29,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	sem->count = RWSEM_UNLOCKED_VALUE;
 	raw_spin_lock_init(&sem->wait_lock);
 	INIT_LIST_HEAD(&sem->wait_list);
+	sem->owner = NULL;
+	sem->mcs_lock = NULL;
 }
 
 EXPORT_SYMBOL(__init_rwsem);
@@ -194,14 +198,177 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 	return sem;
 }
 
+static inline int rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+{
+	if (!(count & RWSEM_ACTIVE_MASK)) {
+		/* Try acquiring the write lock. */
+		if (sem->count == RWSEM_WAITING_BIAS &&
+		    cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
+			    RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_WAITING_BIAS) {
+			if (!list_is_singular(&sem->wait_list))
+				rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
+			return 1;
+		}
+	}
+	return 0;
+}
+
+/*
+ * Try to acquire write lock before the writer has been put on wait queue.
+ */
+static inline int rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+{
+	long count;
+
+	count = ACCESS_ONCE(sem->count);
+retry:
+	if (count == RWSEM_WAITING_BIAS) {
+		count = cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
+			    RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS);
+		/* allow write lock stealing, try acquiring the write lock. */
+		if (count == RWSEM_WAITING_BIAS)
+			goto acquired;
+		else if (count == 0)
+			goto retry;
+	} else if (count == 0) {
+		count = cmpxchg(&sem->count, 0, RWSEM_ACTIVE_WRITE_BIAS);
+		if (count == 0)
+			goto acquired;
+		else if (count == RWSEM_WAITING_BIAS)
+			goto retry;
+	}
+	return 0;
+
+acquired:
+	return 1;
+}
+
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+	int retval;
+	struct task_struct *owner;
+
+	rcu_read_lock();
+	owner = ACCESS_ONCE(sem->owner);
+
+	/* Spin only if active writer running */
+	if (owner)
+		retval = owner->on_cpu;
+	else
+		retval = false;
+
+	rcu_read_unlock();
+	/*
+	 * if lock->owner is not set, the sem owner may have just acquired
+	 * it and not set the owner yet, or the sem has been released, or
+	 * reader active.
+	 */
+	return retval;
+}
+
+static inline bool owner_running(struct rw_semaphore *lock,
+				struct task_struct *owner)
+{
+	if (lock->owner != owner)
+		return false;
+
+	/*
+	 * Ensure we emit the owner->on_cpu, dereference _after_ checking
+	 * lock->owner still matches owner, if that fails, owner might
+	 * point to free()d memory, if it still matches, the rcu_read_lock()
+	 * ensures the memory stays valid.
+	 */
+	barrier();
+
+	return owner->on_cpu;
+}
+
+static noinline
+int rwsem_spin_on_owner(struct rw_semaphore *lock, struct task_struct *owner)
+{
+	rcu_read_lock();
+	while (owner_running(lock, owner)) {
+		if (need_resched())
+			break;
+
+		arch_mutex_cpu_relax();
+	}
+	rcu_read_unlock();
+
+	/*
+	 * We break out the loop above on need_resched() or when the
+	 * owner changed, which is a sign for heavy contention. Return
+	 * success only when lock->owner is NULL.
+	 */
+	return lock->owner == NULL;
+}
+
+int rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+	struct task_struct *owner;
+	int ret = 0;
+
+	/* sem->wait_lock should not be held when doing optimistic spinning */
+	if (!rwsem_can_spin_on_owner(sem))
+		return ret;
+
+	preempt_disable();
+	for (;;) {
+		struct mcs_spinlock node;
+
+		mcs_spin_lock(&sem->mcs_lock, &node);
+		owner = ACCESS_ONCE(sem->owner);
+		if (owner && !rwsem_spin_on_owner(sem, owner)) {
+			mcs_spin_unlock(&sem->mcs_lock, &node);
+			break;
+		}
+
+		/* wait_lock will be acquired if write_lock is obtained */
+		if (rwsem_try_write_lock_unqueued(sem)) {
+			mcs_spin_unlock(&sem->mcs_lock, &node);
+			ret = 1;
+			break;
+		}
+		mcs_spin_unlock(&sem->mcs_lock, &node);
+
+		/*
+		 * When there's no owner, we might have preempted between the
+		 * owner acquiring the lock and setting the owner field. If
+		 * we're an RT task that will live-lock because we won't let
+		 * the owner complete.
+		 */
+		if (!owner && (need_resched() || rt_task(current)))
+			break;
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		arch_mutex_cpu_relax();
+	}
+	preempt_enable();
+
+	return ret;
+}
+
 /*
  * wait until we successfully acquire the write lock
  */
 struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 {
-	long count, adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
+	long count;
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
+	bool waiting = true;
+
+	/* undo write bias from down_write operation, stop active locking */
+	count = rwsem_atomic_update(-RWSEM_ACTIVE_WRITE_BIAS, sem);
+
+	/* do optimistic spinning and steal lock if possible */
+	if (rwsem_optimistic_spin(sem))
+		goto done;
 
 	/* set up my own style of waitqueue */
 	waiter.task = tsk;
@@ -209,33 +376,28 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 
 	raw_spin_lock_irq(&sem->wait_lock);
 	if (list_empty(&sem->wait_list))
-		adjustment += RWSEM_WAITING_BIAS;
+		waiting = false;
 	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
-	count = rwsem_atomic_update(adjustment, sem);
+	if (waiting)
+		count = ACCESS_ONCE(sem->count);
+	else
+		count = rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
 
-	/* If there were already threads queued before us and there are no
+	/*
+	 * If there were already threads queued before us and there are no
 	 * active writers, the lock must be read owned; so we try to wake
-	 * any read locks that were queued ahead of us. */
-	if (count > RWSEM_WAITING_BIAS &&
-	    adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
+	 * any read locks that were queued ahead of us.
+	 */
+	if ((count > RWSEM_WAITING_BIAS) && waiting)
 		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READERS);
 
 	/* wait until we successfully acquire the lock */
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-	while (true) {
-		if (!(count & RWSEM_ACTIVE_MASK)) {
-			/* Try acquiring the write lock. */
-			count = RWSEM_ACTIVE_WRITE_BIAS;
-			if (!list_is_singular(&sem->wait_list))
-				count += RWSEM_WAITING_BIAS;
-
-			if (sem->count == RWSEM_WAITING_BIAS &&
-			    cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
-							RWSEM_WAITING_BIAS)
-				break;
-		}
+	for (;;) {
+		if (rwsem_try_write_lock(count, sem))
+			break;
 
 		raw_spin_unlock_irq(&sem->wait_lock);
 
@@ -250,6 +412,7 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 
 	list_del(&waiter.list);
 	raw_spin_unlock_irq(&sem->wait_lock);
+done:
 	tsk->state = TASK_RUNNING;
 
 	return sem;
-- 
1.7.4.4


