linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/2] fast_gup for shared futexes
@ 2008-04-04 19:33 Peter Zijlstra
  2008-04-04 19:33 ` [RFC PATCH 1/2] futex: rely on get_user_pages() " Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Peter Zijlstra @ 2008-04-04 19:33 UTC (permalink / raw)
  To: Nick Piggin, Eric Dumazet, Ingo Molnar, Thomas Gleixner
  Cc: Peter Zijlstra, linux-kernel, linux-mm

Hi,

this patch series removes mmap_sem from the fast path of shared futexes by
making use of Nick's recent fast_gup() patches. Full series at:

  http://programming.kicks-ass.net/kernel-patches/futex-fast_gup/v2.6.24.4-rt4/

--

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH 1/2] futex: rely on get_user_pages() for shared futexes
  2008-04-04 19:33 [RFC PATCH 0/2] fast_gup for shared futexes Peter Zijlstra
@ 2008-04-04 19:33 ` Peter Zijlstra
  2008-04-08 11:40   ` Nick Piggin
  2008-04-04 19:33 ` [RFC PATCH 2/2] futex: use fast_gup() Peter Zijlstra
  2008-04-04 19:56 ` [RFC PATCH 0/2] fast_gup for shared futexes Thomas Gleixner
  2 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2008-04-04 19:33 UTC (permalink / raw)
  To: Nick Piggin, Eric Dumazet, Ingo Molnar, Thomas Gleixner
  Cc: Peter Zijlstra, linux-kernel, linux-mm

[-- Attachment #1: futex-gup.patch --]
[-- Type: text/plain, Size: 8982 bytes --]

On the way of getting rid of the mmap_sem requirement for shared futexes,
start by relying on get_user_pages().

This requires we get the page associated with the key, and put the page when
we're done with it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/futex.h |   12 ++--
 kernel/futex.c        |  122 ++++++++++++++++++++------------------------------
 2 files changed, 55 insertions(+), 79 deletions(-)

Index: linux-2.6/include/linux/futex.h
===================================================================
--- linux-2.6.orig/include/linux/futex.h
+++ linux-2.6/include/linux/futex.h
@@ -124,18 +124,14 @@ handle_futex_death(u32 __user *uaddr, st
  *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
  *       (no reference on an inode or mm)
  *  01 : Shared futex (PTHREAD_PROCESS_SHARED)
- *	mapped on a file (reference on the underlying inode)
- *  10 : Shared futex (PTHREAD_PROCESS_SHARED)
- *       (but private mapping on an mm, and reference taken on it)
-*/
+ */
 
-#define FUT_OFF_INODE    1 /* We set bit 0 if key has a reference on inode */
-#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
+#define FUT_OFF_PAGE     1
 
 union futex_key {
 	struct {
 		unsigned long pgoff;
-		struct inode *inode;
+		struct page *page;
 		int offset;
 	} shared;
 	struct {
@@ -150,6 +146,8 @@ union futex_key {
 	} both;
 };
 
+#define FUTEX_KEY_INIT (union futex_key) { .both = { .ptr = NULL } }
+
 #ifdef CONFIG_FUTEX
 extern void exit_robust_list(struct task_struct *curr);
 extern void exit_pi_state_list(struct task_struct *curr);
Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c
+++ linux-2.6/kernel/futex.c
@@ -190,7 +190,6 @@ static int get_futex_key(u32 __user *uad
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma;
 	struct page *page;
 	int err;
 
@@ -202,6 +201,9 @@ static int get_futex_key(u32 __user *uad
 		return -EINVAL;
 	address -= key->both.offset;
 
+	if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
+		return -EFAULT;
+
 	/*
 	 * PROCESS_PRIVATE futexes are fast.
 	 * As the mm cannot disappear under us and the 'key' only needs
@@ -210,67 +212,37 @@ static int get_futex_key(u32 __user *uad
 	 *        but access_ok() should be faster than find_vma()
 	 */
 	if (!fshared) {
-		if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
-			return -EFAULT;
 		key->private.mm = mm;
 		key->private.address = address;
 		return 0;
 	}
-	/*
-	 * The futex is hashed differently depending on whether
-	 * it's in a shared or private mapping.  So check vma first.
-	 */
-	vma = find_extend_vma(mm, address);
-	if (unlikely(!vma))
-		return -EFAULT;
 
-	/*
-	 * Permissions.
-	 */
-	if (unlikely((vma->vm_flags & (VM_IO|VM_READ)) != VM_READ))
-		return (vma->vm_flags & VM_IO) ? -EPERM : -EACCES;
+	err = get_user_pages(current, mm, address, 1, 0, 0, &page, NULL);
+	if (err < 0)
+		return err;
+
+	key->shared.page = page;
+	key->both.offset |= FUT_OFF_PAGE;
 
 	/*
-	 * Private mappings are handled in a simple way.
-	 *
-	 * NOTE: When userspace waits on a MAP_SHARED mapping, even if
-	 * it's a read-only handle, it's expected that futexes attach to
-	 * the object not the particular process.  Therefore we use
-	 * VM_MAYSHARE here, not VM_SHARED which is restricted to shared
-	 * mappings of _writable_ handles.
+	 * doesn't really matter anyway, as we'll end up finding the
+	 * same page again
 	 */
-	if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
-		key->both.offset |= FUT_OFF_MMSHARED; /* reference taken on mm */
-		key->private.mm = mm;
+	if (PageAnon(page))
 		key->private.address = address;
-		return 0;
-	}
+	else
+		key->shared.pgoff = page->index;
 
-	/*
-	 * Linear file mappings are also simple.
-	 */
-	key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
-	key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
-	if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
-		key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
-				     + vma->vm_pgoff);
-		return 0;
-	}
+	return 0;
+}
 
-	/*
-	 * We could walk the page table to read the non-linear
-	 * pte, and get the page index without fetching the page
-	 * from swap.  But that's a lot of code to duplicate here
-	 * for a rare case, so we simply fetch the page.
-	 */
-	err = get_user_pages(current, mm, address, 1, 0, 0, &page, NULL);
-	if (err >= 0) {
-		key->shared.pgoff =
-			page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
-		put_page(page);
-		return 0;
-	}
-	return err;
+static void put_futex_key(struct rw_semaphore *fshared, union futex_key *key)
+{
+	if (!key->both.ptr)
+		return;
+
+	if (key->both.offset & FUT_OFF_PAGE)
+		put_page(key->shared.page);
 }
 
 /*
@@ -280,16 +252,13 @@ static int get_futex_key(u32 __user *uad
  */
 static void get_futex_key_refs(union futex_key *key)
 {
-	if (key->both.ptr == 0)
+	if (!key->both.ptr)
 		return;
-	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
-		case FUT_OFF_INODE:
-			atomic_inc(&key->shared.inode->i_count);
-			break;
-		case FUT_OFF_MMSHARED:
-			atomic_inc(&key->private.mm->mm_count);
-			break;
-	}
+
+	if (key->both.offset & FUT_OFF_PAGE)
+		get_page(key->shared.page);
+	else
+		atomic_inc(&key->private.mm->mm_count);
 }
 
 /*
@@ -300,14 +269,11 @@ static void drop_futex_key_refs(union fu
 {
 	if (!key->both.ptr)
 		return;
-	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
-		case FUT_OFF_INODE:
-			iput(key->shared.inode);
-			break;
-		case FUT_OFF_MMSHARED:
-			mmdrop(key->private.mm);
-			break;
-	}
+
+	if (key->both.offset & FUT_OFF_PAGE)
+		put_page(key->shared.page);
+	else
+		mmdrop(key->private.mm);
 }
 
 static u32 cmpxchg_futex_value_locked(u32 __user *uaddr, u32 uval, u32 newval)
@@ -733,7 +699,7 @@ static int futex_wake(u32 __user *uaddr,
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
 	struct plist_head *head;
-	union futex_key key;
+	union futex_key key = FUTEX_KEY_INIT;
 	int ret;
 
 	futex_lock_mm(fshared);
@@ -760,6 +726,7 @@ static int futex_wake(u32 __user *uaddr,
 
 	spin_unlock(&hb->lock);
 out:
+	put_futex_key(fshared, &key);
 	futex_unlock_mm(fshared);
 	return ret;
 }
@@ -773,7 +740,7 @@ futex_wake_op(u32 __user *uaddr1, struct
 	      u32 __user *uaddr2,
 	      int nr_wake, int nr_wake2, int op)
 {
-	union futex_key key1, key2;
+	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb1, *hb2;
 	struct plist_head *head;
 	struct futex_q *this, *next;
@@ -873,6 +840,8 @@ retry:
 	if (hb1 != hb2)
 		spin_unlock(&hb2->lock);
 out:
+	put_futex_key(fshared, &key2);
+	put_futex_key(fshared, &key1);
 	futex_unlock_mm(fshared);
 
 	return ret;
@@ -886,7 +855,7 @@ static int futex_requeue(u32 __user *uad
 			 u32 __user *uaddr2,
 			 int nr_wake, int nr_requeue, u32 *cmpval)
 {
-	union futex_key key1, key2;
+	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb1, *hb2;
 	struct plist_head *head1;
 	struct futex_q *this, *next;
@@ -978,6 +947,8 @@ out_unlock:
 		drop_futex_key_refs(&key1);
 
 out:
+	put_futex_key(fshared, &key2);
+	put_futex_key(fshared, &key1);
 	futex_unlock_mm(fshared);
 	return ret;
 }
@@ -1185,6 +1156,7 @@ static int futex_wait(u32 __user *uaddr,
  retry:
 	futex_lock_mm(fshared);
 
+	q.key = FUTEX_KEY_INIT;
 	ret = get_futex_key(uaddr, fshared, &q.key);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
@@ -1324,6 +1296,7 @@ static int futex_wait(u32 __user *uaddr,
 	queue_unlock(&q, hb);
 
  out_release_sem:
+	put_futex_key(fshared, &q.key);
 	futex_unlock_mm(fshared);
 	return ret;
 }
@@ -1373,6 +1346,7 @@ static int futex_lock_pi(u32 __user *uad
  retry:
 	futex_lock_mm(fshared);
 
+	q.key = FUTEX_KEY_INIT;
 	ret = get_futex_key(uaddr, fshared, &q.key);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
@@ -1587,6 +1561,7 @@ static int futex_lock_pi(u32 __user *uad
 	queue_unlock(&q, hb);
 
  out_release_sem:
+	put_futex_key(fshared, &q.key);
 	futex_unlock_mm(fshared);
 	return ret;
 
@@ -1629,7 +1604,7 @@ static int futex_unlock_pi(u32 __user *u
 	struct futex_q *this, *next;
 	u32 uval;
 	struct plist_head *head;
-	union futex_key key;
+	union futex_key key = FUTEX_KEY_INIT;
 	int ret, attempt = 0;
 
 retry:
@@ -1702,6 +1677,7 @@ retry_unlocked:
 out_unlock:
 	spin_unlock(&hb->lock);
 out:
+	put_futex_key(fshared, &key);
 	futex_unlock_mm(fshared);
 
 	return ret;
@@ -1822,6 +1798,7 @@ static int futex_fd(u32 __user *uaddr, i
 
 	fshared = &current->mm->mmap_sem;
 	down_read(fshared);
+	q->key = FUTEX_KEY_INIT;
 	err = get_futex_key(uaddr, fshared, &q->key);
 
 	if (unlikely(err != 0)) {
@@ -1837,6 +1814,7 @@ static int futex_fd(u32 __user *uaddr, i
 	filp->private_data = q;
 
 	queue_me(q, ret, filp);
+	put_futex_key(fshared, &q->key);
 	up_read(fshared);
 
 	/* Now we map fd to filp, so userspace can access it */

--

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH 2/2] futex: use fast_gup()
  2008-04-04 19:33 [RFC PATCH 0/2] fast_gup for shared futexes Peter Zijlstra
  2008-04-04 19:33 ` [RFC PATCH 1/2] futex: rely on get_user_pages() " Peter Zijlstra
@ 2008-04-04 19:33 ` Peter Zijlstra
  2008-04-04 19:47   ` Peter Zijlstra
  2008-04-04 19:56 ` [RFC PATCH 0/2] fast_gup for shared futexes Thomas Gleixner
  2 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2008-04-04 19:33 UTC (permalink / raw)
  To: Nick Piggin, Eric Dumazet, Ingo Molnar, Thomas Gleixner
  Cc: Peter Zijlstra, linux-kernel, linux-mm

[-- Attachment #1: futex-fast_gup.patch --]
[-- Type: text/plain, Size: 7184 bytes --]

now that we rely on get_user_pages()/put_page() for the shared key handling
swhitch to fast_gup() and remove all the mmap_sem stuff.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/futex.c |   86 +--------------------------------------------------------
 1 file changed, 3 insertions(+), 83 deletions(-)

Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c
+++ linux-2.6/kernel/futex.c
@@ -129,24 +129,6 @@ static struct vfsmount *futex_mnt;
 int futex_performance_hack;
 
 /*
- * Take mm->mmap_sem, when futex is shared
- */
-static inline void futex_lock_mm(struct rw_semaphore *fshared)
-{
-	if (fshared && !futex_performance_hack)
-		down_read(fshared);
-}
-
-/*
- * Release mm->mmap_sem, when the futex is shared
- */
-static inline void futex_unlock_mm(struct rw_semaphore *fshared)
-{
-	if (fshared && !futex_performance_hack)
-		up_read(fshared);
-}
-
-/*
  * We hash on the keys returned from get_futex_key (see below).
  */
 static struct futex_hash_bucket *hash_futex(union futex_key *key)
@@ -217,7 +199,7 @@ static int get_futex_key(u32 __user *uad
 		return 0;
 	}
 
-	err = get_user_pages(current, mm, address, 1, 0, 0, &page, NULL);
+	err = fast_gup(address, 1, 0, &page);
 	if (err < 0)
 		return err;
 
@@ -312,8 +294,7 @@ static int futex_handle_fault(unsigned l
 	if (attempt > 2)
 		return ret;
 
-	if (!fshared)
-		down_read(&mm->mmap_sem);
+	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, address);
 	if (vma && address >= vma->vm_start &&
 	    (vma->vm_flags & VM_WRITE)) {
@@ -333,8 +314,7 @@ static int futex_handle_fault(unsigned l
 				current->min_flt++;
 		}
 	}
-	if (!fshared)
-		up_read(&mm->mmap_sem);
+	up_read(&mm->mmap_sem);
 	return ret;
 }
 
@@ -702,8 +682,6 @@ static int futex_wake(u32 __user *uaddr,
 	union futex_key key = FUTEX_KEY_INIT;
 	int ret;
 
-	futex_lock_mm(fshared);
-
 	ret = get_futex_key(uaddr, fshared, &key);
 	if (unlikely(ret != 0))
 		goto out;
@@ -727,7 +705,6 @@ static int futex_wake(u32 __user *uaddr,
 	spin_unlock(&hb->lock);
 out:
 	put_futex_key(fshared, &key);
-	futex_unlock_mm(fshared);
 	return ret;
 }
 
@@ -747,8 +724,6 @@ futex_wake_op(u32 __user *uaddr1, struct
 	int ret, op_ret, attempt = 0;
 
 retryfull:
-	futex_lock_mm(fshared);
-
 	ret = get_futex_key(uaddr1, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
@@ -799,12 +774,6 @@ retry:
 			goto retry;
 		}
 
-		/*
-		 * If we would have faulted, release mmap_sem,
-		 * fault it in and start all over again.
-		 */
-		futex_unlock_mm(fshared);
-
 		ret = get_user(dummy, uaddr2);
 		if (ret)
 			return ret;
@@ -842,7 +811,6 @@ retry:
 out:
 	put_futex_key(fshared, &key2);
 	put_futex_key(fshared, &key1);
-	futex_unlock_mm(fshared);
 
 	return ret;
 }
@@ -862,8 +830,6 @@ static int futex_requeue(u32 __user *uad
 	int ret, drop_count = 0;
 
  retry:
-	futex_lock_mm(fshared);
-
 	ret = get_futex_key(uaddr1, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
@@ -886,12 +852,6 @@ static int futex_requeue(u32 __user *uad
 			if (hb1 != hb2)
 				spin_unlock(&hb2->lock);
 
-			/*
-			 * If we would have faulted, release mmap_sem, fault
-			 * it in and start all over again.
-			 */
-			futex_unlock_mm(fshared);
-
 			ret = get_user(curval, uaddr1);
 
 			if (!ret)
@@ -949,7 +909,6 @@ out_unlock:
 out:
 	put_futex_key(fshared, &key2);
 	put_futex_key(fshared, &key1);
-	futex_unlock_mm(fshared);
 	return ret;
 }
 
@@ -1154,8 +1113,6 @@ static int futex_wait(u32 __user *uaddr,
 
 	q.pi_state = NULL;
  retry:
-	futex_lock_mm(fshared);
-
 	q.key = FUTEX_KEY_INIT;
 	ret = get_futex_key(uaddr, fshared, &q.key);
 	if (unlikely(ret != 0))
@@ -1188,12 +1145,6 @@ static int futex_wait(u32 __user *uaddr,
 	if (unlikely(ret)) {
 		queue_unlock(&q, hb);
 
-		/*
-		 * If we would have faulted, release mmap_sem, fault it in and
-		 * start all over again.
-		 */
-		futex_unlock_mm(fshared);
-
 		ret = get_user(uval, uaddr);
 
 		if (!ret)
@@ -1208,12 +1159,6 @@ static int futex_wait(u32 __user *uaddr,
 	__queue_me(&q, hb);
 
 	/*
-	 * Now the futex is queued and we have checked the data, we
-	 * don't want to hold mmap_sem while we sleep.
-	 */
-	futex_unlock_mm(fshared);
-
-	/*
 	 * There might have been scheduling since the queue_me(), as we
 	 * cannot hold a spinlock across the get_user() in case it
 	 * faults, and we cannot just set TASK_INTERRUPTIBLE state when
@@ -1297,7 +1242,6 @@ static int futex_wait(u32 __user *uaddr,
 
  out_release_sem:
 	put_futex_key(fshared, &q.key);
-	futex_unlock_mm(fshared);
 	return ret;
 }
 
@@ -1344,8 +1288,6 @@ static int futex_lock_pi(u32 __user *uad
 
 	q.pi_state = NULL;
  retry:
-	futex_lock_mm(fshared);
-
 	q.key = FUTEX_KEY_INIT;
 	ret = get_futex_key(uaddr, fshared, &q.key);
 	if (unlikely(ret != 0))
@@ -1435,7 +1377,6 @@ static int futex_lock_pi(u32 __user *uad
 			 * exit to complete.
 			 */
 			queue_unlock(&q, hb);
-			futex_unlock_mm(fshared);
 			cond_resched();
 			goto retry;
 
@@ -1467,12 +1408,6 @@ static int futex_lock_pi(u32 __user *uad
 	 */
 	__queue_me(&q, hb);
 
-	/*
-	 * Now the futex is queued and we have checked the data, we
-	 * don't want to hold mmap_sem while we sleep.
-	 */
-	futex_unlock_mm(fshared);
-
 	WARN_ON(!q.pi_state);
 	/*
 	 * Block on the PI mutex:
@@ -1485,7 +1420,6 @@ static int futex_lock_pi(u32 __user *uad
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}
 
-	futex_lock_mm(fshared);
 	spin_lock(q.lock_ptr);
 
 	if (!ret) {
@@ -1553,7 +1487,6 @@ static int futex_lock_pi(u32 __user *uad
 
 	/* Unqueue and drop the lock */
 	unqueue_me_pi(&q);
-	futex_unlock_mm(fshared);
 
 	return ret != -EINTR ? ret : -ERESTARTNOINTR;
 
@@ -1562,7 +1495,6 @@ static int futex_lock_pi(u32 __user *uad
 
  out_release_sem:
 	put_futex_key(fshared, &q.key);
-	futex_unlock_mm(fshared);
 	return ret;
 
  uaddr_faulted:
@@ -1584,8 +1516,6 @@ static int futex_lock_pi(u32 __user *uad
 		goto retry_unlocked;
 	}
 
-	futex_unlock_mm(fshared);
-
 	ret = get_user(uval, uaddr);
 	if (!ret && (uval != -EFAULT))
 		goto retry;
@@ -1615,10 +1545,6 @@ retry:
 	 */
 	if ((uval & FUTEX_TID_MASK) != task_pid_vnr(current))
 		return -EPERM;
-	/*
-	 * First take all the futex related locks:
-	 */
-	futex_lock_mm(fshared);
 
 	ret = get_futex_key(uaddr, fshared, &key);
 	if (unlikely(ret != 0))
@@ -1678,7 +1604,6 @@ out_unlock:
 	spin_unlock(&hb->lock);
 out:
 	put_futex_key(fshared, &key);
-	futex_unlock_mm(fshared);
 
 	return ret;
 
@@ -1702,8 +1627,6 @@ pi_faulted:
 		goto retry_unlocked;
 	}
 
-	futex_unlock_mm(fshared);
-
 	ret = get_user(uval, uaddr);
 	if (!ret && (uval != -EFAULT))
 		goto retry;
@@ -1797,12 +1720,10 @@ static int futex_fd(u32 __user *uaddr, i
 	q->pi_state = NULL;
 
 	fshared = &current->mm->mmap_sem;
-	down_read(fshared);
 	q->key = FUTEX_KEY_INIT;
 	err = get_futex_key(uaddr, fshared, &q->key);
 
 	if (unlikely(err != 0)) {
-		up_read(fshared);
 		kfree(q);
 		goto error;
 	}
@@ -1815,7 +1736,6 @@ static int futex_fd(u32 __user *uaddr, i
 
 	queue_me(q, ret, filp);
 	put_futex_key(fshared, &q->key);
-	up_read(fshared);
 
 	/* Now we map fd to filp, so userspace can access it */
 	fd_install(ret, filp);

--

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 2/2] futex: use fast_gup()
  2008-04-04 19:33 ` [RFC PATCH 2/2] futex: use fast_gup() Peter Zijlstra
@ 2008-04-04 19:47   ` Peter Zijlstra
  0 siblings, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2008-04-04 19:47 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-mm

On Fri, 2008-04-04 at 21:33 +0200, Peter Zijlstra wrote:

> @@ -217,7 +199,7 @@ static int get_futex_key(u32 __user *uad
>  		return 0;
>  	}
>  
> -	err = get_user_pages(current, mm, address, 1, 0, 0, &page, NULL);
> +	err = fast_gup(address, 1, 0, &page);
>  	if (err < 0)
>  		return err;
>  


Failed to include the following hunk...

Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c
+++ linux-2.6/kernel/futex.c
@@ -203,6 +203,9 @@ static int get_futex_key(u32 __user *uad
 	if (err < 0)
 		return err;
 
+	if (!page)
+		return -EFAULT;
+
 	key->shared.page = page;
 	key->both.offset |= FUT_OFF_PAGE;
 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/2] fast_gup for shared futexes
  2008-04-04 19:33 [RFC PATCH 0/2] fast_gup for shared futexes Peter Zijlstra
  2008-04-04 19:33 ` [RFC PATCH 1/2] futex: rely on get_user_pages() " Peter Zijlstra
  2008-04-04 19:33 ` [RFC PATCH 2/2] futex: use fast_gup() Peter Zijlstra
@ 2008-04-04 19:56 ` Thomas Gleixner
  2 siblings, 0 replies; 9+ messages in thread
From: Thomas Gleixner @ 2008-04-04 19:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Eric Dumazet, Ingo Molnar, linux-kernel, linux-mm

On Fri, 4 Apr 2008, Peter Zijlstra wrote:
> Hi,
> 
> this patch series removes mmap_sem from the fast path of shared futexes by
> making use of Nick's recent fast_gup() patches. Full series at:
> 
>   http://programming.kicks-ass.net/kernel-patches/futex-fast_gup/v2.6.24.4-rt4/

Looks good at the first glance. Need to look at the corner cases, but
the code does not really depend on mmap_sem anymore so the chance that
it blows up is pretty low.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 1/2] futex: rely on get_user_pages() for shared futexes
  2008-04-04 19:33 ` [RFC PATCH 1/2] futex: rely on get_user_pages() " Peter Zijlstra
@ 2008-04-08 11:40   ` Nick Piggin
  2008-04-08 16:59     ` Peter Zijlstra
  2008-04-09 13:51     ` Peter Zijlstra
  0 siblings, 2 replies; 9+ messages in thread
From: Nick Piggin @ 2008-04-08 11:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric Dumazet, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-mm

On Saturday 05 April 2008 06:33, Peter Zijlstra wrote:
> On the way of getting rid of the mmap_sem requirement for shared futexes,
> start by relying on get_user_pages().
>
> This requires we get the page associated with the key, and put the page
> when we're done with it.

Hi Peter,

Cool.

I'm all for removing mmap_sem requirement from shared futexes...
Are there many apps which make a non-trivial use of them I wonder?
I guess it will help legacy (pre-FUTEX_PRIVATE) usespaces in
performance too, though.

What I'm worried about with this is invalidate or truncate races.
With direct IO, it obviously doesn't matter because the only
requirement is that the page existed at the address at some point
during the syscall... 

So I'd really like you to not carry the page around in the key, but
just continue using the same key we have now. Also, lock the page
and ensure it hasn't been truncated before taking the inode from the
key and incrementing its count (page lock's extra atomics should be
more or less cancelled out by fewer mmap_sem atomic ops).

get_futex_key should look something like this I would have thought:?

BTW. I like that it removes a lot of fshared crap from around
the place. And also this is a really good user of fast_gup
because I guess it should usually be faulted in. The problem is
that this could be a little more expensive for architectures that
don't implement fast_gup. Though most should be able to.

@@ -191,7 +191,6 @@ static int get_futex_key(u32 __user *uad
 {
        unsigned long address = (unsigned long)uaddr;
        struct mm_struct *mm = current->mm;
-       struct vm_area_struct *vma;
        struct page *page;
        int err;

@@ -210,27 +209,26 @@ static int get_futex_key(u32 __user *uad
         * Note : We do have to check 'uaddr' is a valid user address,
         *        but access_ok() should be faster than find_vma()
         */
-       if (!fshared) {
+       if (likely(!fshared)) {
                if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
                        return -EFAULT;
                key->private.mm = mm;
                key->private.address = address;
                return 0;
        }
-       /*
-        * The futex is hashed differently depending on whether
-        * it's in a shared or private mapping.  So check vma first.
-        */
-       vma = find_extend_vma(mm, address);
-       if (unlikely(!vma))
-               return -EFAULT;
-
-       /*
-        * Permissions.
-        */
-       if (unlikely((vma->vm_flags & (VM_IO|VM_READ)) != VM_READ))
-               return (vma->vm_flags & VM_IO) ? -EPERM : -EACCES;

+again:
+       err = fast_gup(address, 1, 0, &page);
+       if (err < 0)
+               return err;
+
+       lock_page(page);
+       if (!page->mapping) { /* PageAnon pages shouldn't get caught here */
+               unlock_page(page);
+               put_page(page);
+               goto again;
+       }
+
        /*
         * Private mappings are handled in a simple way.
         *
@@ -240,38 +238,19 @@ static int get_futex_key(u32 __user *uad
         * VM_MAYSHARE here, not VM_SHARED which is restricted to shared
         * mappings of _writable_ handles.
         */
-       if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
-               key->both.offset |= FUT_OFF_MMSHARED; /* reference taken on mm 
*
/
+       if (PageAnon(page)) {
+               key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
                key->private.mm = mm;
                key->private.address = address;
-               return 0;
-       }
-
-       /*
-        * Linear file mappings are also simple.
-        */
-       key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
-       key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
-       if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
-               key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
-                                    + vma->vm_pgoff);
-               return 0;
-       }
-
-       /*
-        * We could walk the page table to read the non-linear
-        * pte, and get the page index without fetching the page
-        * from swap.  But that's a lot of code to duplicate here
-        * for a rare case, so we simply fetch the page.
-        */
-       err = get_user_pages(current, mm, address, 1, 0, 0, &page, NULL);
-       if (err >= 0) {
-               key->shared.pgoff =
-                       page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
-               put_page(page);
-               return 0;
+       } else {
+               key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
+               key->shared.inode = page->mapping->inode;
+               key->shared.pgoff = page->index;
        }
-       return err;
+out:
+       unlock_page(page);
+       put_page(page);
+       return 0;
 }

 /*

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 1/2] futex: rely on get_user_pages() for shared futexes
  2008-04-08 11:40   ` Nick Piggin
@ 2008-04-08 16:59     ` Peter Zijlstra
  2008-04-09  2:32       ` Nick Piggin
  2008-04-09 13:51     ` Peter Zijlstra
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2008-04-08 16:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-mm

On Tue, 2008-04-08 at 21:40 +1000, Nick Piggin wrote:
> On Saturday 05 April 2008 06:33, Peter Zijlstra wrote:
> > On the way of getting rid of the mmap_sem requirement for shared futexes,
> > start by relying on get_user_pages().
> >
> > This requires we get the page associated with the key, and put the page
> > when we're done with it.
> 
> Hi Peter,
> 
> Cool.
> 
> I'm all for removing mmap_sem requirement from shared futexes...
> Are there many apps which make a non-trivial use of them I wonder?

No idea. I've heard some stories, but I've never seen any code..

> I guess it will help legacy (pre-FUTEX_PRIVATE) usespaces in
> performance too, though.

Yeah, that was one of the motivations for this patch.

> What I'm worried about with this is invalidate or truncate races.
> With direct IO, it obviously doesn't matter because the only
> requirement is that the page existed at the address at some point
> during the syscall... 
> 
> So I'd really like you to not carry the page around in the key, but
> just continue using the same key we have now. Also, lock the page
> and ensure it hasn't been truncated before taking the inode from the
> key and incrementing its count (page lock's extra atomics should be
> more or less cancelled out by fewer mmap_sem atomic ops).
> 
> get_futex_key should look something like this I would have thought:?

Does look nicer, will have to ponder it a bit though.

I must admit to not fully understanding why we take inode/mm references
in the key anyway as neither will stop unmapping the futex.

Will make this into a nice patch.. thanks!

> BTW. I like that it removes a lot of fshared crap from around
> the place. And also this is a really good user of fast_gup
> because I guess it should usually be faulted in. The problem is
> that this could be a little more expensive for architectures that
> don't implement fast_gup. Though most should be able to.

Yeah, if that really becomes a problem (but I doubt it will) we could
possibly make the old and new scheme work based on ARCH_HAVE_PTE_SPECIAL
or something ugly like that.

> @@ -191,7 +191,6 @@ static int get_futex_key(u32 __user *uad
>  {
>         unsigned long address = (unsigned long)uaddr;
>         struct mm_struct *mm = current->mm;
> -       struct vm_area_struct *vma;
>         struct page *page;
>         int err;
> 
> @@ -210,27 +209,26 @@ static int get_futex_key(u32 __user *uad
>          * Note : We do have to check 'uaddr' is a valid user address,
>          *        but access_ok() should be faster than find_vma()
>          */
> -       if (!fshared) {
> +       if (likely(!fshared)) {
>                 if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
>                         return -EFAULT;
>                 key->private.mm = mm;
>                 key->private.address = address;
>                 return 0;
>         }
> -       /*
> -        * The futex is hashed differently depending on whether
> -        * it's in a shared or private mapping.  So check vma first.
> -        */
> -       vma = find_extend_vma(mm, address);
> -       if (unlikely(!vma))
> -               return -EFAULT;
> -
> -       /*
> -        * Permissions.
> -        */
> -       if (unlikely((vma->vm_flags & (VM_IO|VM_READ)) != VM_READ))
> -               return (vma->vm_flags & VM_IO) ? -EPERM : -EACCES;
> 
> +again:
> +       err = fast_gup(address, 1, 0, &page);
> +       if (err < 0)
> +               return err;
> +
> +       lock_page(page);
> +       if (!page->mapping) { /* PageAnon pages shouldn't get caught here */
> +               unlock_page(page);
> +               put_page(page);
> +               goto again;
> +       }
> +
>         /*
>          * Private mappings are handled in a simple way.
>          *
> @@ -240,38 +238,19 @@ static int get_futex_key(u32 __user *uad
>          * VM_MAYSHARE here, not VM_SHARED which is restricted to shared
>          * mappings of _writable_ handles.
>          */
> -       if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
> -               key->both.offset |= FUT_OFF_MMSHARED; /* reference taken on mm 
> *
> /
> +       if (PageAnon(page)) {
> +               key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
>                 key->private.mm = mm;
>                 key->private.address = address;
> -               return 0;
> -       }
> -
> -       /*
> -        * Linear file mappings are also simple.
> -        */
> -       key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
> -       key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
> -       if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
> -               key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
> -                                    + vma->vm_pgoff);
> -               return 0;
> -       }
> -
> -       /*
> -        * We could walk the page table to read the non-linear
> -        * pte, and get the page index without fetching the page
> -        * from swap.  But that's a lot of code to duplicate here
> -        * for a rare case, so we simply fetch the page.
> -        */
> -       err = get_user_pages(current, mm, address, 1, 0, 0, &page, NULL);
> -       if (err >= 0) {
> -               key->shared.pgoff =
> -                       page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> -               put_page(page);
> -               return 0;
> +       } else {
> +               key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
> +               key->shared.inode = page->mapping->inode;
> +               key->shared.pgoff = page->index;
>         }
> -       return err;
> +out:
> +       unlock_page(page);
> +       put_page(page);
> +       return 0;
>  }
> 
>  /*
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 1/2] futex: rely on get_user_pages() for shared futexes
  2008-04-08 16:59     ` Peter Zijlstra
@ 2008-04-09  2:32       ` Nick Piggin
  0 siblings, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2008-04-09  2:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric Dumazet, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-mm

On Wednesday 09 April 2008 02:59, Peter Zijlstra wrote:
> On Tue, 2008-04-08 at 21:40 +1000, Nick Piggin wrote:
> > On Saturday 05 April 2008 06:33, Peter Zijlstra wrote:
> > > On the way of getting rid of the mmap_sem requirement for shared
> > > futexes, start by relying on get_user_pages().
> > >
> > > This requires we get the page associated with the key, and put the page
> > > when we're done with it.
> >
> > Hi Peter,
> >
> > Cool.
> >
> > I'm all for removing mmap_sem requirement from shared futexes...
> > Are there many apps which make a non-trivial use of them I wonder?
>
> No idea. I've heard some stories, but I've never seen any code..
>
> > I guess it will help legacy (pre-FUTEX_PRIVATE) usespaces in
> > performance too, though.
>
> Yeah, that was one of the motivations for this patch.
>
> > What I'm worried about with this is invalidate or truncate races.
> > With direct IO, it obviously doesn't matter because the only
> > requirement is that the page existed at the address at some point
> > during the syscall...
> >
> > So I'd really like you to not carry the page around in the key, but
> > just continue using the same key we have now. Also, lock the page
> > and ensure it hasn't been truncated before taking the inode from the
> > key and incrementing its count (page lock's extra atomics should be
> > more or less cancelled out by fewer mmap_sem atomic ops).
> >
> > get_futex_key should look something like this I would have thought:?
>
> Does look nicer, will have to ponder it a bit though.
>
> I must admit to not fully understanding why we take inode/mm references
> in the key anyway as neither will stop unmapping the futex.

I guess that's the problem; if we unmap the futex then we might
free the inode without a ref.


> Will make this into a nice patch.. thanks!
>
> > BTW. I like that it removes a lot of fshared crap from around
> > the place. And also this is a really good user of fast_gup
> > because I guess it should usually be faulted in. The problem is
> > that this could be a little more expensive for architectures that
> > don't implement fast_gup. Though most should be able to.
>
> Yeah, if that really becomes a problem (but I doubt it will) we could
> possibly make the old and new scheme work based on ARCH_HAVE_PTE_SPECIAL
> or something ugly like that.

Yeah, hopefully we won't have to worry. Technically this is a slowish
path anyway, so while heavy global contention and even sleeping on
mmap_sem may be a problem, I doubt the extra atomic or so would be
out of the noise.

Anyway, again thanks for doing this patch. What would be really nice
in order to get some testing concurrently while I'm trying to get
fast_gup in, is to make patch 2 which is exactly the same but it
still uses get_user_pages (ie. it just moves the mmap_sem call right
around the get_user_pages site). Then patch 3 can just remove those
3 lines and replace them with a call to fast_gup. Does that make sense?

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 1/2] futex: rely on get_user_pages() for shared futexes
  2008-04-08 11:40   ` Nick Piggin
  2008-04-08 16:59     ` Peter Zijlstra
@ 2008-04-09 13:51     ` Peter Zijlstra
  1 sibling, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2008-04-09 13:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-mm

On Tue, 2008-04-08 at 21:40 +1000, Nick Piggin wrote:

> @@ -191,7 +191,6 @@ static int get_futex_key(u32 __user *uad
>  {
>         unsigned long address = (unsigned long)uaddr;
>         struct mm_struct *mm = current->mm;
> -       struct vm_area_struct *vma;
>         struct page *page;
>         int err;
> 
> @@ -210,27 +209,26 @@ static int get_futex_key(u32 __user *uad
>          * Note : We do have to check 'uaddr' is a valid user address,
>          *        but access_ok() should be faster than find_vma()
>          */
> -       if (!fshared) {
> +       if (likely(!fshared)) {
>                 if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
>                         return -EFAULT;
>                 key->private.mm = mm;
>                 key->private.address = address;
>                 return 0;
>         }
> -       /*
> -        * The futex is hashed differently depending on whether
> -        * it's in a shared or private mapping.  So check vma first.
> -        */
> -       vma = find_extend_vma(mm, address);
> -       if (unlikely(!vma))
> -               return -EFAULT;
> -
> -       /*
> -        * Permissions.
> -        */
> -       if (unlikely((vma->vm_flags & (VM_IO|VM_READ)) != VM_READ))
> -               return (vma->vm_flags & VM_IO) ? -EPERM : -EACCES;
> 
> +again:
> +       err = fast_gup(address, 1, 0, &page);
> +       if (err < 0)
> +               return err;
> +
> +       lock_page(page);
> +       if (!page->mapping) { /* PageAnon pages shouldn't get caught here */
> +               unlock_page(page);
> +               put_page(page);
> +               goto again;
> +       }
> +
>         /*
>          * Private mappings are handled in a simple way.
>          *
> @@ -240,38 +238,19 @@ static int get_futex_key(u32 __user *uad
>          * VM_MAYSHARE here, not VM_SHARED which is restricted to shared
>          * mappings of _writable_ handles.
>          */
> -       if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
> -               key->both.offset |= FUT_OFF_MMSHARED; /* reference taken on mm 
> *
> /
> +       if (PageAnon(page)) {
> +               key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
>                 key->private.mm = mm;
>                 key->private.address = address;
> -               return 0;
> -       }
> -
> -       /*
> -        * Linear file mappings are also simple.
> -        */
> -       key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
> -       key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
> -       if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
> -               key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
> -                                    + vma->vm_pgoff);
> -               return 0;
> -       }
> -
> -       /*
> -        * We could walk the page table to read the non-linear
> -        * pte, and get the page index without fetching the page
> -        * from swap.  But that's a lot of code to duplicate here
> -        * for a rare case, so we simply fetch the page.
> -        */
> -       err = get_user_pages(current, mm, address, 1, 0, 0, &page, NULL);
> -       if (err >= 0) {
> -               key->shared.pgoff =
> -                       page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> -               put_page(page);
> -               return 0;
> +       } else {
> +               key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
> +               key->shared.inode = page->mapping->inode;
> +               key->shared.pgoff = page->index;
>         }
> -       return err;
> +out:
> +       unlock_page(page);
> +       put_page(page);
> +       return 0;
>  }

Right, so staring at this for a while makes me feel uneasy. The thing
is, once we exit get_futex_key() there is nothing stopping the mm of
inode from going away.

So I'm going to have to take a ref while we have the page locked, and do
the put_futex_key() thing to release it.

Code at (uncompiled and untested):
http://programming.kicks-ass.net/kernel-patches/futex-fast_gup/v2.6.24.4-rt4-2/

The thing is, it now needs to take a reference on a rather large object
(mm/inode), giving a rather large opportunity to bounce that cacheline.

What problems do you see with just keeping a ref on the page?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2008-04-09 13:51 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-04-04 19:33 [RFC PATCH 0/2] fast_gup for shared futexes Peter Zijlstra
2008-04-04 19:33 ` [RFC PATCH 1/2] futex: rely on get_user_pages() " Peter Zijlstra
2008-04-08 11:40   ` Nick Piggin
2008-04-08 16:59     ` Peter Zijlstra
2008-04-09  2:32       ` Nick Piggin
2008-04-09 13:51     ` Peter Zijlstra
2008-04-04 19:33 ` [RFC PATCH 2/2] futex: use fast_gup() Peter Zijlstra
2008-04-04 19:47   ` Peter Zijlstra
2008-04-04 19:56 ` [RFC PATCH 0/2] fast_gup for shared futexes Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox