* [PATCH] low-latency zap_page_range()
From: Robert Love @ 2002-08-29 15:31 UTC
To: akpm; +Cc: linux-kernel, linux-mm
Andrew,
Attached patch implements a low latency version of "zap_page_range()".
Calls with even moderately large page ranges result in very long lock
held times and consequently very long periods of non-preemptibility.
This function is in my list of the top 3 worst offenders. It is gross.
This new version reimplements zap_page_range() as a loop over
ZAP_BLOCK_SIZE chunks. After each iteration, if a reschedule is
pending, we drop page_table_lock and automagically preempt. Note we
cannot blindly drop the lock and reschedule (e.g. for the non-preempt
case), since this codepath can be entered while other locks are held.
... I am sure you are familiar with all this; it's the same deal as your
low-latency work. This patch implements the "cond_resched_lock()" we
discussed some time back. I think this solution should be acceptable to
you and Linus.
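To make the concern concrete (hypothetical caller, purely for
illustration; nothing in the tree is being singled out here):

	spin_lock(&some_other_lock);		/* preempt_count() is now 1 */
	zap_page_range(vma, address, size);	/* page_table_lock makes it 2,
						 * so cond_resched_lock() sees
						 * preempt_count() != 1 and
						 * correctly does nothing */
	spin_unlock(&some_other_lock);

Blindly unlocking and calling schedule() in that situation would mean
sleeping while still atomic; the preempt_count() == 1 test is exactly
what catches it.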
There are other misc. cleanups, too.
This new zap_page_range() yields latency too-low-to-benchmark: <<1ms.
Please, Andrew, add this to your ever-growing list.
Robert Love
diff -urN linux-2.5.32/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.5.32/include/linux/sched.h Tue Aug 27 15:26:34 2002
+++ linux/include/linux/sched.h Wed Aug 28 18:04:41 2002
@@ -898,6 +898,34 @@
__cond_resched();
}
+#ifdef CONFIG_PREEMPT
+
+/*
+ * cond_resched_lock() - if a reschedule is pending, drop the given lock,
+ * call schedule, and on return reacquire the lock.
+ *
+ * Note: this does not assume the given lock is the _only_ lock held.
+ * The kernel preemption counter gives us "free" checking that we are
+ * atomic -- let's use it.
+ */
+static inline void cond_resched_lock(spinlock_t * lock)
+{
+ if (need_resched() && preempt_count() == 1) {
+ _raw_spin_unlock(lock);
+ preempt_enable_no_resched();
+ __cond_resched();
+ spin_lock(lock);
+ }
+}
+
+#else
+
+static inline void cond_resched_lock(spinlock_t * lock)
+{
+}
+
+#endif
+
/* Reevaluate whether the task has signals pending delivery.
This is required every time the blocked sigset_t changes.
Athread cathreaders should have t->sigmask_lock. */
diff -urN linux-2.5.32/mm/memory.c linux/mm/memory.c
--- linux-2.5.32/mm/memory.c Tue Aug 27 15:26:42 2002
+++ linux/mm/memory.c Wed Aug 28 18:03:11 2002
@@ -389,8 +389,8 @@
{
pgd_t * dir;
- if (address >= end)
- BUG();
+ BUG_ON(address >= end);
+
dir = pgd_offset(vma->vm_mm, address);
tlb_start_vma(tlb, vma);
do {
@@ -401,30 +401,43 @@
tlb_end_vma(tlb, vma);
}
-/*
- * remove user pages in a given range.
+#define ZAP_BLOCK_SIZE (256 * PAGE_SIZE) /* how big a chunk we loop over */
+
+/**
+ * zap_page_range - remove user pages in a given range
+ * @vma: vm_area_struct holding the applicable pages
+ * @address: starting address of pages to zap
+ * @size: number of bytes to zap
*/
void zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size)
{
struct mm_struct *mm = vma->vm_mm;
mmu_gather_t *tlb;
- unsigned long start = address, end = address + size;
+ unsigned long end, block;
- /*
- * This is a long-lived spinlock. That's fine.
- * There's no contention, because the page table
- * lock only protects against kswapd anyway, and
- * even if kswapd happened to be looking at this
- * process we _want_ it to get stuck.
- */
- if (address >= end)
- BUG();
spin_lock(&mm->page_table_lock);
- flush_cache_range(vma, address, end);
- tlb = tlb_gather_mmu(mm, 0);
- unmap_page_range(tlb, vma, address, end);
- tlb_finish_mmu(tlb, start, end);
+ /*
+ * This was once a long-held spinlock. Now we break the
+ * work up into ZAP_BLOCK_SIZE units and relinquish the
+ * lock after each iteration. This drastically lowers
+ * lock contention and allows for a preemption point.
+ */
+ while (size) {
+ block = (size > ZAP_BLOCK_SIZE) ? ZAP_BLOCK_SIZE : size;
+ end = address + block;
+
+ flush_cache_range(vma, address, end);
+ tlb = tlb_gather_mmu(mm, 0);
+ unmap_page_range(tlb, vma, address, end);
+ tlb_finish_mmu(tlb, address, end);
+
+ cond_resched_lock(&mm->page_table_lock);
+
+ address += block;
+ size -= block;
+ }
+
spin_unlock(&mm->page_table_lock);
}
* Re: [PATCH] low-latency zap_page_range()
From: Andrew Morton @ 2002-08-29 20:30 UTC
To: Robert Love; +Cc: linux-kernel, linux-mm
Robert Love wrote:
>
> Andrew,
>
> Attached patch implements a low latency version of "zap_page_range()".
>
This doesn't quite do the right thing on SMP.
Note that pages which are to be torn down are buffered in the
mmu_gather_t array. The kernel throws away 507 pages at a
time - this is to reduce the frequency of global TLB invalidations.
(The 507 is, I assume, designed to make the mmu_gather_t be
2048 bytes in size. I recently broke that math, and need to fix
it up).
However with your change, we'll only ever put 256 pages into the
mmu_gather_t. Half of that thing's buffer is unused and the
invalidation rate will be doubled during teardown of large
address ranges.
I suggest that you make ZAP_BLOCK_SIZE be equal to FREE_PTE_NR on
SMP, and 256 on UP.
(We could get fancier and do something like:
	tlb = tlb_gather_mmu(mm, 0);
	while (size) {
		...
		unmap_page_range(ZAP_BLOCK_SIZE pages);
		tlb_flush_mmu(...);
		cond_resched_lock();
	}
	tlb_finish_mmu(...);
	spin_unlock(page_table_lock);
but I don't think that passes the benefit-versus-complexity test.)
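Spelled out for reference, that would look roughly like the following
(illustration only; it assumes tlb_flush_mmu() takes the same
(tlb, start, end) arguments as tlb_finish_mmu(), and saves the starting
address for the final flush):

	unsigned long start = address;

	spin_lock(&mm->page_table_lock);
	tlb = tlb_gather_mmu(mm, 0);
	while (size) {
		block = (size > ZAP_BLOCK_SIZE) ? ZAP_BLOCK_SIZE : size;
		end = address + block;

		flush_cache_range(vma, address, end);
		unmap_page_range(tlb, vma, address, end);
		tlb_flush_mmu(tlb, address, end);	/* flush, keep the gather */

		cond_resched_lock(&mm->page_table_lock);

		address += block;
		size -= block;
	}
	tlb_finish_mmu(tlb, start, address);
	spin_unlock(&mm->page_table_lock);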
Also, if the kernel is not compiled for preemption then we're
doing a little bit of extra work to no advantage, yes? We can
avoid doing that by setting ZAP_BLOCK_SIZE to infinity.
How does this altered version look? All I changed was the ZAP_BLOCK_SIZE
initialisation.
--- 2.5.32/include/linux/sched.h~llzpr Thu Aug 29 13:01:01 2002
+++ 2.5.32-akpm/include/linux/sched.h Thu Aug 29 13:01:01 2002
@@ -907,6 +907,34 @@ static inline void cond_resched(void)
__cond_resched();
}
+#ifdef CONFIG_PREEMPT
+
+/*
+ * cond_resched_lock() - if a reschedule is pending, drop the given lock,
+ * call schedule, and on return reacquire the lock.
+ *
+ * Note: this does not assume the given lock is the _only_ lock held.
+ * The kernel preemption counter gives us "free" checking that we are
+ * atomic -- let's use it.
+ */
+static inline void cond_resched_lock(spinlock_t * lock)
+{
+ if (need_resched() && preempt_count() == 1) {
+ _raw_spin_unlock(lock);
+ preempt_enable_no_resched();
+ __cond_resched();
+ spin_lock(lock);
+ }
+}
+
+#else
+
+static inline void cond_resched_lock(spinlock_t * lock)
+{
+}
+
+#endif
+
/* Reevaluate whether the task has signals pending delivery.
This is required every time the blocked sigset_t changes.
Athread cathreaders should have t->sigmask_lock. */
--- 2.5.32/mm/memory.c~llzpr Thu Aug 29 13:01:01 2002
+++ 2.5.32-akpm/mm/memory.c Thu Aug 29 13:26:21 2002
@@ -389,8 +389,8 @@ void unmap_page_range(mmu_gather_t *tlb,
{
pgd_t * dir;
- if (address >= end)
- BUG();
+ BUG_ON(address >= end);
+
dir = pgd_offset(vma->vm_mm, address);
tlb_start_vma(tlb, vma);
do {
@@ -401,30 +401,53 @@ void unmap_page_range(mmu_gather_t *tlb,
tlb_end_vma(tlb, vma);
}
-/*
- * remove user pages in a given range.
+#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT)
+#define ZAP_BLOCK_SIZE (FREE_PTE_NR * PAGE_SIZE)
+#endif
+
+#if !defined(CONFIG_SMP) && defined(CONFIG_PREEMPT)
+#define ZAP_BLOCK_SIZE (256 * PAGE_SIZE)
+#endif
+
+#if !defined(CONFIG_PREEMPT)
+#define ZAP_BLOCK_SIZE (~(0UL))
+#endif
+
+/**
+ * zap_page_range - remove user pages in a given range
+ * @vma: vm_area_struct holding the applicable pages
+ * @address: starting address of pages to zap
+ * @size: number of bytes to zap
*/
void zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size)
{
struct mm_struct *mm = vma->vm_mm;
mmu_gather_t *tlb;
- unsigned long start = address, end = address + size;
+ unsigned long end, block;
- /*
- * This is a long-lived spinlock. That's fine.
- * There's no contention, because the page table
- * lock only protects against kswapd anyway, and
- * even if kswapd happened to be looking at this
- * process we _want_ it to get stuck.
- */
- if (address >= end)
- BUG();
spin_lock(&mm->page_table_lock);
- flush_cache_range(vma, address, end);
- tlb = tlb_gather_mmu(mm, 0);
- unmap_page_range(tlb, vma, address, end);
- tlb_finish_mmu(tlb, start, end);
+ /*
+ * This was once a long-held spinlock. Now we break the
+ * work up into ZAP_BLOCK_SIZE units and relinquish the
+ * lock after each iteration. This drastically lowers
+ * lock contention and allows for a preemption point.
+ */
+ while (size) {
+ block = (size > ZAP_BLOCK_SIZE) ? ZAP_BLOCK_SIZE : size;
+ end = address + block;
+
+ flush_cache_range(vma, address, end);
+ tlb = tlb_gather_mmu(mm, 0);
+ unmap_page_range(tlb, vma, address, end);
+ tlb_finish_mmu(tlb, address, end);
+
+ cond_resched_lock(&mm->page_table_lock);
+
+ address += block;
+ size -= block;
+ }
+
spin_unlock(&mm->page_table_lock);
}
* Re: [PATCH] low-latency zap_page_range()
From: Robert Love @ 2002-08-29 20:40 UTC
To: Andrew Morton; +Cc: linux-kernel, linux-mm
On Thu, 2002-08-29 at 16:30, Andrew Morton wrote:
> However with your change, we'll only ever put 256 pages into the
> mmu_gather_t. Half of that thing's buffer is unused and the
> invalidation rate will be doubled during teardown of large
> address ranges.
Agreed. Go for it.
Hm - or, since 507 vs. 256 is not the end of the world and latency is
already low, we could just make it always (FREE_PTE_NR * PAGE_SIZE)...
As long as "cond_resched_lock()" is a preempt-only thing, I also agree
with making ZAP_BLOCK_SIZE ~0 on !CONFIG_PREEMPT - unless we want to
unconditionally drop the lock, let preempt just do the right thing, and
also reduce lock contention in the SMP case.
Robert Love
* Re: [PATCH] low-latency zap_page_range()
From: Robert Love @ 2002-08-29 20:46 UTC
To: Andrew Morton; +Cc: linux-kernel, linux-mm
On Thu, 2002-08-29 at 16:40, Robert Love wrote:
> On Thu, 2002-08-29 at 16:30, Andrew Morton wrote:
>
> > However with your change, we'll only ever put 256 pages into the
> > mmu_gather_t. Half of that thing's buffer is unused and the
> > invalidation rate will be doubled during teardown of large
> > address ranges.
>
> Agreed. Go for it.
Oh and put a comment in there explaining what you just said to me :)
Robert Love
* Re: [PATCH] low-latency zap_page_range()
From: Andrew Morton @ 2002-08-29 20:59 UTC
To: Robert Love; +Cc: linux-kernel, linux-mm
Robert Love wrote:
>
> ...
> unless we
> wanted to unconditionally drop the locks and let preempt just do the
> right thing and also reduce SMP lock contention in the SMP case.
That's an interesting point. page_table_lock is one of those locks
which is occasionally held for ages, and frequently held for a short
time.
I suspect that yes, voluntarily popping the lock during the long hold times
will allow other CPUs to get on with stuff, and will provide efficiency
increases. (It's a pretty lame way of doing that though).
But I don't recall seeing nasty page_table_lock spintimes on
anyone's lockmeter reports, so...
* Re: [PATCH] low-latency zap_page_range()
From: William Lee Irwin III @ 2002-08-29 21:38 UTC
To: Andrew Morton; +Cc: Robert Love, linux-kernel, linux-mm
Robert Love wrote:
>> unless we
>> wanted to unconditionally drop the locks and let preempt just do the
>> right thing and also reduce SMP lock contention in the SMP case.
On Thu, Aug 29, 2002 at 01:59:17PM -0700, Andrew Morton wrote:
> That's an interesting point. page_table_lock is one of those locks
> which is occasionally held for ages, and frequently held for a short
> time.
> I suspect that yes, voluntarily popping the lock during the long holdtimes
> will allow other CPUs to get on with stuff, and will provide efficiency
> increases. (It's a pretty lame way of doing that though).
> But I don't recall seeing nasty page_table_lock spintimes on
> anyone's lockmeter reports, so...
You will. There are just bigger fish to fry at the moment.
Cheers,
Bill
* Re: [PATCH] low-latency zap_page_range()
From: Andrew Morton @ 2002-08-29 21:00 UTC
To: Robert Love; +Cc: linux-kernel, linux-mm
Robert Love wrote:
>
> ...
> unless we
> wanted to unconditionally drop the locks and let preempt just do the
> right thing and also reduce SMP lock contention in the SMP case.
That's an interesting point. page_table_lock is one of those locks
which is occasionally held for ages, and frequently held for a short
time.
I suspect that yes, voluntarily popping the lock during the long hold times
will allow other CPUs to get on with stuff, and will provide efficiency
increases. (It's a pretty lame way of doing that though).
But I don't recall seeing nasty page_table_lock spintimes on
anyone's lockmeter reports, so we can leave it as-is for now.
* Re: [PATCH] low-latency zap_page_range()
From: Robert Love @ 2002-08-29 21:12 UTC
To: Andrew Morton; +Cc: linux-kernel, linux-mm
On Thu, 2002-08-29 at 17:00, Andrew Morton wrote:
> That's an interesting point. page_table_lock is one of those locks
> which is occasionally held for ages, and frequently held for a short
> time.
Since latency is a direct function of lock held times in the preemptible
kernel, and I am seeing disgusting zap_page_range() latencies, the lock
is held a long time.
So we know it is held forever and a day... but is there contention?
> But I don't recall seeing nasty page_table_lock spintimes on
> anyone's lockmeter reports, so we can leave it as-is for now.
I do not recall seeing this either and I have not done my own tests.
Personally, I would love to rip out the "cond_resched_lock()" and just
do
spin_unlock();
spin_lock();
and be done with it. This gives automatic preemption support and the
SMP benefit. Preemption being an "automatic" consequence of improved
locking was always my selling point (granted, this is a gross example of
improving the locking, but it gets the job done).
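Concretely, the tail of the while loop would then just read (sketch
only, not a patch):

		/*
		 * Unconditionally break the lock after each chunk;
		 * preempt and any spinning CPU get a shot at
		 * page_table_lock here.
		 */
		spin_unlock(&mm->page_table_lock);
		spin_lock(&mm->page_table_lock);

		address += block;
		size -= block;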
But, the current implementation was more palatable to you and Linus when
I first posted this, and that counts for something.
Robert Love
* Re: [PATCH] low-latency zap_page_range()
From: Andrew Morton @ 2002-08-29 21:22 UTC
To: Robert Love; +Cc: linux-kernel, linux-mm
Robert Love wrote:
>
> On Thu, 2002-08-29 at 17:00, Andrew Morton wrote:
>
> > That's an interesting point. page_table_lock is one of those locks
> > which is occasionally held for ages, and frequently held for a short
> > time.
>
> Since latency is a direct function of lock held times in the preemptible
> kernel, and I am seeing disgusting zap_page_range() latencies, the lock
> is held a long time.
>
> So we know it is held forever and a day... but is there contention?
I'm sure there is, but nobody has measured the right workload.
Two CLONE_VM threads, one running mmap()/munmap(), the other trying
to fault in some pages. I'm sure someone has some vital application
which does exactly this. They always do :(
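If someone wants to measure it, something along these lines would do
(hypothetical test program, not benchmarked by anyone here; pthreads
share the mm just like a CLONE_VM clone does):

	/*
	 * Two threads, one mm: one tears down big mappings (munmap ->
	 * zap_page_range() under page_table_lock), the other takes page
	 * faults which also need page_table_lock (and mmap_sem).
	 */
	#include <pthread.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define LEN (64UL << 20)	/* 64MB per iteration */

	static void *unmapper(void *arg)
	{
		for (;;) {
			char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p == MAP_FAILED)
				continue;
			memset(p, 1, LEN);	/* populate: give it work to zap */
			munmap(p, LEN);		/* -> zap_page_range() */
		}
		return NULL;
	}

	static void *faulter(void *arg)
	{
		long pagesize = sysconf(_SC_PAGESIZE);
		unsigned long off;

		for (;;) {
			char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p == MAP_FAILED)
				continue;
			for (off = 0; off < LEN; off += pagesize)
				p[off] = 1;	/* minor faults */
			munmap(p, LEN);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b;

		pthread_create(&a, NULL, unmapper, NULL);
		pthread_create(&b, NULL, faulter, NULL);
		pthread_join(a, NULL);	/* runs forever; watch the lockmeter */
		return 0;
	}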
* Re: [PATCH] low-latency zap_page_range()
From: Rik van Riel @ 2002-08-29 21:46 UTC
To: Andrew Morton; +Cc: Robert Love, linux-kernel, linux-mm
On Thu, 29 Aug 2002, Andrew Morton wrote:
> > So we know it is held forever and a day... but is there contention?
>
> I'm sure there is, but nobody has measured the right workload.
>
> Two CLONE_MM threads, one running mmap()/munmap(), the other trying
> to fault in some pages. I'm sure someone has some vital application
> which does exactly this. They always do :(
Can't fix this one. The mmap()/munmap() side needs to hold the
mmap_sem for writing as long as it's setting up or tearing down a
VMA, while the pagefault path takes the mmap_sem for reading.
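Roughly, paraphrasing the two paths (not exact source):

	sys_munmap():
		down_write(&mm->mmap_sem);	/* exclusive */
		do_munmap() -> ... -> zap_page_range()
		up_write(&mm->mmap_sem);

	page fault:
		down_read(&mm->mmap_sem);	/* sleeps while the unmapper
						   holds it for writing */
		handle_mm_fault(...)
		up_read(&mm->mmap_sem);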
It might be fixable in some dirty way, but I doubt that'll
ever be worth it.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/