linux-mm.kvack.org archive mirror
* [PATCH] lazy freeing of memory through MADV_FREE
@ 2007-04-17  7:15 Rik van Riel
  2007-04-19 21:15 ` [PATCH] lazy freeing of memory through MADV_FREE 2/2 Rik van Riel
                   ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-17  7:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1024 bytes --]

Make it possible for applications to have the kernel free memory
lazily.  This reduces a repeated free/malloc cycle from freeing
pages and allocating them, to just marking them freeable.  If the
application wants to reuse them before the kernel needs the memory,
not even a page fault will happen.
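From the application side, the free-then-reuse cycle described above might look roughly like the following. This is a minimal sketch, not code from the patch or from glibc: the helper name lazy_free_reuse_demo is made up for illustration, and it assumes the installed headers define MADV_FREE (the call is compiled out otherwise).

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper, not part of the patch: map an anonymous region,
 * fill it, hand it back lazily, then reuse it.
 * Returns the byte read back after reuse, or -1 if the mapping failed.
 */
int lazy_free_reuse_demo(size_t len)
{
	unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int result;

	if (buf == MAP_FAILED)
		return -1;

	buf[0] = 0x11;		/* "old" data the program no longer needs */

#ifdef MADV_FREE
	/*
	 * Assumption: the headers define MADV_FREE.  The result is ignored
	 * because the advice is only an optimisation in this sketch.
	 */
	(void)madvise(buf, len, MADV_FREE);
#endif

	/* From here on buf[0] may read as 0x11 or as 0; do not rely on it. */
	buf[0] = 0x22;		/* reuse: the write dirties the page again */
	result = buf[0];

	munmap(buf, len);
	return result;
}
```

The point of the sketch is the ordering: nothing may be assumed about the old contents after the advice, and only data written afterwards is guaranteed to be read back.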

This patch, together with Ulrich's glibc change, increases
MySQL sysbench performance by a factor of 2 on my quad core
test system.

Signed-off-by: Rik van Riel <riel@redhat.com>

---
Ulrich Drepper has test glibc RPMS for this functionality at:

     http://people.redhat.com/drepper/rpms

Andrew, I have stress tested this patch for a few days now and
have not been able to find any more bugs.  I believe it is ready
to be merged in -mm, and upstream at the next merge window.

When the patch goes upstream, I will submit a small follow-up
patch to revert MADV_DONTNEED behaviour to what it did previously
and have the new behaviour trigger only on MADV_FREE: at that
point people will have to get new test RPMs of glibc.


[-- Attachment #2: linux-2.6.21-rc6-mm1-madv_free.patch --]
[-- Type: text/x-patch, Size: 11514 bytes --]

--- linux-2.6.21-rc6-mm1/include/asm-parisc/mman.h.madv_free	2007-04-17 02:17:19.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-parisc/mman.h	2007-04-17 02:22:46.000000000 -0400
@@ -38,6 +38,7 @@
 #define MADV_SPACEAVAIL 5               /* insure that resources are reserved */
 #define MADV_VPS_PURGE  6               /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7              /* Inherit parents page size */
+#define MADV_FREE	8		/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/asm-mips/mman.h.madv_free	2007-04-17 02:17:19.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-mips/mman.h	2007-04-17 02:22:46.000000000 -0400
@@ -65,6 +65,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/asm-xtensa/mman.h.madv_free	2007-04-17 02:17:19.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-xtensa/mman.h	2007-04-17 02:22:46.000000000 -0400
@@ -72,6 +72,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/linux/swap.h.madv_free	2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/linux/swap.h	2007-04-17 02:22:46.000000000 -0400
@@ -182,6 +182,7 @@ extern void FASTCALL(lru_cache_add(struc
 extern void FASTCALL(lru_cache_add_active(struct page *));
 extern void FASTCALL(lru_cache_add_tail(struct page *));
 extern void FASTCALL(activate_page(struct page *));
+extern void FASTCALL(deactivate_tail_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
 extern int lru_add_drain_all(void);
--- linux-2.6.21-rc6-mm1/include/linux/mm.h.madv_free	2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/linux/mm.h	2007-04-17 02:22:46.000000000 -0400
@@ -767,6 +767,7 @@ struct zap_details {
 	pgoff_t last_index;			/* Highest page->index to unmap */
 	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
+	short madv_free;			/* MADV_FREE anonymous memory */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
--- linux-2.6.21-rc6-mm1/include/linux/page-flags.h.madv_free	2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/linux/page-flags.h	2007-04-17 02:23:16.000000000 -0400
@@ -91,6 +91,7 @@
 #define PG_booked		20	/* Has blocks reserved on-disk */
 
 #define PG_readahead		21	/* Reminder to do read-ahead */
+#define PG_lazyfree		22	/* MADV_FREE potential throwaway */
 
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
@@ -216,6 +217,11 @@ static inline void SetPageUptodate(struc
 #define ClearPageReclaim(page)	clear_bit(PG_reclaim, &(page)->flags)
 #define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags)
 
+#define PageLazyFree(page)	test_bit(PG_lazyfree, &(page)->flags)
+#define SetPageLazyFree(page)	set_bit(PG_lazyfree, &(page)->flags)
+#define ClearPageLazyFree(page)	clear_bit(PG_lazyfree, &(page)->flags)
+#define __ClearPageLazyFree(page) __clear_bit(PG_lazyfree, &(page)->flags)
+
 #define PageCompound(page)	test_bit(PG_compound, &(page)->flags)
 #define __SetPageCompound(page)	__set_bit(PG_compound, &(page)->flags)
 #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
--- linux-2.6.21-rc6-mm1/include/asm-alpha/mman.h.madv_free	2007-04-17 02:17:19.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-alpha/mman.h	2007-04-17 02:22:46.000000000 -0400
@@ -42,6 +42,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
 #define MADV_DONTNEED	6		/* don't need these pages */
+#define MADV_FREE	7		/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/asm-generic/mman.h.madv_free	2007-04-17 02:17:19.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-generic/mman.h	2007-04-17 02:22:46.000000000 -0400
@@ -29,6 +29,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/mm/memory.c.madv_free	2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/mm/memory.c	2007-04-17 02:22:46.000000000 -0400
@@ -432,6 +432,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	int dirty = 0;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -466,6 +467,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags)) {
+		dirty = pte_dirty(pte);
 		ptep_set_wrprotect(src_mm, addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
@@ -483,6 +485,8 @@ copy_one_pte(struct mm_struct *dst_mm, s
 		get_page(page);
 		page_dup_rmap(page, vma, addr);
 		rss[!!PageAnon(page)]++;
+		if (dirty && PageLazyFree(page))
+			ClearPageLazyFree(page);
 	}
 
 out_set_pte:
@@ -661,6 +665,25 @@ static unsigned long zap_pte_range(struc
 				    (page->index < details->first_index ||
 				     page->index > details->last_index))
 					continue;
+
+				/*
+				 * MADV_FREE is used to lazily recycle
+				 * anon memory.  The process no longer
+				 * needs the data and wants to avoid IO.
+				 */
+				if (details->madv_free && PageAnon(page)) {
+					if (unlikely(PageSwapCache(page)) &&
+					    !TestSetPageLocked(page)) {
+						remove_exclusive_swap_page(page);
+						unlock_page(page);
+					}
+					ptep_clear_flush_dirty(vma, addr, pte);
+					ptep_clear_flush_young(vma, addr, pte);
+					SetPageLazyFree(page);
+					if (PageActive(page))
+						deactivate_tail_page(page);
+					continue;
+				}
 			}
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
@@ -689,7 +713,8 @@ static unsigned long zap_pte_range(struc
 		 * If details->check_mapping, we leave swap entries;
 		 * if details->nonlinear_vma, we leave file entries.
 		 */
-		if (unlikely(details))
+		if (unlikely(details && (details->check_mapping ||
+				details->nonlinear_vma)))
 			continue;
 		if (!pte_file(ptent))
 			free_swap_and_cache(pte_to_swp_entry(ptent));
@@ -755,7 +780,8 @@ static unsigned long unmap_page_range(st
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping && !details->nonlinear_vma)
+	if (details && !details->check_mapping && !details->nonlinear_vma
+			&& !details->madv_free)
 		details = NULL;
 
 	BUG_ON(addr >= end);
--- linux-2.6.21-rc6-mm1/mm/page_alloc.c.madv_free	2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/mm/page_alloc.c	2007-04-17 02:22:46.000000000 -0400
@@ -266,6 +266,7 @@ static void bad_page(struct page *page)
 			1 << PG_slab    |
 			1 << PG_swapcache |
 			1 << PG_writeback |
+			1 << PG_lazyfree |
 			1 << PG_buddy );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
@@ -514,6 +515,8 @@ static inline int free_pages_check(struc
 		bad_page(page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
+	if (PageLazyFree(page))
+		__ClearPageLazyFree(page);
 	/*
 	 * For now, we report if PG_reserved was found set, but do not
 	 * clear it, and do not free the page.  But we shall soon need
@@ -661,6 +664,7 @@ static int prep_new_page(struct page *pa
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
+			1 << PG_lazyfree |
 			1 << PG_buddy ))))
 		bad_page(page);
 
--- linux-2.6.21-rc6-mm1/mm/swap.c.madv_free	2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/mm/swap.c	2007-04-17 02:22:46.000000000 -0400
@@ -152,6 +152,20 @@ void fastcall activate_page(struct page 
 	spin_unlock_irq(&zone->lru_lock);
 }
 
+void fastcall deactivate_tail_page(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock_irq(&zone->lru_lock);
+	if (PageLRU(page) && PageActive(page)) {
+		del_page_from_active_list(zone, page);
+		ClearPageActive(page);
+		add_page_to_inactive_list_tail(zone, page);
+		__count_vm_event(PGDEACTIVATE);
+	}
+	spin_unlock_irq(&zone->lru_lock);
+}
+
 /*
  * Mark a page as having seen activity.
  *
--- linux-2.6.21-rc6-mm1/mm/vmscan.c.madv_free	2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/mm/vmscan.c	2007-04-17 02:22:46.000000000 -0400
@@ -460,6 +460,24 @@ static unsigned long shrink_page_list(st
 
 		sc->nr_scanned++;
 
+		/* 
+		 * MADV_DONTNEED pages get reclaimed lazily, unless the
+		 * process reuses it before we get to it.
+		 */
+		if (PageLazyFree(page)) {
+			switch (try_to_unmap(page, 0)) {
+			case SWAP_FAIL:
+				ClearPageLazyFree(page);
+				goto activate_locked;
+			case SWAP_AGAIN:
+				ClearPageLazyFree(page);
+				goto keep_locked;
+			case SWAP_SUCCESS:
+				ClearPageLazyFree(page);
+				goto free_it;
+			}
+		}
+
 		if (!sc->may_swap && page_mapped(page))
 			goto keep_locked;
 
--- linux-2.6.21-rc6-mm1/mm/madvise.c.madv_free	2007-04-17 02:17:20.000000000 -0400
+++ linux-2.6.21-rc6-mm1/mm/madvise.c	2007-04-17 02:22:46.000000000 -0400
@@ -142,8 +142,12 @@ static long madvise_dontneed(struct vm_a
 			.last_index = ULONG_MAX,
 		};
 		zap_page_range(vma, start, end - start, &details);
-	} else
-		zap_page_range(vma, start, end - start, NULL);
+	} else {
+		struct zap_details details = {
+			.madv_free = 1,
+		};
+		zap_page_range(vma, start, end - start, &details);
+	}
 	return 0;
 }
 
@@ -215,7 +219,9 @@ madvise_vma(struct vm_area_struct *vma, 
 		error = madvise_willneed(vma, prev, start, end);
 		break;
 
+	/* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
 	case MADV_DONTNEED:
+	case MADV_FREE:
 		error = madvise_dontneed(vma, prev, start, end);
 		break;
 
--- linux-2.6.21-rc6-mm1/mm/rmap.c.madv_free	2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/mm/rmap.c	2007-04-17 02:22:46.000000000 -0400
@@ -707,7 +707,17 @@ static int try_to_unmap_one(struct page 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
 
-	if (PageAnon(page)) {
+	/* MADV_FREE is used to lazily free memory from userspace. */
+	if (PageLazyFree(page) && !migration) {
+		/* There is new data in the page.  Reinstate it. */
+		if (unlikely(pte_dirty(pteval))) {
+			set_pte_at(mm, address, pte, pteval);
+			ret = SWAP_FAIL;
+			goto out_unmap;
+		}
+		/* Throw the page away. */
+		dec_mm_counter(mm, anon_rss);
+	} else if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
 
 		if (PageSwapCache(page)) {


* Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
  2007-04-17  7:15 [PATCH] lazy freeing of memory through MADV_FREE Rik van Riel
@ 2007-04-19 21:15 ` Rik van Riel
  2007-04-20 21:03   ` Andrew Morton
  2007-04-20 20:57 ` [PATCH] lazy freeing of memory through MADV_FREE Andrew Morton
  2007-04-22  8:18 ` Andrew Morton
  2 siblings, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2007-04-19 21:15 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Andrew Morton, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 459 bytes --]

Restore MADV_DONTNEED to its original Linux behaviour.  This is still
not the same behaviour as POSIX, but applications may be depending on
the Linux behaviour already. Besides, glibc catches POSIX_MADV_DONTNEED
and makes sure nothing is done...

Signed-off-by: Rik van Riel <riel@redhat.com>

---
This is to be applied on top of the original MADV_FREE patch.
It turns out that the current glibc patch already falls back
to MADV_DONTNEED if it gets an -EINVAL.
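The fallback mentioned above could be sketched in userspace as follows. This is an illustration only, not glibc's actual code: release_range is a hypothetical helper name, and the sketch assumes that a kernel without MADV_FREE support reports EINVAL for the unknown advice value.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: try the lazy MADV_FREE first and fall back to
 * MADV_DONTNEED when the kernel rejects it with EINVAL.
 * Returns 0 on success, -1 on failure with errno set by madvise().
 */
int release_range(void *addr, size_t len)
{
#ifdef MADV_FREE
	if (madvise(addr, len, MADV_FREE) == 0)
		return 0;
	if (errno != EINVAL)
		return -1;
	/* EINVAL: assume MADV_FREE is not supported here and fall through. */
#endif
	return madvise(addr, len, MADV_DONTNEED);
}
```

Either way the caller must treat the old contents of the range as gone once the helper returns 0.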

[-- Attachment #2: linux-2.6-madv-dontneed-restore.patch --]
[-- Type: text/x-patch, Size: 1317 bytes --]

--- linux-2.6.20.x86_64/mm/madvise.c.madv_free	2007-04-19 16:46:22.000000000 -0400
+++ linux-2.6.20.x86_64/mm/madvise.c	2007-04-19 16:52:19.000000000 -0400
@@ -130,7 +130,8 @@ static long madvise_willneed(struct vm_a
  */
 static long madvise_dontneed(struct vm_area_struct * vma,
 			     struct vm_area_struct ** prev,
-			     unsigned long start, unsigned long end)
+			     unsigned long start, unsigned long end,
+			     int behavior)
 {
 	*prev = vma;
 	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
@@ -142,12 +143,14 @@ static long madvise_dontneed(struct vm_a
 			.last_index = ULONG_MAX,
 		};
 		zap_page_range(vma, start, end - start, &details);
-	} else {
+	} else if (behavior == MADV_FREE) {
 		struct zap_details details = {
 			.madv_free = 1,
 		};
 		zap_page_range(vma, start, end - start, &details);
-	}
+	} else /* behavior == MADV_DONTNEED */
+		zap_page_range(vma, start, end - start, NULL);
+
 	return 0;
 }
 
@@ -219,10 +222,9 @@ madvise_vma(struct vm_area_struct *vma, 
 		error = madvise_willneed(vma, prev, start, end);
 		break;
 
-	/* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
 	case MADV_DONTNEED:
 	case MADV_FREE:
-		error = madvise_dontneed(vma, prev, start, end);
+		error = madvise_dontneed(vma, prev, start, end, behavior);
 		break;
 
 	default:


* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-17  7:15 [PATCH] lazy freeing of memory through MADV_FREE Rik van Riel
  2007-04-19 21:15 ` [PATCH] lazy freeing of memory through MADV_FREE 2/2 Rik van Riel
@ 2007-04-20 20:57 ` Andrew Morton
  2007-04-20 21:38   ` Rik van Riel
  2007-04-22  8:18 ` Andrew Morton
  2 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2007-04-20 20:57 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

On Tue, 17 Apr 2007 03:15:51 -0400
Rik van Riel <riel@redhat.com> wrote:

> Make it possible for applications to have the kernel free memory
> lazily.  This reduces a repeated free/malloc cycle from freeing
> pages and allocating them, to just marking them freeable.  If the
> application wants to reuse them before the kernel needs the memory,
> not even a page fault will happen.
> 
> This patch, together with Ulrich's glibc change, increases
> MySQL sysbench performance by a factor of 2 on my quad core
> test system.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> 
> ---
> Ulrich Drepper has test glibc RPMS for this functionality at:
> 
>      http://people.redhat.com/drepper/rpms
> 
> Andrew, I have stress tested this patch for a few days now and
> have not been able to find any more bugs.  I believe it is ready
> to be merged in -mm, and upstream at the next merge window.
> 
> When the patch goes upstream, I will submit a small follow-up
> patch to revert MADV_DONTNEED behaviour to what it did previously
> and have the new behaviour trigger only on MADV_FREE: at that
> point people will have to get new test RPMs of glibc.
> 
> 

I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".

- Nick's patch also will help this problem.  It could be that your patch
  no longer offers a 2x speedup when combined with Nick's patch.

  It could well be that the combination of the two is even better, but it
  would be nice to firm that up a bit.  Chewing a page flag is an expensive
  thing to do.

  I do go on about that.  But we're adding page flags at about one per
  year, and when we run out we're screwed - we'll need to grow the
  pageframe.

- I need to update your patch for Nick's patch.  Please confirm that
  down_read(mmap_sem) is sufficient for MADV_FREE.


Stylistic nit:

> +	if (PageLazyFree(page) && !migration) {
> +		/* There is new data in the page.  Reinstate it. */
> +		if (unlikely(pte_dirty(pteval))) {
> +			set_pte_at(mm, address, pte, pteval);
> +			ret = SWAP_FAIL;
> +			goto out_unmap;
> +		}

The comment should be inside the second `if' statement.  As it is, it
looks like we reinstate the page if (PageLazyFree(page) && !migration).
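The decision being discussed can be modelled in a few lines of plain C. This is a toy userspace model, not kernel code: every name in it (toy_page, toy_madv_free, toy_write, toy_reclaim) is invented for illustration, and it only mirrors the logic visible in the quoted hunk and in the shrink_page_list() hunk of the patch.

```c
#include <stdbool.h>

/* Toy model of one anonymous page; all names here are hypothetical. */
struct toy_page {
	bool lazy_free;		/* models PG_lazyfree */
	bool pte_dirty;		/* models the dirty bit in the mapping PTE */
};

enum toy_result { TOY_KEPT, TOY_DISCARDED, TOY_NORMAL_RECLAIM };

/* Models madvise(MADV_FREE): clear the dirty bit, mark the page lazy-free. */
void toy_madv_free(struct toy_page *p)
{
	p->pte_dirty = false;
	p->lazy_free = true;
}

/* Models the application writing to the page after MADV_FREE. */
void toy_write(struct toy_page *p)
{
	p->pte_dirty = true;
}

/* Models the reclaim-time decision for a single page. */
enum toy_result toy_reclaim(struct toy_page *p)
{
	if (!p->lazy_free)
		return TOY_NORMAL_RECLAIM;	/* ordinary anon page path */

	p->lazy_free = false;		/* the flag is cleared in every case */
	if (p->pte_dirty)
		return TOY_KEPT;	/* there is new data: reinstate the page */
	return TOY_DISCARDED;		/* untouched since MADV_FREE: throw it away */
}
```

In this model the "reinstate" outcome belongs only to the dirty case, which is why the comment reads more naturally inside the inner `if'.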

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
  2007-04-19 21:15 ` [PATCH] lazy freeing of memory through MADV_FREE 2/2 Rik van Riel
@ 2007-04-20 21:03   ` Andrew Morton
  2007-04-20 21:24     ` Ulrich Drepper
  0 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2007-04-20 21:03 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Jakub Jelinek, linux-kernel, linux-mm

On Thu, 19 Apr 2007 17:15:28 -0400
Rik van Riel <riel@redhat.com> wrote:

> Restore MADV_DONTNEED to its original Linux behaviour.  This is still
> not the same behaviour as POSIX, but applications may be depending on
> the Linux behaviour already. Besides, glibc catches POSIX_MADV_DONTNEED
> and makes sure nothing is done...

OK, we need to flesh this out a lot please.  People often get confused
about what our MADV_DONTNEED behaviour is.  I regularly forget, then look
at the code, then get it wrong.  That's for mainline, let alone older
kernels whose behaviour is gawd-knows-what.

So...  For the changelog (and the manpage) could we please have a full
description of the 2.6.21 behaviour and the 2.6.21-post-rik behaviour (and
the 2.4 behaviour, if it differs at all)?  Also some code comments to
demystify all of this once and for all?

Thanks.


* Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
  2007-04-20 21:03   ` Andrew Morton
@ 2007-04-20 21:24     ` Ulrich Drepper
  2007-04-21  7:37       ` Hugh Dickins
  0 siblings, 1 reply; 43+ messages in thread
From: Ulrich Drepper @ 2007-04-20 21:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, Jakub Jelinek, linux-kernel, linux-mm

On 4/20/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> OK, we need to flesh this out a lot please.  People often get confused
> about what our MADV_DONTNEED behaviour is.

Well, there's not really much to flesh out.  The current MADV_DONTNEED
is useful in some situations.  The behavior cannot be changed; even
glibc will rely on it for the case when MADV_FREE is not supported.

What might be nice to have is a POSIX-compliant POSIX_MADV_DONTNEED
implementation.  We currently do nothing, which is OK since no test
suite can detect that.  But some code might want to use the real
behavior and we're missing an optimization possibility.

Just for reference: the current MADV_DONTNEED behavior is to throw away
data in the range.  The POSIX_MADV_DONTNEED behavior is to never lose
data.  I.e., file backed data is written back, anon data is at most
swapped out.
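The difference between the two behaviours can be observed from userspace with a small experiment. This is a sketch under stated assumptions: the two function names are made up, the mapping is private and anonymous, and the second function relies on the C library treating POSIX_MADV_DONTNEED as non-destructive, as described above for glibc.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one private anonymous page and store a marker. */
static unsigned char *map_marked_page(size_t *len)
{
	unsigned char *p;

	*len = (size_t)sysconf(_SC_PAGESIZE);
	p = mmap(NULL, *len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	p[0] = 0x5a;
	return p;
}

/* Byte seen after Linux MADV_DONTNEED, or -1 on error. */
int dontneed_readback(void)
{
	size_t len;
	unsigned char *p = map_marked_page(&len);
	int seen = -1;

	if (!p)
		return -1;
	if (madvise(p, len, MADV_DONTNEED) == 0)
		seen = p[0];	/* the next access gets a fresh zero-filled page */
	munmap(p, len);
	return seen;
}

/* Byte seen after posix_madvise(POSIX_MADV_DONTNEED), or -1 on error. */
int posix_dontneed_readback(void)
{
	size_t len;
	unsigned char *p = map_marked_page(&len);
	int seen = -1;

	if (!p)
		return -1;
	if (posix_madvise(p, len, POSIX_MADV_DONTNEED) == 0)
		seen = p[0];	/* POSIX semantics: the data must survive */
	munmap(p, len);
	return seen;
}
```

On Linux the first function is expected to see a zero byte, because the anonymous page was dropped, while the second should still see the marker.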


* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-20 20:57 ` [PATCH] lazy freeing of memory through MADV_FREE Andrew Morton
@ 2007-04-20 21:38   ` Rik van Riel
  2007-04-20 22:06     ` Andrew Morton
  2007-04-21  7:24     ` Hugh Dickins
  0 siblings, 2 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-20 21:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

Andrew Morton wrote:

> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> 
> - Nick's patch also will help this problem.  It could be that your patch
>   no longer offers a 2x speedup when combined with Nick's patch.
> 
>   It could well be that the combination of the two is even better, but it
>   would be nice to firm that up a bit.  

I'll test that.

>   I do go on about that.  But we're adding page flags at about one per
>   year, and when we run out we're screwed - we'll need to grow the
>   pageframe.

If you want, I can take a look at folding this into the
->mapping pointer.  I can guarantee you it won't be
pretty, though :)

> - I need to update your patch for Nick's patch.  Please confirm that
>   down_read(mmap_sem) is sufficient for MADV_FREE.

It is.  MADV_FREE needs no more protection than MADV_DONTNEED.

> Stylistic nit:
> 
>> +	if (PageLazyFree(page) && !migration) {
>> +		/* There is new data in the page.  Reinstate it. */
>> +		if (unlikely(pte_dirty(pteval))) {
>> +			set_pte_at(mm, address, pte, pteval);
>> +			ret = SWAP_FAIL;
>> +			goto out_unmap;
>> +		}
> 
> The comment should be inside the second `if' statement.  As it is, it
> looks like we reinstate the page if (PageLazyFree(page) && !migration).

Want me to move it?

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-20 21:38   ` Rik van Riel
@ 2007-04-20 22:06     ` Andrew Morton
  2007-04-20 23:52       ` Rik van Riel
  2007-04-21  7:24     ` Hugh Dickins
  1 sibling, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2007-04-20 22:06 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

On Fri, 20 Apr 2007 17:38:06 -0400
Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> 
> > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > 
> > - Nick's patch also will help this problem.  It could be that your patch
> >   no longer offers a 2x speedup when combined with Nick's patch.
> > 
> >   It could well be that the combination of the two is even better, but it
> >   would be nice to firm that up a bit.  
> 
> I'll test that.

Thanks.

> >   I do go on about that.  But we're adding page flags at about one per
> >   year, and when we run out we're screwed - we'll need to grow the
> >   pageframe.
> 
> If you want, I can take a look at folding this into the
> ->mapping pointer.  I can guarantee you it won't be
> pretty, though :)

Well, let's see how fugly it ends up looking?

> > - I need to update your patch for Nick's patch.  Please confirm that
> >   down_read(mmap_sem) is sufficient for MADV_FREE.
> 
> It is.  MADV_FREE needs no more protection than MADV_DONTNEED.
> 
> > Stylistic nit:
> > 
> >> +	if (PageLazyFree(page) && !migration) {
> >> +		/* There is new data in the page.  Reinstate it. */
> >> +		if (unlikely(pte_dirty(pteval))) {
> >> +			set_pte_at(mm, address, pte, pteval);
> >> +			ret = SWAP_FAIL;
> >> +			goto out_unmap;
> >> +		}
> > 
> > The comment should be inside the second `if' statement.  As it is, it
> > looks like we reinstate the page if (PageLazyFree(page) && !migration).
> 
> Want me to move it?

I did that, thanks.


* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-20 22:06     ` Andrew Morton
@ 2007-04-20 23:52       ` Rik van Riel
  2007-04-21  0:48         ` Eric Dumazet
                           ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-20 23:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, shak

Andrew Morton wrote:
> On Fri, 20 Apr 2007 17:38:06 -0400
> Rik van Riel <riel@redhat.com> wrote:
> 
>> Andrew Morton wrote:
>>
>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>
>>> - Nick's patch also will help this problem.  It could be that your patch
>>>   no longer offers a 2x speedup when combined with Nick's patch.
>>>
>>>   It could well be that the combination of the two is even better, but it
>>>   would be nice to firm that up a bit.  
>> I'll test that.
> 
> Thanks.

Well, good news.

It turns out that Nick's patch does not improve peak
performance much, but it does prevent the decline when
running with 16 threads on my quad core CPU!

We _definitely_ want both patches; there's a huge benefit
in having them both.

Here are the transactions per second for each combination:

threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem

   1        610        609            596                 545
   2       1032       1136           1196                1200
   4       1070       1128           2014                2024
   8       1000       1088           1665                2087
  16        779       1073           1310                1999
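To put the table above in relative terms, the ratio of the last column to the vanilla column can be computed as below. This is only arithmetic on the numbers posted in this mail; the struct and function names are made up for illustration.

```c
#include <stddef.h>

/* One row of the table: vanilla vs. madv_free + mmap_sem, in tps. */
struct bench_row {
	int threads;
	int vanilla_tps;
	int combined_tps;
};

static const struct bench_row bench[] = {
	{  1,  610,  545 },
	{  2, 1032, 1200 },
	{  4, 1070, 2024 },
	{  8, 1000, 2087 },
	{ 16,  779, 1999 },
};

/* Speedup of the combined patches over vanilla for row i, or -1.0 if out of range. */
double combined_speedup(size_t i)
{
	if (i >= sizeof(bench) / sizeof(bench[0]))
		return -1.0;
	return (double)bench[i].combined_tps / bench[i].vanilla_tps;
}
```

With these figures the ratio is about 0.89 for 1 thread, 1.89 for 4 threads and 2.57 for 16 threads.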



* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-20 23:52       ` Rik van Riel
@ 2007-04-21  0:48         ` Eric Dumazet
  2007-04-21  3:58           ` Rik van Riel
  2007-04-21  7:12         ` Jakub Jelinek
  2007-04-22  2:36         ` Nick Piggin
  2 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2007-04-21  0:48 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak

Rik van Riel wrote:
> Andrew Morton wrote:
>> On Fri, 20 Apr 2007 17:38:06 -0400
>> Rik van Riel <riel@redhat.com> wrote:
>>
>>> Andrew Morton wrote:
>>>
>>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>>
>>>> - Nick's patch also will help this problem.  It could be that your 
>>>> patch
>>>>   no longer offers a 2x speedup when combined with Nick's patch.
>>>>
>>>>   It could well be that the combination of the two is even better, 
>>>> but it
>>>>   would be nice to firm that up a bit.  
>>> I'll test that.
>>
>> Thanks.
> 
> Well, good news.
> 
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
> 
> We _definitely_ want both patches; there's a huge benefit
> in having them both.
> 
> Here are the transactions/seconds for each combination:
> 
>    vanilla   new glibc  madv_free kernel   madv_free + mmap_sem
> threads
> 
> 1     610         609             596                545

545 tps versus 610 tps for one thread? That seems quite bad, no?

Could you please find an explanation for this?

> 2    1032        1136            1196               1200
> 4    1070        1128            2014               2024
> 8    1000        1088            1665               2087
> 16    779        1073            1310               1999
> 
> 

Thank you


* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-21  0:48         ` Eric Dumazet
@ 2007-04-21  3:58           ` Rik van Riel
  0 siblings, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-21  3:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andrew Morton, linux-kernel, linux-mm, shak

Eric Dumazet wrote:
> Rik van Riel wrote:
>> Andrew Morton wrote:
>>> On Fri, 20 Apr 2007 17:38:06 -0400
>>> Rik van Riel <riel@redhat.com> wrote:
>>>
>>>> Andrew Morton wrote:
>>>>
>>>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>>>
>>>>> - Nick's patch also will help this problem.  It could be that your 
>>>>> patch
>>>>>   no longer offers a 2x speedup when combined with Nick's patch.
>>>>>
>>>>>   It could well be that the combination of the two is even better, 
>>>>> but it
>>>>>   would be nice to firm that up a bit.  
>>>> I'll test that.
>>>
>>> Thanks.
>>
>> Well, good news.
>>
>> It turns out that Nick's patch does not improve peak
>> performance much, but it does prevent the decline when
>> running with 16 threads on my quad core CPU!
>>
>> We _definitely_ want both patches; there's a huge benefit
>> in having them both.
>>
>> Here are the transactions/seconds for each combination:
>>
>>    vanilla   new glibc  madv_free kernel   madv_free + mmap_sem
>> threads
>>
>> 1     610         609             596                545
> 
> 545 tps versus 610 tps for one thread? That seems quite bad, no?
> 
> Could you please find an explanation for this?

I have no idea why this happens.  Especially the last one,
going from a write lock to a read lock on the mmap_sem
should not make ANY difference whatsoever since we're
running single threaded!

>> 2    1032        1136            1196               1200
>> 4    1070        1128            2014               2024
>> 8    1000        1088            1665               2087
>> 16    779        1073            1310               1999

Performance with 2 database threads is way better though,
and performance with 4 or more threads more than doubles...
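
To put numbers on that, here is a small sketch that divides each column
of the table by the vanilla column.  The struct and function names are
made up for illustration; only the figures come from the table.  For the
last column it works out to roughly 0.89x at 1 thread, 1.16x at 2,
1.89x at 4, 2.09x at 8 and 2.57x at 16, so 4 threads falls just short of
doubling while 8 and 16 exceed it.

```c
#include <assert.h>
#include <stdio.h>

/* Transactions/second copied from the table above.  Columns are:
   vanilla, new glibc, madv_free kernel, madv_free + mmap_sem.  */
struct row { int threads; int tps[4]; };

static const struct row results[] = {
  {  1, {  610,  609,  596,  545 } },
  {  2, { 1032, 1136, 1196, 1200 } },
  {  4, { 1070, 1128, 2014, 2024 } },
  {  8, { 1000, 1088, 1665, 2087 } },
  { 16, {  779, 1073, 1310, 1999 } },
};

/* Ratio of one column to the vanilla column for a given row.  */
static double speedup(const struct row *r, int column)
{
  assert(column >= 0 && column < 4 && r->tps[0] > 0);
  return (double) r->tps[column] / r->tps[0];
}

/* Print the three patched columns relative to vanilla, one row per line.  */
static void print_speedups(void)
{
  for (size_t i = 0; i < sizeof results / sizeof results[0]; i++) {
    const struct row *r = &results[i];
    printf("%2d threads: %.2fx  %.2fx  %.2fx\n", r->threads,
           speedup(r, 1), speedup(r, 2), speedup(r, 3));
  }
}
```

Calling print_speedups() from a main() prints one line per thread count.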

If you have an explanation on why single threaded performance
went down a little on my quad core system, please let me know.

Does performance suffer at all on a real UP system?

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-20 23:52       ` Rik van Riel
  2007-04-21  0:48         ` Eric Dumazet
@ 2007-04-21  7:12         ` Jakub Jelinek
  2007-04-23  4:36           ` Nick Piggin
  2007-04-22  2:36         ` Nick Piggin
  2 siblings, 1 reply; 43+ messages in thread
From: Jakub Jelinek @ 2007-04-21  7:12 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak

On Fri, Apr 20, 2007 at 07:52:44PM -0400, Rik van Riel wrote:
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
> 
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
> 
> Here are the transactions/seconds for each combination:
> 
>    vanilla   new glibc  madv_free kernel   madv_free + mmap_sem
> threads
> 
> 1     610         609             596                545
> 2    1032        1136            1196               1200
> 4    1070        1128            2014               2024
> 8    1000        1088            1665               2087
> 16    779        1073            1310               1999

FYI, I have uploaded a testing glibc that uses MADV_FREE and falls back
to MADV_DONTNEED if MADV_FREE is not available, to
http://people.redhat.com/jakub/glibc/2.5.90-21.1/
and I'm also attaching the glibc patch for those who want to build it
themselves:

2007-04-19  Ulrich Drepper  <drepper@redhat.com>
	    Jakub Jelinek  <jakub@redhat.com>

	* malloc/arena.c (heap_info): Add mprotect_size field, adjust pad.
	(new_heap): Initialize mprotect_size.
	(no_madv_free): New variable.
	(grow_heap): When growing, only mprotect from mprotect_size till
	new_size if mprotect_size is smaller.  When shrinking, use PROT_NONE
	MMAP for __libc_enable_secure only, otherwise if MADV_FREE is
	available use it and fall back to MADV_DONTNEED.
	* sysdeps/unix/sysv/linux/alpha/bits/mman.h (MADV_FREE): Define.
	* sysdeps/unix/sysv/linux/ia64/bits/mman.h (MADV_FREE): Likewise.
	* sysdeps/unix/sysv/linux/i386/bits/mman.h (MADV_FREE): Likewise.
	* sysdeps/unix/sysv/linux/s390/bits/mman.h (MADV_FREE): Likewise.
	* sysdeps/unix/sysv/linux/powerpc/bits/mman.h (MADV_FREE): Likewise.
	* sysdeps/unix/sysv/linux/x86_64/bits/mman.h (MADV_FREE): Likewise.
	* sysdeps/unix/sysv/linux/sparc/bits/mman.h (MADV_FREE): Likewise.
	* sysdeps/unix/sysv/linux/sh/bits/mman.h (MADV_FREE): Likewise.

--- libc/malloc/arena.c.jj	2006-10-31 23:05:31.000000000 +0100
+++ libc/malloc/arena.c	2007-04-19 18:54:20.000000000 +0200
@@ -1,5 +1,6 @@
 /* Malloc implementation for multiple threads without lock contention.
-   Copyright (C) 2001,2002,2003,2004,2005,2006 Free Software Foundation, Inc.
+   Copyright (C) 2001,2002,2003,2004,2005,2006,2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
    Contributed by Wolfram Gloger <wg@malloc.de>, 2001.
 
@@ -59,10 +60,12 @@ typedef struct _heap_info {
   mstate ar_ptr; /* Arena for this heap. */
   struct _heap_info *prev; /* Previous heap. */
   size_t size;   /* Current size in bytes. */
+  size_t mprotect_size;	/* Size in bytes that has been mprotected
+			   PROT_READ|PROT_WRITE.  */
   /* Make sure the following data is properly aligned, particularly
      that sizeof (heap_info) + 2 * SIZE_SZ is a multiple of
-     MALLOG_ALIGNMENT. */
-  char pad[-5 * SIZE_SZ & MALLOC_ALIGN_MASK];
+     MALLOC_ALIGNMENT. */
+  char pad[-6 * SIZE_SZ & MALLOC_ALIGN_MASK];
 } heap_info;
 
 /* Get a compile-time error if the heap_info padding is not correct
@@ -692,10 +695,15 @@ new_heap(size, top_pad) size_t size, top
   }
   h = (heap_info *)p2;
   h->size = size;
+  h->mprotect_size = size;
   THREAD_STAT(stat_n_heaps++);
   return h;
 }
 
+#if defined _LIBC && defined MADV_FREE
+static int no_madv_free;
+#endif
+
 /* Grow or shrink a heap.  size is automatically rounded up to a
    multiple of the page size if it is positive. */
 
@@ -714,17 +722,49 @@ grow_heap(h, diff) heap_info *h; long di
     new_size = (long)h->size + diff;
     if((unsigned long) new_size > (unsigned long) HEAP_MAX_SIZE)
       return -1;
-    if(mprotect((char *)h + h->size, diff, PROT_READ|PROT_WRITE) != 0)
-      return -2;
+    if((unsigned long) new_size > h->mprotect_size) {
+      if (mprotect((char *)h + h->mprotect_size,
+		   (unsigned long) new_size - h->mprotect_size,
+		   PROT_READ|PROT_WRITE) != 0)
+	return -2;
+      h->mprotect_size = new_size;
+    }
   } else {
     new_size = (long)h->size + diff;
     if(new_size < (long)sizeof(*h))
       return -1;
     /* Try to re-map the extra heap space freshly to save memory, and
        make it inaccessible. */
-    if((char *)MMAP((char *)h + new_size, -diff, PROT_NONE,
-                    MAP_PRIVATE|MAP_FIXED) == (char *) MAP_FAILED)
-      return -2;
+#ifdef _LIBC
+    if (__builtin_expect (__libc_enable_secure, 0))
+#else
+    if (1)
+#endif
+      {
+	if((char *)MMAP((char *)h + new_size, -diff, PROT_NONE,
+			MAP_PRIVATE|MAP_FIXED) == (char *) MAP_FAILED)
+	  return -2;
+	h->mprotect_size = new_size;
+      }
+#ifdef _LIBC
+    else
+      {
+# ifdef MADV_FREE
+	if (!__builtin_expect (no_madv_free, 0))
+	  {
+	    if (__builtin_expect (madvise ((char *)h + new_size,
+					   -diff, MADV_FREE), 0) == -1
+		&& errno == EINVAL)
+	      {
+		no_madv_free = 1;
+		madvise ((char *)h + new_size, -diff, MADV_DONTNEED);
+	      }
+	  }
+	else
+# endif
+	  madvise ((char *)h + new_size, -diff, MADV_DONTNEED);
+      }
+#endif
     /*fprintf(stderr, "shrink %p %08lx\n", h, new_size);*/
   }
   h->size = new_size;
--- libc/sysdeps/unix/sysv/linux/alpha/bits/mman.h.jj	2006-05-02 16:33:44.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/alpha/bits/mman.h	2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,6 @@
 /* Definitions for POSIX memory map interface.  Linux/Alpha version.
-   Copyright (C) 1997, 1998, 2000, 2003, 2006 Free Software Foundation, Inc.
+   Copyright (C) 1997, 1998, 2000, 2003, 2006, 2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -96,6 +97,7 @@
 # define MADV_SEQUENTIAL 2	/* Expect sequential page references.  */
 # define MADV_WILLNEED   3	/* Will need these pages.  */
 # define MADV_DONTNEED   6	/* Don't need these pages.  */
+# define MADV_FREE	 7	/* Content can be freed.  */
 # define MADV_REMOVE	 9	/* Remove these pages and resources.  */
 # define MADV_DONTFORK	 10	/* Do not inherit across fork.  */
 # define MADV_DOFORK	 11	/* Do inherit across fork.  */
--- libc/sysdeps/unix/sysv/linux/ia64/bits/mman.h.jj	2006-05-02 16:33:44.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/ia64/bits/mman.h	2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,6 @@
 /* Definitions for POSIX memory map interface.  Linux/ia64 version.
-   Copyright (C) 1997,1998,2000,2003,2005,2006 Free Software Foundation, Inc.
+   Copyright (C) 1997,1998,2000,2003,2005,2006,2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -89,6 +90,7 @@
 # define MADV_SEQUENTIAL 2	/* Expect sequential page references.  */
 # define MADV_WILLNEED	 3	/* Will need these pages.  */
 # define MADV_DONTNEED	 4	/* Don't need these pages.  */
+# define MADV_FREE	 5	/* Content can be freed.  */
 # define MADV_REMOVE	 9	/* Remove these pages and resources.  */
 # define MADV_DONTFORK	 10	/* Do not inherit across fork.  */
 # define MADV_DOFORK	 11	/* Do inherit across fork.  */
--- libc/sysdeps/unix/sysv/linux/i386/bits/mman.h.jj	2006-05-02 16:33:44.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/i386/bits/mman.h	2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,6 @@
 /* Definitions for POSIX memory map interface.  Linux/i386 version.
-   Copyright (C) 1997, 2000, 2003, 2005, 2006 Free Software Foundation, Inc.
+   Copyright (C) 1997, 2000, 2003, 2005, 2006, 2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -88,6 +89,7 @@
 # define MADV_SEQUENTIAL 2	/* Expect sequential page references.  */
 # define MADV_WILLNEED	 3	/* Will need these pages.  */
 # define MADV_DONTNEED	 4	/* Don't need these pages.  */
+# define MADV_FREE	 5	/* Content can be freed.  */
 # define MADV_REMOVE	 9	/* Remove these pages and resources.  */
 # define MADV_DONTFORK	 10	/* Do not inherit across fork.  */
 # define MADV_DOFORK	 11	/* Do inherit across fork.  */
--- libc/sysdeps/unix/sysv/linux/s390/bits/mman.h.jj	2006-05-02 16:33:44.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/s390/bits/mman.h	2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,6 @@
 /* Definitions for POSIX memory map interface.  Linux/s390 version.
-   Copyright (C) 2000,2001,2002,2003,2005,2006 Free Software Foundation, Inc.
+   Copyright (C) 2000,2001,2002,2003,2005,2006,2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -89,6 +90,7 @@
 # define MADV_SEQUENTIAL 2	/* Expect sequential page references.  */
 # define MADV_WILLNEED	 3	/* Will need these pages.  */
 # define MADV_DONTNEED	 4	/* Don't need these pages.  */
+# define MADV_FREE	 5	/* Content can be freed.  */
 # define MADV_REMOVE	 9	/* Remove these pages and resources.  */
 # define MADV_DONTFORK	 10	/* Do not inherit across fork.  */
 # define MADV_DOFORK	 11	/* Do inherit across fork.  */
--- libc/sysdeps/unix/sysv/linux/powerpc/bits/mman.h.jj	2006-05-02 16:33:44.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/powerpc/bits/mman.h	2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,6 @@
 /* Definitions for POSIX memory map interface.  Linux/PowerPC version.
-   Copyright (C) 1997, 2000, 2003, 2005, 2006 Free Software Foundation, Inc.
+   Copyright (C) 1997, 2000, 2003, 2005, 2006, 2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -89,6 +90,7 @@
 # define MADV_SEQUENTIAL 2	/* Expect sequential page references.  */
 # define MADV_WILLNEED	 3	/* Will need these pages.  */
 # define MADV_DONTNEED	 4	/* Don't need these pages.  */
+# define MADV_FREE	 5	/* Content can be freed.  */
 # define MADV_REMOVE	 9	/* Remove these pages and resources.  */
 # define MADV_DONTFORK	 10	/* Do not inherit across fork.  */
 # define MADV_DOFORK	 11	/* Do inherit across fork.  */
--- libc/sysdeps/unix/sysv/linux/x86_64/bits/mman.h.jj	2006-05-02 16:33:46.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/x86_64/bits/mman.h	2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,5 @@
 /* Definitions for POSIX memory map interface.  Linux/x86_64 version.
-   Copyright (C) 2001, 2003, 2005, 2006 Free Software Foundation, Inc.
+   Copyright (C) 2001, 2003, 2005, 2006, 2007 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -89,6 +89,7 @@
 # define MADV_SEQUENTIAL 2	/* Expect sequential page references.  */
 # define MADV_WILLNEED	 3	/* Will need these pages.  */
 # define MADV_DONTNEED	 4	/* Don't need these pages.  */
+# define MADV_FREE	 5	/* Content can be freed.  */
 # define MADV_REMOVE	 9	/* Remove these pages and resources.  */
 # define MADV_DONTFORK	 10	/* Do not inherit across fork.  */
 # define MADV_DOFORK	 11	/* Do inherit across fork.  */
--- libc/sysdeps/unix/sysv/linux/sparc/bits/mman.h.jj	2006-05-02 16:33:44.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/sparc/bits/mman.h	2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,6 @@
 /* Definitions for POSIX memory map interface.  Linux/SPARC version.
-   Copyright (C) 1997,1999,2000,2003,2005,2006 Free Software Foundation, Inc.
+   Copyright (C) 1997,1999,2000,2003,2005,2006,2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -90,7 +91,7 @@
 # define MADV_SEQUENTIAL 2	/* Expect sequential page references.  */
 # define MADV_WILLNEED	 3	/* Will need these pages.  */
 # define MADV_DONTNEED	 4	/* Don't need these pages.  */
-# define MADV_FREE	 5	/* Content can be freed (Solaris).  */
+# define MADV_FREE	 5	/* Content can be freed.  */
 # define MADV_REMOVE	 9	/* Remove these pages and resources.  */
 # define MADV_DONTFORK	 10	/* Do not inherit across fork.  */
 # define MADV_DOFORK	 11	/* Do inherit across fork.  */
--- libc/sysdeps/unix/sysv/linux/sh/bits/mman.h.jj	2006-05-02 16:33:44.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/sh/bits/mman.h	2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,6 @@
 /* Definitions for POSIX memory map interface.  Linux/SH version.
-   Copyright (C) 1997,1999,2000,2003,2005,2006 Free Software Foundation, Inc.
+   Copyright (C) 1997,1999,2000,2003,2005,2006,2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -88,6 +89,7 @@
 # define MADV_SEQUENTIAL 2	/* Expect sequential page references.  */
 # define MADV_WILLNEED	 3	/* Will need these pages.  */
 # define MADV_DONTNEED	 4	/* Don't need these pages.  */
+# define MADV_FREE	 5	/* Content can be freed.  */
 # define MADV_REMOVE	 9	/* Remove these pages and resources.  */
 # define MADV_DONTFORK	 10	/* Do not inherit across fork.  */
 # define MADV_DOFORK	 11	/* Do inherit across fork.  */
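
For readers who want to try the same idea outside glibc, the shrink
path above can be sketched as a standalone helper.  This is only an
illustration under stated assumptions: release_range() is a hypothetical
name, not something the patch adds, and unlike grow_heap() it reports
errors other than EINVAL to the caller instead of ignoring them.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: tell the kernel that [addr, addr + len) is no
   longer needed, preferring MADV_FREE when the headers and the running
   kernel both know about it, and falling back to MADV_DONTNEED.
   Returns 0 on success, -1 with errno set on failure.  */
static int release_range(void *addr, size_t len)
{
  assert(addr != NULL && len > 0);
#ifdef MADV_FREE
  /* Set once the kernel has rejected MADV_FREE with EINVAL, so later
     calls skip the doomed attempt (same idea as no_madv_free above).  */
  static int no_madv_free;

  if (!no_madv_free) {
    if (madvise(addr, len, MADV_FREE) == 0)
      return 0;
    if (errno != EINVAL)
      return -1;
    no_madv_free = 1;
  }
#endif
  return madvise(addr, len, MADV_DONTNEED);
}
```

A caller would pass a page-aligned address and length, for example the
tail of a heap that has just been shrunk.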


	Jakub


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-20 21:38   ` Rik van Riel
  2007-04-20 22:06     ` Andrew Morton
@ 2007-04-21  7:24     ` Hugh Dickins
  2007-04-21 18:06       ` Rik van Riel
  1 sibling, 1 reply; 43+ messages in thread
From: Hugh Dickins @ 2007-04-21  7:24 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm

On Fri, 20 Apr 2007, Rik van Riel wrote:
> Andrew Morton wrote:
> 
> >   I do go on about that.  But we're adding page flags at about one per
> >   year, and when we run out we're screwed - we'll need to grow the
> >   pageframe.
> 
> If you want, I can take a look at folding this into the
> ->mapping pointer.  I can guarantee you it won't be
> pretty, though :)

Please don't.  If we're going to stuff another pageflag into there,
let it be PageSwapCache, the natural partner of PageAnon, rather than
whatever our latest pageflag happens to be.  I'll look into it - but
do keep an eye on me, I've developed a dubious track record of
obstructing other people's attempts to save pageflags.

Hugh


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
  2007-04-20 21:24     ` Ulrich Drepper
@ 2007-04-21  7:37       ` Hugh Dickins
  2007-04-21 16:32         ` Ulrich Drepper
  0 siblings, 1 reply; 43+ messages in thread
From: Hugh Dickins @ 2007-04-21  7:37 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Andrew Morton, Rik van Riel, Jakub Jelinek, linux-kernel, linux-mm

On Fri, 20 Apr 2007, Ulrich Drepper wrote:
> 
> Just for reference: the MADV_CURRENT behavior is to throw away data in
> the range.

Not exactly.  The Linux MADV_DONTNEED never throws away data from a
PROT_WRITE,MAP_SHARED mapping (or shm) - it propagates the dirty bit,
the page will eventually get written out to file, and can be retrieved
later by subsequent access.  But the Linux MADV_DONTNEED does throw away
data from a PROT_WRITE,MAP_PRIVATE mapping (or brk or stack) - those
changes are discarded, and a subsequent access will revert to zeroes
or the underlying mapped file.  Been like that since before 2.4.0.
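
The MAP_PRIVATE case is easy to see from userspace.  A minimal sketch,
assuming a Linux system; the function name is made up for illustration:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Write `value` into a private anonymous page, drop the page with
   MADV_DONTNEED, and return the byte that is read back afterwards.
   Returns -1 if mmap or madvise fails.  */
static int byte_after_dontneed(unsigned char value)
{
  size_t len = 4096;
  unsigned char *p;
  int seen;

  assert(value != 0);  /* a zero would make the result ambiguous */
  p = mmap(NULL, len, PROT_READ | PROT_WRITE,
           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return -1;
  p[0] = value;
  if (madvise(p, len, MADV_DONTNEED) != 0) {
    munmap(p, len);
    return -1;
  }
  seen = p[0];  /* faults in a fresh zero-filled page */
  munmap(p, len);
  return seen;
}
```

On Linux this should return 0 whatever nonzero value was written,
because the modified page is discarded rather than swapped out.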

> The POSIX_MADV_DONTNEED behavior is to never lose data.
> I.e., file backed data is written back, anon data is at most swapped
> out.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
  2007-04-21  7:37       ` Hugh Dickins
@ 2007-04-21 16:32         ` Ulrich Drepper
  0 siblings, 0 replies; 43+ messages in thread
From: Ulrich Drepper @ 2007-04-21 16:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Rik van Riel, Jakub Jelinek, linux-kernel, linux-mm

On 4/21/07, Hugh Dickins <hugh@veritas.com> wrote:
> But the Linux MADV_DONTNEED does throw away
> data from a PROT_WRITE,MAP_PRIVATE mapping (or brk or stack) - those
> changes are discarded, and a subsequent access will revert to zeroes
> or the underlying mapped file.  Been like that since before 2.4.0.

I didn't say it changed.  I'm just saying that there is a hole in the
current implementation: it does not allow POSIX_MADV_DONTNEED to be
implemented as anything but a no-op.  The POSIX_MADV_DONTNEED behavior
is useful, and IMO something should be added to allow implementing it.
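
For contrast with the Linux MADV_DONTNEED semantics, here is a minimal
sketch of what a caller may rely on today through posix_madvise().  The
function name is made up for illustration, and the claim that the call
is effectively a no-op describes glibc on Linux as far as I know, not
something POSIX requires.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Write `value` into a private anonymous page, give the
   POSIX_MADV_DONTNEED advice, and return the byte read back.
   Returns -1 if mmap or posix_madvise fails.  */
static int byte_after_posix_dontneed(unsigned char value)
{
  size_t len = 4096;
  unsigned char *p;
  int seen;

  assert(value != 0);  /* keeps the result distinguishable from a zeroed page */
  p = mmap(NULL, len, PROT_READ | PROT_WRITE,
           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return -1;
  p[0] = value;
  /* posix_madvise returns an error number directly, not -1/errno.  */
  if (posix_madvise(p, len, POSIX_MADV_DONTNEED) != 0) {
    munmap(p, len);
    return -1;
  }
  seen = p[0];  /* advice must not lose data, so this is still `value` */
  munmap(p, len);
  return seen;
}
```

The data survives because the advice is not allowed to change what the
program sees; a kernel primitive that writes back or swaps out the range
would let the library do something more useful than nothing here.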


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-21  7:24     ` Hugh Dickins
@ 2007-04-21 18:06       ` Rik van Riel
  0 siblings, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-21 18:06 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm

Hugh Dickins wrote:
> On Fri, 20 Apr 2007, Rik van Riel wrote:
>> Andrew Morton wrote:
>>
>>>   I do go on about that.  But we're adding page flags at about one per
>>>   year, and when we run out we're screwed - we'll need to grow the
>>>   pageframe.
>> If you want, I can take a look at folding this into the
>> ->mapping pointer.  I can guarantee you it won't be
>> pretty, though :)
> 
> Please don't.  If we're going to stuff another pageflag into there,
> let it be PageSwapCache, the natural partner of PageAnon, rather than
> whatever our latest pageflag happens to be. 

I looked at doing what Andrew wanted, and it did indeed not
look like the right thing to do.  The locking on page->mapping
is the kind of locking we want to avoid during zap_page_range
and in the pageout code.

I like your suggestion better.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-20 23:52       ` Rik van Riel
  2007-04-21  0:48         ` Eric Dumazet
  2007-04-21  7:12         ` Jakub Jelinek
@ 2007-04-22  2:36         ` Nick Piggin
  2007-04-22  2:50           ` Nick Piggin
                             ` (2 more replies)
  2 siblings, 3 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-22  2:36 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak

Rik van Riel wrote:
> Andrew Morton wrote:
> 
>> On Fri, 20 Apr 2007 17:38:06 -0400
>> Rik van Riel <riel@redhat.com> wrote:
>>
>>> Andrew Morton wrote:
>>>
>>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>>
>>>> - Nick's patch also will help this problem.  It could be that your patch
>>>>   no longer offers a 2x speedup when combined with Nick's patch.
>>>>
>>>>   It could well be that the combination of the two is even better, but it
>>>>   would be nice to firm that up a bit.
>>>
>>> I'll test that.
>>
>>
>> Thanks.
> 
> 
> Well, good news.
> 
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
> 
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
> 
> Here are the transactions/seconds for each combination:
> 
>    vanilla   new glibc  madv_free kernel   madv_free + mmap_sem
> threads
> 
> 1     610         609             596                545
> 2    1032        1136            1196               1200
> 4    1070        1128            2014               2024
> 8    1000        1088            1665               2087
> 16    779        1073            1310               1999


Does "new glibc" mean MADV_DONTNEED + kernel with mmap_sem patch?

The strange thing with your madv_free kernel is that it doesn't
help single-threaded performance at all. So that work to avoid
zeroing the new page is not a win at all there (maybe due to the
cache effects I was worried about?).

However MADV_FREE does improve scalability, which is interesting.
The most likely reason I can see why that may be the case is that
it avoids mmap_sem when faulting pages back in (I doubt it is due
to avoiding the page allocator, but maybe?).

So where is the down_write coming from in this workload, I wonder?
Heap management? What syscalls?

x86_64's rwsems are crap under heavy parallelism (even read-only),
as I fixed in my recent generic rwsems patch. I don't expect MySQL
to be such a mmap_sem microbenchmark, but I wonder how much this
would help?

What if we ran the private futexes patch to further cut down
mmap_sem contention?

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-22  2:36         ` Nick Piggin
@ 2007-04-22  2:50           ` Nick Piggin
  2007-04-22  6:31           ` Rik van Riel
  2007-04-23  4:28           ` Rik van Riel
  2 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-22  2:50 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Rik van Riel, Andrew Morton, linux-kernel, linux-mm, shak

Nick Piggin wrote:
> Rik van Riel wrote:
> 
>> Andrew Morton wrote:
>>
>>> On Fri, 20 Apr 2007 17:38:06 -0400
>>> Rik van Riel <riel@redhat.com> wrote:
>>>
>>>> Andrew Morton wrote:
>>>>
>>>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>>>
>>>>> - Nick's patch also will help this problem.  It could be that your patch
>>>>>   no longer offers a 2x speedup when combined with Nick's patch.
>>>>>
>>>>>   It could well be that the combination of the two is even better, but it
>>>>>   would be nice to firm that up a bit.
>>>>
>>>>
>>>> I'll test that.
>>>
>>>
>>>
>>> Thanks.
>>
>>
>>
>> Well, good news.
>>
>> It turns out that Nick's patch does not improve peak
>> performance much, but it does prevent the decline when
>> running with 16 threads on my quad core CPU!
>>
>> We _definitely_ want both patches, there's a huge benefit
>> in having them both.
>>
>> Here are the transactions/seconds for each combination:
>>
>>    vanilla   new glibc  madv_free kernel   madv_free + mmap_sem
>> threads
>>
>> 1     610         609             596                545
>> 2    1032        1136            1196               1200
>> 4    1070        1128            2014               2024
>> 8    1000        1088            1665               2087
>> 16    779        1073            1310               1999
> 
> 
> 
> Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?
> 
> The strange thing with your madv_free kernel is that it doesn't
> help single-threaded performance at all. So that work to avoid
> zeroing the new page is not a win at all there (maybe due to the
> cache effects I was worried about?).
> 
> However MADV_FREE does improve scalability, which is interesting.
> The most likely reason I can see why that may be the case is that
> it avoids mmap_sem when faulting pages back in (I doubt it is due
> to avoiding the page allocator, but maybe?).
> 
> So where is the down_write coming from in this workload, I wonder?
> Heap management? What syscalls?
> 
> x86_64's rwsems are crap under heavy parallelism (even read-only),
> as I fixed in my recent generic rwsems patch. I don't expect MySQL
> to be such a mmap_sem microbenchmark, but I wonder how much this
> would help?
> 
> What if we ran the private futexes patch to further cut down
> mmap_sem contention?

Hmm, without the MADV_FREE patch, I wonder if it isn't doing something
silly like read-faulting in a ZERO_PAGE then write faulting a new page
straight afterwards.. I'll have to try a few tests.

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-22  2:36         ` Nick Piggin
  2007-04-22  2:50           ` Nick Piggin
@ 2007-04-22  6:31           ` Rik van Riel
  2007-04-23  0:16             ` Nick Piggin
  2007-04-23  4:28           ` Rik van Riel
  2 siblings, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2007-04-22  6:31 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak

Nick Piggin wrote:
> Rik van Riel wrote:
>> Andrew Morton wrote:
>>
>>> On Fri, 20 Apr 2007 17:38:06 -0400
>>> Rik van Riel <riel@redhat.com> wrote:
>>>
>>>> Andrew Morton wrote:
>>>>
>>>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>>>
>>>>> - Nick's patch also will help this problem.  It could be that your patch
>>>>>   no longer offers a 2x speedup when combined with Nick's patch.
>>>>>
>>>>>   It could well be that the combination of the two is even better, but it
>>>>>   would be nice to firm that up a bit.
>>>>
>>>> I'll test that.
>>>
>>>
>>> Thanks.
>>
>>
>> Well, good news.
>>
>> It turns out that Nick's patch does not improve peak
>> performance much, but it does prevent the decline when
>> running with 16 threads on my quad core CPU!
>>
>> We _definitely_ want both patches, there's a huge benefit
>> in having them both.
>>
>> Here are the transactions/seconds for each combination:
>>
>>    vanilla   new glibc  madv_free kernel   madv_free + mmap_sem
>> threads
>>
>> 1     610         609             596                545
>> 2    1032        1136            1196               1200
>> 4    1070        1128            2014               2024
>> 8    1000        1088            1665               2087
>> 16    779        1073            1310               1999
> 
> 
> Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?

No, that's just the glibc change, with a vanilla kernel.

The third column is the glibc change + the madv_free kernel patch.

The fourth column has your patch in it, too.

> The strange thing with your madv_free kernel is that it doesn't
> help single-threaded performance at all. So that work to avoid
> zeroing the new page is not a win at all there (maybe due to the
> cache effects I was worried about?).

Well, your patch causes the performance to drop from
596 transactions/second to 545.  Your patch is the only
difference between the third and the fourth column.

> However MADV_FREE does improve scalability, which is interesting.
> The most likely reason I can see why that may be the case is that
> it avoids mmap_sem when faulting pages back in (I doubt it is due
> to avoiding the page allocator, but maybe?).
> 
> So where is the down_write coming from in this workload, I wonder?
> Heap management? What syscalls?

I wonder if the increased parallelism simply caused
more cache line bouncing, with bounces happening in
some inner loop instead of an outer loop.

Btw, it is quite possible that the MySQL sysbench
thing gives different results on your system.  It
would be good to know what it does on a real SMP
system, vs. a single quad-core chip :)

Other architectures would be interesting to know,
too.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-17  7:15 [PATCH] lazy freeing of memory through MADV_FREE Rik van Riel
  2007-04-19 21:15 ` [PATCH] lazy freeing of memory through MADV_FREE 2/2 Rik van Riel
  2007-04-20 20:57 ` [PATCH] lazy freeing of memory through MADV_FREE Andrew Morton
@ 2007-04-22  8:18 ` Andrew Morton
  2007-04-22  9:16   ` Christoph Hellwig
  2 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2007-04-22  8:18 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, David S. Miller

On Tue, 17 Apr 2007 03:15:51 -0400 Rik van Riel <riel@redhat.com> wrote:

> Make it possible for applications to have the kernel free memory
> lazily.  This reduces a repeated free/malloc cycle from freeing
> pages and allocating them, to just marking them freeable.  If the
> application wants to reuse them before the kernel needs the memory,
> not even a page fault will happen.
> 
> This patch, together with Ulrich's glibc change, increases
> MySQL sysbench performance by a factor of 2 on my quad core
> test system.
> 

In file included from include/linux/mman.h:4,
                 from arch/sparc64/kernel/sys_sparc.c:19:
include/asm/mman.h:36:1: "MADV_FREE" redefined
In file included from include/asm/mman.h:5,
                 from include/linux/mman.h:4,
                 from arch/sparc64/kernel/sys_sparc.c:19:
include/asm-generic/mman.h:32:1: this is the location of the previous definition

sparc32 and sparc64 already defined MADV_FREE:


#define MADV_FREE       0x5             /* (Solaris) contents can be freed */

I'll remove the sparc definitions for now, but we need to work out what
we're going to do here.  Your patch changes the values of MADV_FREE on
sparc.

Perhaps this should be renamed to MADV_FREE_LINUX and given a different
number.  It depends on how close your proposed behaviour is to Solaris's.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-22  8:18 ` Andrew Morton
@ 2007-04-22  9:16   ` Christoph Hellwig
  2007-04-22 16:55     ` Ulrich Drepper
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2007-04-22  9:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, linux-kernel, linux-mm, David S. Miller

On Sun, Apr 22, 2007 at 01:18:10AM -0700, Andrew Morton wrote:
> On Tue, 17 Apr 2007 03:15:51 -0400 Rik van Riel <riel@redhat.com> wrote:
> 
> > Make it possible for applications to have the kernel free memory
> > lazily.  This reduces a repeated free/malloc cycle from freeing
> > pages and allocating them, to just marking them freeable.  If the
> > application wants to reuse them before the kernel needs the memory,
> > not even a page fault will happen.
> > 
> > This patch, together with Ulrich's glibc change, increases
> > MySQL sysbench performance by a factor of 2 on my quad core
> > test system.
> > 
> 
> In file included from include/linux/mman.h:4,
>                  from arch/sparc64/kernel/sys_sparc.c:19:
> include/asm/mman.h:36:1: "MADV_FREE" redefined
> In file included from include/asm/mman.h:5,
>                  from include/linux/mman.h:4,
>                  from arch/sparc64/kernel/sys_sparc.c:19:
> include/asm-generic/mman.h:32:1: this is the location of the previous definition
> 
> sparc32 and sparc64 already defined MADV_FREE:
> 
> 
> #define MADV_FREE       0x5             /* (Solaris) contents can be freed */
> 
> I'll remove the sparc definitions for now, but we need to work out what
> we're going to do here.  Your patch changes the values of MADV_FREE on
> sparc.
> 
> Perhaps this should be renamed to MADV_FREE_LINUX and given a different
> number.  It depends on how close your proposed behaviour is to Solaris's.

Why isn't MADV_FREE defined to 5 for linux?  It's our first free madv
value?  Also the behaviour should better match the one in solaris or BSD,
the last thing we need is slightly different behaviour from operating
systems supporting this for ages.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-22  9:16   ` Christoph Hellwig
@ 2007-04-22 16:55     ` Ulrich Drepper
  0 siblings, 0 replies; 43+ messages in thread
From: Ulrich Drepper @ 2007-04-22 16:55 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton, Rik van Riel, linux-kernel,
	linux-mm, David S. Miller

On 4/22/07, Christoph Hellwig <hch@infradead.org> wrote:
> Why isn't MADV_FREE defined to 5 for linux?  It's our first free madv
> value?  Also the behaviour should better match the one in solaris or BSD,
> the last thing we need is slightly different behaviour from operating
> systems supporting this for ages.

The behavior should indeed be identical.  Both implementations
restrict MADV_FREE to work on anonymous memory and it is unspecified
whether a renewed access yields a freshly zeroed page or
whether the old content is still there.  So, just use 0x5 for both the
Linux and Solaris version on sparc.
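
As a rough illustration of the semantics described above, the "old content
or zeroed page" outcome can be probed from userspace.  This is a minimal
sketch, assuming a libc whose <sys/mman.h> may or may not define MADV_FREE;
the helper name probe_lazy_free() is made up for this example and is not
part of the patch or of glibc.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map `len` bytes of anonymous memory, dirty the
 * first byte, advise the kernel that the contents may be discarded, then
 * touch the memory again.  Returns 1 if the first byte reads back as
 * either the old value or zero (both are allowed), 0 if it reads back as
 * anything else, and -1 if the mapping could not be set up.
 */
int probe_lazy_free(size_t len)
{
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int ok;

	if (p == MAP_FAILED)
		return -1;

	p[0] = 0x5a;
#ifdef MADV_FREE
	/* May fail with EINVAL on kernels without MADV_FREE; the page is
	 * then left untouched, which is still a valid outcome here. */
	(void)madvise(p, len, MADV_FREE);
#else
	/* Assumption: headers without MADV_FREE get the eager variant. */
	(void)madvise(p, len, MADV_DONTNEED);
#endif
	/* Which of the two values shows up is unspecified. */
	ok = (p[0] == 0x5a || p[0] == 0);

	munmap(p, len);
	return ok;
}
```

A caller must not depend on either outcome: after the advice, the range
should be treated as uninitialised until it is written again.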

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-22  6:31           ` Rik van Riel
@ 2007-04-23  0:16             ` Nick Piggin
  2007-04-23  3:53               ` Rik van Riel
  0 siblings, 1 reply; 43+ messages in thread
From: Nick Piggin @ 2007-04-23  0:16 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak

Rik van Riel wrote:
> Nick Piggin wrote:
> 
>> Rik van Riel wrote:

>>> Here are the transactions/seconds for each combination:
>>>
>>>    vanilla   new glibc  madv_free kernel   madv_free + mmap_sem
>>> threads
>>>
>>> 1     610         609             596                545
>>> 2    1032        1136            1196               1200
>>> 4    1070        1128            2014               2024
>>> 8    1000        1088            1665               2087
>>> 16    779        1073            1310               1999
>>
>>
>>
>> Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?
> 
> 
> No, that's just the glibc change, with a vanilla kernel.

OK. That would be interesting to see with the mmap_sem change,
because that should increase scalability.


> The third column is glibc change + mmap_sem patch.
> 
> The fourth column has your patch in it, too.
> 
>> The strange thing with your madv_free kernel is that it doesn't
>> help single-threaded performance at all. So that work to avoid
>> zeroing the new page is not a win at all there (maybe due to the
>> cache effects I was worried about?).
> 
> 
> Well, your patch causes the performance to drop from
> 596 transactions/second to 545.  Your patch is the only
> difference between the third and the fourth column.

Yeah. That's funny, because it means either there is some
contention on the mmap_sem (or ptl) at 1 thread, or that my
patch alters the uncontended performance.


>> However MADV_FREE does improve scalability, which is interesting.
>> The most likely reason I can see why that may be the case is that
>> it avoids mmap_sem when faulting pages back in (I doubt it is due
>> to avoiding the page allocator, but maybe?).
>>
>> So where is the down_write coming from in this workload, I wonder?
>> Heap management? What syscalls?
> 
> 
> I wonder if the increased parallelism simply caused
> more cache line bouncing, with bounces happening in
> some inner loop instead of an outer loop.
> 
> Btw, it is quite possible that the MySQL sysbench
> thing gives different results on your system.  It
> would be good to know what it does on a real SMP
> system, vs. a single quad-core chip :)
> 
> Other architectures would be interesting to know,
> too.

I don't see why parallelism should come into it at 1 thread, unless
MySQL is parallelising individual transactions. Anyway, I'll try to do
some more digging.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-23  0:16             ` Nick Piggin
@ 2007-04-23  3:53               ` Rik van Riel
  2007-04-23  3:58                 ` Nick Piggin
  2007-04-23  3:59                 ` Rik van Riel
  0 siblings, 2 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-23  3:53 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Nick Piggin wrote:
> Rik van Riel wrote:
>> Nick Piggin wrote:
>>
>>> Rik van Riel wrote:
> 
>>>> Here are the transactions/seconds for each combination:

I've added a 5th column, with just your mmap_sem patch and
without my madv_free patch.  It is run with the glibc patch,
which should make it fall back to MADV_DONTNEED after the
first MADV_FREE call fails.
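
The fallback mentioned above could look roughly like the following.  This
is only a sketch of the idea, not the actual glibc change; release_range()
and the lazy_free_unsupported flag are hypothetical names, and the sketch
assumes that a kernel without MADV_FREE rejects it with EINVAL.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifdef MADV_FREE
/*
 * Hypothetical flag: set once the kernel has rejected MADV_FREE.
 * A real allocator would make this an atomic; a plain int keeps the
 * sketch short.
 */
static int lazy_free_unsupported;
#endif

/*
 * Hypothetical allocator helper: hand a page-aligned range back to the
 * kernel, preferring lazy freeing.  Returns 0 on success and -1 (with
 * errno set by madvise) on failure.
 */
int release_range(void *addr, size_t len)
{
#ifdef MADV_FREE
	if (!lazy_free_unsupported) {
		if (madvise(addr, len, MADV_FREE) == 0)
			return 0;
		if (errno != EINVAL)
			return -1;
		/* Treat EINVAL as "unknown advice" and stop trying.  It can
		 * also mean a bad range; the sketch does not distinguish. */
		lazy_free_unsupported = 1;
	}
#endif
	return madvise(addr, len, MADV_DONTNEED);
}
```

Remembering the first failure means that an old kernel costs one wasted
system call per process rather than one per free.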

>>>>    vanilla   new glibc  madv_free kernel   madv_free + mmap_sem  mmap_sem
>>>> threads
>>>>
>>>> 1     610         609             596                545         534
>>>> 2    1032        1136            1196               1200        1180
>>>> 4    1070        1128            2014               2024        2027
>>>> 8    1000        1088            1665               2087        2089
>>>> 16    779        1073            1310               1999        2012

Not doing the mprotect calls is the big one I guess, especially
the fact that we don't need to take the mmap_sem for writing.

With both our patches, single and two thread performance with
MySQL sysbench is somewhat better than with just your patch,
4 and 8 thread performance are basically the same and just
your patch gives a slight benefit with 16 threads.

I guess I should benchmark up to 64 or 128 threads tomorrow,
to see if this is just luck or if the cache benefit of doing
the page faults and reusing hot pages is faster than not
having page faults at all.

I should run some benchmarks on other systems, too.  Some of
these results could be an artifact of my quad core CPU.  The
results could be very different on other systems...

> Yeah. That's funny, because it means either there is some
> contention on the mmap_sem (or ptl) at 1 thread, or that my
> patch alters the uncontended performance.

Maybe MySQL has various different threads to do
different tasks.  Something to look into...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-23  3:53               ` Rik van Riel
@ 2007-04-23  3:58                 ` Nick Piggin
  2007-04-23 10:07                   ` Nick Piggin
  2007-04-23  3:59                 ` Rik van Riel
  1 sibling, 1 reply; 43+ messages in thread
From: Nick Piggin @ 2007-04-23  3:58 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Rik van Riel wrote:

> I've added a 5th column, with just your mmap_sem patch and
> without my madv_free patch.  It is run with the glibc patch,
> which should make it fall back to MADV_DONTNEED after the
> first MADV_FREE call fails.

Thanks! (I edited slightly so it doesn't wrap)


>   vanilla   new glibc   madv_free    mmap_sem        both
> threads
>
> 1     610         609         596         534         545
> 2    1032        1136        1196        1180        1200
> 4    1070        1128        2014        2027        2024
> 8    1000        1088        1665        2089        2087
> 16    779        1073        1310        2012        1999
> 
> 
> Not doing the mprotect calls is the big one I guess, especially
> the fact that we don't need to take the mmap_sem for writing.

Yes.


> With both our patches, single and two thread performance with
> MySQL sysbench is somewhat better than with just your patch,
> 4 and 8 thread performance are basically the same and just
> your patch gives a slight benefit with 16 threads.
> 
> I guess I should benchmark up to 64 or 128 threads tomorrow,
> to see if this is just luck or if the cache benefit of doing
> the page faults and reusing hot pages is faster than not
> having page faults at all.
> 
> I should run some benchmarks on other systems, too.  Some of
> these results could be an artifact of my quad core CPU.  The
> results could be very different on other systems...

I'm getting the 16 core box out of retirement as we speak :)

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-23  3:53               ` Rik van Riel
  2007-04-23  3:58                 ` Nick Piggin
@ 2007-04-23  3:59                 ` Rik van Riel
  2007-04-23  9:20                   ` Rik van Riel
  1 sibling, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2007-04-23  3:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Rik van Riel wrote:
> Nick Piggin wrote:
>> Rik van Riel wrote:
>>> Nick Piggin wrote:
>>>
>>>> Rik van Riel wrote:
>>
>>>>> Here are the transactions/seconds for each combination:
> 
> I've added a 5th column, with just your mmap_sem patch and
> without my madv_free patch.  It is run with the glibc patch,
> which should make it fall back to MADV_DONTNEED after the
> first MADV_FREE call fails.
> 
>>>>>    vanilla   new glibc  madv_free kernel   madv_free + mmap_sem  mmap_sem
>>>>> threads
>>>>>
>>>>> 1     610         609             596                545         534
>>>>> 2    1032        1136            1196               1200        1180
>>>>> 4    1070        1128            2014               2024        2027
>>>>> 8    1000        1088            1665               2087        2089
>>>>> 16    779        1073            1310               1999        2012

Now that I think about it - this is all with the rawhide kernel
configuration, which has an ungodly number of debug config
options enabled.

I should try this with a more normal kernel, on various different
systems.

It would also be helpful if other people tried this same benchmark,
and others, on their systems.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-22  2:36         ` Nick Piggin
  2007-04-22  2:50           ` Nick Piggin
  2007-04-22  6:31           ` Rik van Riel
@ 2007-04-23  4:28           ` Rik van Riel
  2 siblings, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-23  4:28 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak

Nick Piggin wrote:

> So where is the down_write coming from in this workload, I wonder?
> Heap management? What syscalls?

Trying to answer this question, I straced the mysql threads that
showed up in top when running a single threaded sysbench workload.

There were no mmap, munmap, brk, mprotect or madvise system calls
in the trace.

MySQL has me puzzled, but it seems to have some other people
interested too.

I think I'll go play a bit with ebizzy now, to see how other
workloads are affected by our kernel changes.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-21  7:12         ` Jakub Jelinek
@ 2007-04-23  4:36           ` Nick Piggin
  0 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-23  4:36 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Rik van Riel, Andrew Morton, linux-kernel, linux-mm, shak

Jakub Jelinek wrote:
> On Fri, Apr 20, 2007 at 07:52:44PM -0400, Rik van Riel wrote:
> 
>>It turns out that Nick's patch does not improve peak
>>performance much, but it does prevent the decline when
>>running with 16 threads on my quad core CPU!
>>
>>We _definitely_ want both patches, there's a huge benefit
>>in having them both.
>>
>>Here are the transactions/seconds for each combination:
>>
>>   vanilla   new glibc  madv_free kernel   madv_free + mmap_sem
>>threads
>>
>>1     610         609             596                545
>>2    1032        1136            1196               1200
>>4    1070        1128            2014               2024
>>8    1000        1088            1665               2087
>>16    779        1073            1310               1999
> 
> 
> FYI, I have uploaded a testing glibc that uses MADV_FREE and falls back
> to MADV_DONTNEED if MADV_FREE is not available, to
> http://people.redhat.com/jakub/glibc/2.5.90-21.1/

Hmm, I wonder how glibc malloc stacks up to tcmalloc on this test
(after the mmap_sem patch as well).

I'll try running that as well!

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-23  3:59                 ` Rik van Riel
@ 2007-04-23  9:20                   ` Rik van Riel
  2007-04-23 10:21                     ` Nick Piggin
  2007-04-23 11:45                     ` Rik van Riel
  0 siblings, 2 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-23  9:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

[-- Attachment #1: Type: text/plain, Size: 1961 bytes --]

Use TLB batching for MADV_FREE.  Adds another 10-15% extra performance
to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
Rik van Riel wrote:

>> I've added a 5th column, with just your mmap_sem patch and
>> without my madv_free patch.  It is run with the glibc patch,
>> which should make it fall back to MADV_DONTNEED after the
>> first MADV_FREE call fails.

With the attached patch to make MADV_FREE use tlb batching, not
only do we gain an additional 10-15% performance but Nick's
mmap_sem patch also shows the performance increase that we
expected to see.

It looks like the tlb flushes (and IPIs) from zap_pte_range()
could have been the problem.  They're gone now.

The second column from the right has Nick's patch and my own
two patches.  Performance with 16 threads is almost triple what
it used to be...

threads  vanilla  glibc  glibc      glibc      glibc     glibc      glibc
                         madv_free  madv_free  mmap_sem  madv_free  madv_free
                                    mmap_sem             mmap_sem   tlb_batch
                                                         tlb_batch

      1      610    609        596        545       534        547        537
      2     1032   1136       1196       1200      1180       1293       1194
      4     1070   1128       2014       2024      2027       2248       2040
      8     1000   1088       1665       2087      2089       2314       1869
     16      779   1073       1310       1999      2012       2214       1557
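
The ratios behind the remarks above can be checked from the table itself.
The sketch below copies the numbers verbatim; the enum and function names
are ad hoc.  At 16 threads, 2214 / 779 is about 2.84 ("almost triple"),
and at 8 threads the tlb batching column against madv_free + mmap_sem is
2314 / 2087, about 1.11.

```c
#include <stddef.h>

/* Column order of the table, left to right (names made up for this sketch). */
enum tps_column {
	COL_VANILLA,
	COL_GLIBC,
	COL_MADV_FREE,
	COL_MADV_FREE_MMAP_SEM,
	COL_MMAP_SEM,
	COL_MADV_FREE_MMAP_SEM_TLB,
	COL_MADV_FREE_TLB,
	COL_COUNT
};

static const int thread_counts[] = { 1, 2, 4, 8, 16 };

/* Transactions/second, copied from the table. */
static const double tps[][COL_COUNT] = {
	{  610,  609,  596,  545,  534,  547,  537 },
	{ 1032, 1136, 1196, 1200, 1180, 1293, 1194 },
	{ 1070, 1128, 2014, 2024, 2027, 2248, 2040 },
	{ 1000, 1088, 1665, 2087, 2089, 2314, 1869 },
	{  779, 1073, 1310, 1999, 2012, 2214, 1557 },
};

/*
 * Throughput of column `col` relative to column `base_col` at the given
 * thread count.  Returns 0.0 for a thread count that was not measured
 * or a column index that is out of range.
 */
double tps_ratio(int threads, int col, int base_col)
{
	size_t i;

	if (col < 0 || col >= COL_COUNT ||
	    base_col < 0 || base_col >= COL_COUNT)
		return 0.0;

	for (i = 0; i < sizeof(thread_counts) / sizeof(thread_counts[0]); i++)
		if (thread_counts[i] == threads)
			return tps[i][col] / tps[i][base_col];

	return 0.0;
}
```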


> Now that I think about it - this is all with the rawhide kernel
> configuration, which has an ungodly number of debug config
> options enabled.
> 
> I should try this with a more normal kernel, on various different
> systems.

This is for another day. :)

First some ebizzy runs...

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

[-- Attachment #2: linux-2.6-madv_free-lazytlb.patch --]
[-- Type: text/x-patch, Size: 690 bytes --]

--- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.000000000 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.000000000 -0400
@@ -677,11 +677,15 @@ static unsigned long zap_pte_range(struc
 						remove_exclusive_swap_page(page);
 						unlock_page(page);
 					}
-					ptep_clear_flush_dirty(vma, addr, pte);
-					ptep_clear_flush_young(vma, addr, pte);
 					SetPageLazyFree(page);
 					if (PageActive(page))
 						deactivate_tail_page(page);
+					ptent = *pte;
+					set_pte_at(mm, addr, pte,
+						pte_mkclean(pte_mkold(ptent)));
+					/* tlb_remove_page frees it again */
+					get_page(page);
+					tlb_remove_page(tlb, page);
 					continue;
 				}
 			}

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-23  3:58                 ` Nick Piggin
@ 2007-04-23 10:07                   ` Nick Piggin
  2007-04-23 10:12                     ` Rik van Riel
  0 siblings, 1 reply; 43+ messages in thread
From: Nick Piggin @ 2007-04-23 10:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rik van Riel, Andrew Morton, linux-kernel, linux-mm, shak, jakub,
	drepper

Nick Piggin wrote:
> Rik van Riel wrote:
> 
>> I've added a 5th column, with just your mmap_sem patch and
>> without my madv_free patch.  It is run with the glibc patch,
>> which should make it fall back to MADV_DONTNEED after the
>> first MADV_FREE call fails.
> 
> 
> Thanks! (I edited slightly so it doesn't wrap)
> 
> 
>>   vanilla   new glibc   madv_free    mmap_sem        both
>> threads
>>
>> 1     610         609         596         534         545
>> 2    1032        1136        1196        1180        1200
>> 4    1070        1128        2014        2027        2024
>> 8    1000        1088        1665        2089        2087
>> 16    779        1073        1310        2012        1999
>>
>>
>> Not doing the mprotect calls is the big one I guess, especially
>> the fact that we don't need to take the mmap_sem for writing.
> 
> 
> Yes.
> 
> 
>> With both our patches, single and two thread performance with
>> MySQL sysbench is somewhat better than with just your patch,
>> 4 and 8 thread performance are basically the same and just
>> your patch gives a slight benefit with 16 threads.
>>
>> I guess I should benchmark up to 64 or 128 threads tomorrow,
>> to see if this is just luck or if the cache benefit of doing
>> the page faults and reusing hot pages is faster than not
>> having page faults at all.
>>
>> I should run some benchmarks on other systems, too.  Some of
>> these results could be an artifact of my quad core CPU.  The
>> results could be very different on other systems...
> 
> 
> I'm getting the 16 core box out of retirement as we speak :)
> 

OK, 10 runs at 1 client, 2.6.21-rc6, MySQL version 5.33, and Jakub's
new glibc give a 99.9% confidence interval of:

vanilla:  467.2 +/- 7.9 (tps)
mmap_sem: 470.5 +/- 9.3 (tps)

However, it seems those means jump around a bit from boot to boot,
so there could be some memory placement luck for cache and/or
NUMA goodness involved.

So I think it is safe to say that the mmap_sem patch doesn't hurt
single threaded performance (from looking at the numbers and the
patch). And that's the most important thing for that patch.
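
One quick way to read the two figures above is to check whether the
intervals overlap.  The sketch below does only that; it is a crude sanity
check rather than a formal significance test, and the struct and function
names are invented for the example.

```c
/* A mean with a symmetric error bar, e.g. 467.2 +/- 7.9 tps. */
struct tps_interval {
	double mean;
	double half_width;
};

/* Returns 1 if the two intervals share at least one point, else 0. */
int intervals_overlap(struct tps_interval a, struct tps_interval b)
{
	double a_lo = a.mean - a.half_width, a_hi = a.mean + a.half_width;
	double b_lo = b.mean - b.half_width, b_hi = b.mean + b.half_width;

	return a_lo <= b_hi && b_lo <= a_hi;
}
```

With vanilla at { 467.2, 7.9 } and mmap_sem at { 470.5, 9.3 } the ranges
are 459.3 to 475.1 and 461.2 to 479.8, so they overlap.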

I'll post some scalability results tomorrow. From my first round
of tests, after new glibc and the mmap_sem patch, it doesn't seem
like rwsem improvements, private futexes, or avoiding zero_page
make any significant differences.

I haven't tested your MADV_FREE patch yet.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-23 10:07                   ` Nick Piggin
@ 2007-04-23 10:12                     ` Rik van Riel
  0 siblings, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-23 10:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Nick Piggin wrote:

> I haven't tested your MADV_FREE patch yet.

Good.  It turned out that one behaved a bit strangely without tlb batching
anyway.

I'm now running ebizzy across the whole set of kernels I tested before,
and will post the results in a bit.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-23  9:20                   ` Rik van Riel
@ 2007-04-23 10:21                     ` Nick Piggin
  2007-04-23 10:31                       ` Rik van Riel
  2007-04-23 10:44                       ` Jakub Jelinek
  2007-04-23 11:45                     ` Rik van Riel
  1 sibling, 2 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-23 10:21 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Rik van Riel wrote:
> Use TLB batching for MADV_FREE.  Adds another 10-15% extra performance
> to the MySQL sysbench results on my quad core system.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> Rik van Riel wrote:
> 
>>> I've added a 5th column, with just your mmap_sem patch and
>>> without my madv_free patch.  It is run with the glibc patch,
>>> which should make it fall back to MADV_DONTNEED after the
>>> first MADV_FREE call fails.
> 
> 
> With the attached patch to make MADV_FREE use tlb batching, not
> only do we gain an additional 10-15% performance but Nick's
> mmap_sem patch also shows the performance increase that we
> expected to see.
> 
> It looks like the tlb flushes (and IPIs) from zap_pte_range()
> could have been the problem.  They're gone now.

I guess it is a good idea to batch these things. But can you
do that on all architectures? What happens if your tlb flush
happens after another thread already accesses it again, or
after it subsequently gets removed from the address space via
another CPU?

> 
> The second column from the right has Nick's patch and my own
> two patches.  Performance with 16 threads is almost triple what
> it used to be...
> 
> vanilla   glibc  glibc      glibc        glibc      glibc      glibc
>                  madv_free  madv_free               madv_free madv_free
>                             mmap_sem     mmap_sem   mmap_sem
>                                                     tlb batch  tlb_batch
> threads
> 
>  1     610     609     596         545         534     547     537
>  2    1032    1136    1196        1200        1180    1293    1194
>  4    1070    1128    2014        2024        2027    2248    2040
>  8    1000    1088    1665        2087        2089    2314    1869
>  16    779    1073    1310        1999        2012    2214    1557
> 
> 
>> Now that I think about it - this is all with the rawhide kernel
>> configuration, which has an ungodly number of debug config
>> options enabled.
>>
>> I should try this with a more normal kernel, on various different
>> systems.
> 
> 
> This is for another day. :)
> 
> First some ebizzy runs...
> 
> 
> ------------------------------------------------------------------------
> 
> --- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.000000000 -0400
> +++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.000000000 -0400
> @@ -677,11 +677,15 @@ static unsigned long zap_pte_range(struc
>  						remove_exclusive_swap_page(page);
>  						unlock_page(page);
>  					}
> -					ptep_clear_flush_dirty(vma, addr, pte);
> -					ptep_clear_flush_young(vma, addr, pte);
>  					SetPageLazyFree(page);
>  					if (PageActive(page))
>  						deactivate_tail_page(page);
> +					ptent = *pte;
> +					set_pte_at(mm, addr, pte,
> +						pte_mkclean(pte_mkold(ptent)));
> +					/* tlb_remove_page frees it again */
> +					get_page(page);
> +					tlb_remove_page(tlb, page);
>  					continue;
>  				}
>  			}


-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-23 10:21                     ` Nick Piggin
@ 2007-04-23 10:31                       ` Rik van Riel
  2007-04-23 10:35                         ` Nick Piggin
  2007-04-23 10:44                       ` Jakub Jelinek
  1 sibling, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2007-04-23 10:31 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Nick Piggin wrote:

>> It looks like the tlb flushes (and IPIs) from zap_pte_range()
>> could have been the problem.  They're gone now.
> 
> I guess it is a good idea to batch these things. But can you
> do that on all architectures? What happens if your tlb flush
> happens after another thread already accesses it again, or
> after it subsequently gets removed from the address space via
> another CPU?

I have thought about this a lot tonight, and have come to the conclusion
that they are ok.

The reason is simple:

1) we do the TLB flush before we return from the
    madvise(MADV_FREE) syscall.

2) anything that accesses the pages between the start
    and end of the MADV_FREE procedure does not know in
    which order we go through the pages, so it could hit
    a page either before or after we get to processing
    it

3) because of this, we can treat any such accesses as
    happening simultaneously with the MADV_FREE and
    as illegal, aka undefined behaviour territory and
    we do not need to worry about them

4) because we flush the tlb before releasing the page
    table lock, other CPUs cannot remove this page from
    the address space - they will block on the page
    table lock before looking at this pte

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-23 10:31                       ` Rik van Riel
@ 2007-04-23 10:35                         ` Nick Piggin
  2007-04-23 10:44                           ` Rik van Riel
  2007-04-24  2:53                           ` Rik van Riel
  0 siblings, 2 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-23 10:35 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Rik van Riel wrote:
> Nick Piggin wrote:
> 
>>> It looks like the tlb flushes (and IPIs) from zap_pte_range()
>>> could have been the problem.  They're gone now.
>>
>>
>> I guess it is a good idea to batch these things. But can you
>> do that on all architectures? What happens if your tlb flush
>> happens after another thread already accesses it again, or
>> after it subsequently gets removed from the address space via
>> another CPU?
> 
> 
> I have thought about this a lot tonight, and have come to the conclusion
> that they are ok.
> 
> The reason is simple:
> 
> 1) we do the TLB flush before we return from the
>    madvise(MADV_FREE) syscall.
> 
> 2) anything that accesses the pages between the start
>    and end of the MADV_FREE procedure does not know in
>    which order we go through the pages, so it could hit
>    a page either before or after we get to processing
>    it
> 
> 3) because of this, we can treat any such accesses as
>    happening simultaneously with the MADV_FREE and
>    as illegal, aka undefined behaviour territory and
>    we do not need to worry about them

Yes, but I'm wondering if it is legal in all architectures.

> 
> 4) because we flush the tlb before releasing the page
>    table lock, other CPUs cannot remove this page from
>    the address space - they will block on the page
>    table lock before looking at this pte

We don't when the ptl is split.

What the tlb flush used to be able to assume is that the page
has been removed from the pagetables when they are put in the
tlb flush batch.

I'm not saying there are any bugs, but just suggesting there
might be.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-23 10:21                     ` Nick Piggin
  2007-04-23 10:31                       ` Rik van Riel
@ 2007-04-23 10:44                       ` Jakub Jelinek
  1 sibling, 0 replies; 43+ messages in thread
From: Jakub Jelinek @ 2007-04-23 10:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rik van Riel, Andrew Morton, linux-kernel, linux-mm, shak, drepper

On Mon, Apr 23, 2007 at 08:21:37PM +1000, Nick Piggin wrote:
> I guess it is a good idea to batch these things. But can you
> do that on all architectures? What happens if your tlb flush
> happens after another thread already accesses it again, or
> after it subsequently gets removed from the address space via
> another CPU?

Accessing the page from another thread before madvise (MADV_FREE)
returns is undefined behavior: it can act as if that access happened
right before the madvise (MADV_FREE) call or right after it.
That's ok for glibc and presumably any other malloc implementation.
In glibc, madvise (MADV_FREE) is called while holding the containing
arena's lock, and in any malloc implementation madvise (MADV_FREE) would
be part of the free operation, so you definitely need some synchronization
between one thread freeing some memory and another thread deciding
to reuse that memory and return it from malloc/realloc/calloc/etc.
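
The locking discipline described above can be sketched in user space as
follows.  This is only an illustration, not glibc's code: struct arena,
arena_init(), arena_alloc() and arena_free() are hypothetical names, and
the sketch assumes a kernel and libc that define MADV_FREE (it falls back
to MADV_DONTNEED otherwise).

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS and madvise() */
#endif
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical arena: one lock guarding one page-aligned region. */
struct arena {
	pthread_mutex_t lock;
	void *base;             /* page-aligned start of the region */
	size_t len;             /* length of the region in bytes */
	int in_use;             /* nonzero while handed out to a caller */
};

/* Map an anonymous region and set up the arena; returns 0 on success. */
static int arena_init(struct arena *a, size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -1;
	a->base = p;
	a->len = len;
	a->in_use = 0;
	return pthread_mutex_init(&a->lock, NULL);
}

/* Hand the region out; returns NULL if it is already in use. */
static void *arena_alloc(struct arena *a)
{
	void *p = NULL;

	pthread_mutex_lock(&a->lock);
	if (!a->in_use) {
		a->in_use = 1;
		p = a->base;
	}
	pthread_mutex_unlock(&a->lock);
	return p;
}

/*
 * Release the region.  The madvise call is made with the arena lock
 * held, so no other thread can be handed this memory until the call
 * has returned.
 */
static int arena_free(struct arena *a)
{
	int ret;

	pthread_mutex_lock(&a->lock);
#ifdef MADV_FREE
	ret = madvise(a->base, a->len, MADV_FREE);
	if (ret != 0)           /* kernel without MADV_FREE support */
		ret = madvise(a->base, a->len, MADV_DONTNEED);
#else
	ret = madvise(a->base, a->len, MADV_DONTNEED);
#endif
	a->in_use = 0;
	pthread_mutex_unlock(&a->lock);
	return ret;
}
```

Because arena_alloc() takes the same lock, a second thread can only
receive the region after the madvise call has completed, which is the
synchronization the paragraph above relies on.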

My only concern is whether a non-atomic update of the pte is
ok or not.
The ptep_test_and_clear_young/ptep_test_and_clear_dirty calls that Rik's
patch used before are done with atomic instructions, at least on x86_64.
The operation we want for MADV_FREE is: clear the young/dirty bits if they
were set on entry to the MADV_FREE madvise call, and leave those 2 bits
with undefined values if some other task modifies them concurrently with
this MADV_FREE zap_page_range; but I'd say the other bits must stay
unmodified.
Now, is there some kernel code which modifies other bits in the pte
while either not holding the corresponding mmap_sem at all or holding it
just with down_read?  If yes, we need to do this clearing atomically:
basically do a cmpxchg loop until we succeed in clearing the 2 bits, and
then flush the tlb if either of them was set before
(ptep_test_and_clear_dirty_and_young?).  If not, set_pte_at is ok and
faster than a lock prefixed insn.
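
A minimal user-space model of the cmpxchg loop suggested above, using
C11 atomics on a plain 64-bit word.  It is not kernel code: the function
name follows the hypothetical ptep_test_and_clear_dirty_and_young, and
the PTE_YOUNG/PTE_DIRTY bit positions are illustrative assumptions (they
happen to match the x86 accessed and dirty bits).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions for the young (accessed) and dirty bits. */
#define PTE_YOUNG ((uint64_t)1 << 5)
#define PTE_DIRTY ((uint64_t)1 << 6)

/*
 * Atomically clear both bits and leave every other bit untouched.
 * Returns true if either bit was set, i.e. if a TLB flush would be
 * needed afterwards.
 */
static bool test_and_clear_dirty_and_young(_Atomic uint64_t *pte)
{
	uint64_t old = atomic_load(pte);
	uint64_t cleared;

	do {
		cleared = old & ~(PTE_YOUNG | PTE_DIRTY);
		/* On failure, 'old' is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak(pte, &old, cleared));

	return (old & (PTE_YOUNG | PTE_DIRTY)) != 0;
}
```

If another CPU changes any bit between the load and the exchange, the
exchange fails and the loop retries with the fresh value, so a
concurrent update to the other bits is never lost.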

	Jakub

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-23 10:35                         ` Nick Piggin
@ 2007-04-23 10:44                           ` Rik van Riel
  2007-04-24  1:15                             ` Nick Piggin
  2007-04-24  2:53                           ` Rik van Riel
  1 sibling, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2007-04-23 10:44 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

[-- Attachment #1: Type: text/plain, Size: 1847 bytes --]

Use TLB batching for MADV_FREE.  Adds another 10-15% extra performance
to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel <riel@redhat.com>
---

Nick Piggin wrote:

>> 3) because of this, we can treat any such accesses as
>>    happening simultaneously with the MADV_FREE and
>>    as illegal, aka undefined behaviour territory and
>>    we do not need to worry about them
> 
> Yes, but I'm wondering if it is legal in all architectures.

It's similar to trying to access memory during an munmap.

You may be able to for a short time, but it'll come back to
haunt you.

>> 4) because we flush the tlb before releasing the page
>>    table lock, other CPUs cannot remove this page from
>>    the address space - they will block on the page
>>    table lock before looking at this pte
> 
> We don't when the ptl is split.

Even then we do.  Each invocation of zap_pte_range() only touches
one page table page, and it flushes the TLB before releasing the
page table lock.

> What the tlb flush used to be able to assume is that the page
> has been removed from the pagetables when they are put in the
> tlb flush batch.

All the tlb flush code seems to assume is that the tlb entries
should be invalidated.

> I'm not saying there are any bugs, but just suggesting there
> might be.

Jakub found a potential bug, in that I did not use an atomic
operation to clear the page table entries.  I've attached a
new patch which simply uses ptep_test_and_clear_dirty/young
to get rid of the dirty and accessed bits.

It uses the same atomic accesses we use elsewhere in the VM
and the code is a line shorter than before.

Andrew, please use this one.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

[-- Attachment #2: linux-2.6-madv_free-lazytlb.patch --]
[-- Type: text/x-patch, Size: 697 bytes --]

--- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.000000000 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.000000000 -0400
@@ -677,11 +677,14 @@ static unsigned long zap_pte_range(struc
 						remove_exclusive_swap_page(page);
 						unlock_page(page);
 					}
-					ptep_clear_flush_dirty(vma, addr, pte);
-					ptep_clear_flush_young(vma, addr, pte);
+					ptep_test_and_clear_dirty(vma, addr, pte);
+					ptep_test_and_clear_young(vma, addr, pte);
 					SetPageLazyFree(page);
 					if (PageActive(page))
 						deactivate_tail_page(page);
+					/* tlb_remove_page frees it again */
+					get_page(page);
+					tlb_remove_page(tlb, page);
 					continue;
 				}
 			}

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-23  9:20                   ` Rik van Riel
  2007-04-23 10:21                     ` Nick Piggin
@ 2007-04-23 11:45                     ` Rik van Riel
  1 sibling, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-23 11:45 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Rik van Riel wrote:

> First some ebizzy runs...

This is interesting.  Ginormous speedups in ebizzy[1] on my quad core
test system.  The following numbers are the average of 10 runs, since
ebizzy shows some variability.

You can see a big influence from the tlb batching and from Nick's
madv_sem patch.  The reduction in system time from 100 seconds to
3 seconds is way more than I had expected, but I'm not complaining.
The 4 fold reduction in wall clock time is a nice bonus.

According to Val, ebizzy shows the weaknesses of Linux with a real
workload, so this could be a useful result.

kernel              user    system    wall clock    %CPU

vanilla             186s    101s      123s          230%
madv_free (madv)    175s     96s      120s          230%
mmap_sem (sem)      100s     40s       40s          370%
madv+sem            200s    140s      100s          393%
madv+sem+tlb        118s      3s       30s          395%
madv+tlb            150s     10s       50s          310%

[1] http://www.ussg.iu.edu/hypermail/linux/kernel/0604.2/1699.html
-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-23 10:44                           ` Rik van Riel
@ 2007-04-24  1:15                             ` Nick Piggin
  2007-04-24  1:58                               ` Rik van Riel
  0 siblings, 1 reply; 43+ messages in thread
From: Nick Piggin @ 2007-04-24  1:15 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Rik van Riel wrote:
> Use TLB batching for MADV_FREE.  Adds another 10-15% extra performance
> to the MySQL sysbench results on my quad core system.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> 
> Nick Piggin wrote:
> 
>>> 3) because of this, we can treat any such accesses as
>>>    happening simultaneously with the MADV_FREE and
>>>    as illegal, aka undefined behaviour territory and
>>>    we do not need to worry about them
>>
>>
>> Yes, but I'm wondering if it is legal in all architectures.
> 
> 
> It's similar to trying to access memory during an munmap.
> 
> You may be able to for a short time, but it'll come back to
> haunt you.

The question is whether the architecture specific tlb
flushing code will break or not.


>>> 4) because we flush the tlb before releasing the page
>>>    table lock, other CPUs cannot remove this page from
>>>    the address space - they will block on the page
>>>    table lock before looking at this pte
>>
>>
>> We don't when the ptl is split.
> 
> 
> Even then we do.  Each invocation of zap_pte_range() only touches
> one page table page, and it flushes the TLB before releasing the
> page table lock.

What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-24  1:15                             ` Nick Piggin
@ 2007-04-24  1:58                               ` Rik van Riel
  2007-04-24  2:16                                 ` Nick Piggin
  2007-04-24  4:42                                 ` Paul Mackerras
  0 siblings, 2 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-24  1:58 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

[-- Attachment #1: Type: text/plain, Size: 1458 bytes --]

This should fix the MADV_FREE code for PPC's hashed tlb.

Signed-off-by: Rik van Riel <riel@redhat.com>
---

Nick Piggin wrote:
>> Nick Piggin wrote:
>>
>>>> 3) because of this, we can treat any such accesses as
>>>>    happening simultaneously with the MADV_FREE and
>>>>    as illegal, aka undefined behaviour territory and
>>>>    we do not need to worry about them
>>>
>>>
>>> Yes, but I'm wondering if it is legal in all architectures.
>>
>>
>> It's similar to trying to access memory during an munmap.
>>
>> You may be able to for a short time, but it'll come back to
>> haunt you.
> 
> The question is whether the architecture specific tlb
> flushing code will break or not.

I guess we'll need to call tlb_remove_tlb_entry() inside the
MADV_FREE code to keep powerpc happy.

Thanks for pointing this one out.

>> Even then we do.  Each invocation of zap_pte_range() only touches
>> one page table page, and it flushes the TLB before releasing the
>> page table lock.
> 
> What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.

Oh dear.  I see it now...

The tlb end things inside zap_pte_range() are actually
noops and the actual tlb flush only happens inside
zap_page_range().
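
The deferred flush described above can be pictured with a toy model in
plain C.  This is only a sketch under stated assumptions: struct
toy_gather and the toy_* functions are hypothetical stand-ins, not the
kernel's mmu_gather code, and BATCH_MAX is an arbitrary capacity.  Pages
are queued as they are zapped, and a flush is counted only when the
batch fills up or when the whole range is finished.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 8     /* arbitrary batch capacity for the model */

/* Toy stand-in for a TLB gather structure. */
struct toy_gather {
	unsigned long pending[BATCH_MAX]; /* addresses awaiting a flush */
	size_t nr;                        /* number of pending addresses */
	unsigned long flushes;            /* flush operations issued */
};

/* One flush covers everything queued so far. */
static void toy_flush(struct toy_gather *g)
{
	if (g->nr == 0)
		return;
	g->flushes++;
	g->nr = 0;
}

/* Queue one page; flush only when the batch is full. */
static void toy_remove_page(struct toy_gather *g, unsigned long addr)
{
	g->pending[g->nr++] = addr;
	if (g->nr == BATCH_MAX)
		toy_flush(g);
}

/* Zap npages pages with batching; returns the number of flushes. */
static unsigned long toy_zap_batched(unsigned long start, size_t npages)
{
	struct toy_gather g = { .nr = 0, .flushes = 0 };

	for (size_t i = 0; i < npages; i++)
		toy_remove_page(&g, start + i * 4096UL);
	toy_flush(&g);  /* final flush once the whole range is done */
	return g.flushes;
}

/* Zap npages pages, flushing after every page: one flush per page. */
static unsigned long toy_zap_unbatched(size_t npages)
{
	return (unsigned long)npages;
}
```

In this model 20 pages cost 3 flushes when batched and 20 when flushed
one page at a time.  The price of batching is the window between
clearing an entry and the flush that covers it, which is what the
locking questions in this thread are about.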

I guess the fact that munmap gets the mmap_sem for
writing should save us, though...

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

[-- Attachment #2: linux-2.6-madv-ppcfix.patch --]
[-- Type: text/x-patch, Size: 453 bytes --]

--- linux-2.6.20.x86_64/mm/memory.c.noppc	2007-04-23 21:50:09.000000000 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 21:48:59.000000000 -0400
@@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc
 					}
 					ptep_test_and_clear_dirty(vma, addr, pte);
 					ptep_test_and_clear_young(vma, addr, pte);
+					tlb_remove_tlb_entry(tlb, pte, addr);
 					SetPageLazyFree(page);
 					if (PageActive(page))
 						deactivate_tail_page(page);

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-24  1:58                               ` Rik van Riel
@ 2007-04-24  2:16                                 ` Nick Piggin
  2007-04-24  4:42                                 ` Paul Mackerras
  1 sibling, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-24  2:16 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Rik van Riel wrote:
> This should fix the MADV_FREE code for PPC's hashed tlb.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> 
> Nick Piggin wrote:
> 
>>> Nick Piggin wrote:
>>>
>>>>> 3) because of this, we can treat any such accesses as
>>>>>    happening simultaneously with the MADV_FREE and
>>>>>    as illegal, aka undefined behaviour territory and
>>>>>    we do not need to worry about them
>>>>
>>>>
>>>>
>>>> Yes, but I'm wondering if it is legal in all architectures.
>>>
>>>
>>>
>>> It's similar to trying to access memory during an munmap.
>>>
>>> You may be able to for a short time, but it'll come back to
>>> haunt you.
>>
>>
>> The question is whether the architecture specific tlb
>> flushing code will break or not.
> 
> 
> I guess we'll need to call tlb_remove_tlb_entry() inside the
> MADV_FREE code to keep powerpc happy.
> 
> Thanks for pointing this one out.
> 
>>> Even then we do.  Each invocation of zap_pte_range() only touches
>>> one page table page, and it flushes the TLB before releasing the
>>> page table lock.
>>
>>
>> What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.
> 
> 
> Oh dear.  I see it now...
> 
> The tlb end things inside zap_pte_range() are actually
> noops and the actual tlb flush only happens inside
> zap_page_range().
> 
> I guess the fact that munmap gets the mmap_sem for
> writing should save us, though...

What about an unmap_mapping_range, or another MADV_FREE or
MADV_DONTNEED?

> 
> 
> ------------------------------------------------------------------------
> 
> --- linux-2.6.20.x86_64/mm/memory.c.noppc	2007-04-23 21:50:09.000000000 -0400
> +++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 21:48:59.000000000 -0400
> @@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc
>  					}
>  					ptep_test_and_clear_dirty(vma, addr, pte);
>  					ptep_test_and_clear_young(vma, addr, pte);
> +					tlb_remove_tlb_entry(tlb, pte, addr);
>  					SetPageLazyFree(page);
>  					if (PageActive(page))
>  						deactivate_tail_page(page);


-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-23 10:35                         ` Nick Piggin
  2007-04-23 10:44                           ` Rik van Riel
@ 2007-04-24  2:53                           ` Rik van Riel
  2007-04-24  3:08                             ` Andrew Morton
  1 sibling, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2007-04-24  2:53 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

[-- Attachment #1: Type: text/plain, Size: 838 bytes --]

Nick Piggin wrote:

> What the tlb flush used to be able to assume is that the page
> has been removed from the pagetables when they are put in the
> tlb flush batch.

I think this is still the case, to a degree.  There should be
no harm in removing the TLB entries after the page table has
been unlocked, right?

Or is something like the attached really needed?

From what I can see, the page table lock should be enough
synchronization between unmap_mapping_range, MADV_FREE and
MADV_DONTNEED.

I don't see why we need the attached, but in case you find
a good reason, here's my signed-off-by line for Andrew :)

Signed-off-by: Rik van Riel <riel@redhat.com>

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

[-- Attachment #2: linux-2.6-madv_free-flushme.patch --]
[-- Type: text/x-patch, Size: 750 bytes --]

--- linux-2.6.20.x86_64/mm/memory.c.flushme	2007-04-23 22:26:06.000000000 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 22:42:06.000000000 -0400
@@ -628,6 +628,7 @@ static unsigned long zap_pte_range(struc
 				long *zap_work, struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
+	unsigned long start_addr = addr;
 	pte_t *pte;
 	spinlock_t *ptl;
 	int file_rss = 0;
@@ -726,6 +727,11 @@ static unsigned long zap_pte_range(struc
 
 	add_mm_rss(mm, file_rss, anon_rss);
 	arch_leave_lazy_mmu_mode();
+	if (details && details->madv_free) {
+		/* Protect against MADV_DONTNEED or unmap_mapping_range */
+		tlb_finish_mmu(tlb, start_addr, addr);
+		tlb = tlb_gather_mmu(mm, 0);
+	}
 	pte_unmap_unlock(pte - 1, ptl);
 
 	return addr;

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-24  2:53                           ` Rik van Riel
@ 2007-04-24  3:08                             ` Andrew Morton
  0 siblings, 0 replies; 43+ messages in thread
From: Andrew Morton @ 2007-04-24  3:08 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Nick Piggin, linux-kernel, linux-mm, shak, jakub, drepper

On Mon, 23 Apr 2007 22:53:49 -0400 Rik van Riel <riel@redhat.com> wrote:

> I don't see why we need the attached, but in case you find
> a good reason, here's my signed-off-by line for Andrew :)

Andrew is in a defensive crouch trying to work his way through all the bugs
he's been sent.  After I've managed to release 2.6.21-rc7-mm1 (say, December)
I expect I'll drop the MADV_FREE stuff, give you a run at creating a new
patch series.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-24  1:58                               ` Rik van Riel
  2007-04-24  2:16                                 ` Nick Piggin
@ 2007-04-24  4:42                                 ` Paul Mackerras
  2007-04-24  5:13                                   ` Rik van Riel
  1 sibling, 1 reply; 43+ messages in thread
From: Paul Mackerras @ 2007-04-24  4:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Rik van Riel writes:

> I guess we'll need to call tlb_remove_tlb_entry() inside the
> MADV_FREE code to keep powerpc happy.

I don't see why; once ptep_test_and_clear_young has returned, the
entry in the hash table has already been removed.  Adding the
tlb_remove_tlb_entry call certainly won't do anything on 64-bit
powerpc, since it expands to do {} while (0) there, and in fact it
won't do anything on 32-bit powerpc either.

Paul.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] lazy freeing of memory through MADV_FREE
  2007-04-24  4:42                                 ` Paul Mackerras
@ 2007-04-24  5:13                                   ` Rik van Riel
  0 siblings, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-24  5:13 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Paul Mackerras wrote:
> Rik van Riel writes:
> 
>> I guess we'll need to call tlb_remove_tlb_entry() inside the
>> MADV_FREE code to keep powerpc happy.
> 
> I don't see why; once ptep_test_and_clear_young has returned, the
> entry in the hash table has already been removed. 

OK, so this one won't be necessary. Good to know that.

Andrew, it looks like things won't be that bad :)

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2007-04-24  5:13 UTC | newest]

Thread overview: 43+ messages
2007-04-17  7:15 [PATCH] lazy freeing of memory through MADV_FREE Rik van Riel
2007-04-19 21:15 ` [PATCH] lazy freeing of memory through MADV_FREE 2/2 Rik van Riel
2007-04-20 21:03   ` Andrew Morton
2007-04-20 21:24     ` Ulrich Drepper
2007-04-21  7:37       ` Hugh Dickins
2007-04-21 16:32         ` Ulrich Drepper
2007-04-20 20:57 ` [PATCH] lazy freeing of memory through MADV_FREE Andrew Morton
2007-04-20 21:38   ` Rik van Riel
2007-04-20 22:06     ` Andrew Morton
2007-04-20 23:52       ` Rik van Riel
2007-04-21  0:48         ` Eric Dumazet
2007-04-21  3:58           ` Rik van Riel
2007-04-21  7:12         ` Jakub Jelinek
2007-04-23  4:36           ` Nick Piggin
2007-04-22  2:36         ` Nick Piggin
2007-04-22  2:50           ` Nick Piggin
2007-04-22  6:31           ` Rik van Riel
2007-04-23  0:16             ` Nick Piggin
2007-04-23  3:53               ` Rik van Riel
2007-04-23  3:58                 ` Nick Piggin
2007-04-23 10:07                   ` Nick Piggin
2007-04-23 10:12                     ` Rik van Riel
2007-04-23  3:59                 ` Rik van Riel
2007-04-23  9:20                   ` Rik van Riel
2007-04-23 10:21                     ` Nick Piggin
2007-04-23 10:31                       ` Rik van Riel
2007-04-23 10:35                         ` Nick Piggin
2007-04-23 10:44                           ` Rik van Riel
2007-04-24  1:15                             ` Nick Piggin
2007-04-24  1:58                               ` Rik van Riel
2007-04-24  2:16                                 ` Nick Piggin
2007-04-24  4:42                                 ` Paul Mackerras
2007-04-24  5:13                                   ` Rik van Riel
2007-04-24  2:53                           ` Rik van Riel
2007-04-24  3:08                             ` Andrew Morton
2007-04-23 10:44                       ` Jakub Jelinek
2007-04-23 11:45                     ` Rik van Riel
2007-04-23  4:28           ` Rik van Riel
2007-04-21  7:24     ` Hugh Dickins
2007-04-21 18:06       ` Rik van Riel
2007-04-22  8:18 ` Andrew Morton
2007-04-22  9:16   ` Christoph Hellwig
2007-04-22 16:55     ` Ulrich Drepper
