* [PATCH] lazy freeing of memory through MADV_FREE
@ 2007-04-17 7:15 Rik van Riel
2007-04-19 21:15 ` [PATCH] lazy freeing of memory through MADV_FREE 2/2 Rik van Riel
` (2 more replies)
0 siblings, 3 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-17 7:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-mm
[-- Attachment #1: Type: text/plain, Size: 1024 bytes --]
Make it possible for applications to have the kernel free memory
lazily. This reduces a repeated free/malloc cycle from freeing
pages and allocating them, to just marking them freeable. If the
application wants to reuse them before the kernel needs the memory,
not even a page fault will happen.
This patch, together with Ulrich's glibc change, increases
MySQL sysbench performance by a factor of 2 on my quad core
test system.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
Ulrich Drepper has test glibc RPMS for this functionality at:
http://people.redhat.com/drepper/rpms
Andrew, I have stress tested this patch for a few days now and
have not been able to find any more bugs. I believe it is ready
to be merged in -mm, and upstream at the next merge window.
When the patch goes upstream, I will submit a small follow-up
patch to revert MADV_DONTNEED behaviour to what it did previously
and have the new behaviour trigger only on MADV_FREE: at that
point people will have to get new test RPMs of glibc.
[-- Attachment #2: linux-2.6.21-rc6-mm1-madv_free.patch --]
[-- Type: text/x-patch, Size: 11514 bytes --]
--- linux-2.6.21-rc6-mm1/include/asm-parisc/mman.h.madv_free 2007-04-17 02:17:19.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-parisc/mman.h 2007-04-17 02:22:46.000000000 -0400
@@ -38,6 +38,7 @@
#define MADV_SPACEAVAIL 5 /* insure that resources are reserved */
#define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */
#define MADV_VPS_INHERIT 7 /* Inherit parents page size */
+#define MADV_FREE 8 /* don't need the pages or the data */
/* common/generic parameters */
#define MADV_REMOVE 9 /* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/asm-mips/mman.h.madv_free 2007-04-17 02:17:19.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-mips/mman.h 2007-04-17 02:22:46.000000000 -0400
@@ -65,6 +65,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* don't need the pages or the data */
/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/asm-xtensa/mman.h.madv_free 2007-04-17 02:17:19.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-xtensa/mman.h 2007-04-17 02:22:46.000000000 -0400
@@ -72,6 +72,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* don't need the pages or the data */
/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/linux/swap.h.madv_free 2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/linux/swap.h 2007-04-17 02:22:46.000000000 -0400
@@ -182,6 +182,7 @@ extern void FASTCALL(lru_cache_add(struc
extern void FASTCALL(lru_cache_add_active(struct page *));
extern void FASTCALL(lru_cache_add_tail(struct page *));
extern void FASTCALL(activate_page(struct page *));
+extern void FASTCALL(deactivate_tail_page(struct page *));
extern void FASTCALL(mark_page_accessed(struct page *));
extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
--- linux-2.6.21-rc6-mm1/include/linux/mm.h.madv_free 2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/linux/mm.h 2007-04-17 02:22:46.000000000 -0400
@@ -767,6 +767,7 @@ struct zap_details {
pgoff_t last_index; /* Highest page->index to unmap */
spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */
unsigned long truncate_count; /* Compare vm_truncate_count */
+ short madv_free; /* MADV_FREE anonymous memory */
};
struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
--- linux-2.6.21-rc6-mm1/include/linux/page-flags.h.madv_free 2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/linux/page-flags.h 2007-04-17 02:23:16.000000000 -0400
@@ -91,6 +91,7 @@
#define PG_booked 20 /* Has blocks reserved on-disk */
#define PG_readahead 21 /* Reminder to do read-ahead */
+#define PG_lazyfree 22 /* MADV_FREE potential throwaway */
/* PG_owner_priv_1 users should have descriptive aliases */
#define PG_checked PG_owner_priv_1 /* Used by some filesystems */
@@ -216,6 +217,11 @@ static inline void SetPageUptodate(struc
#define ClearPageReclaim(page) clear_bit(PG_reclaim, &(page)->flags)
#define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags)
+#define PageLazyFree(page) test_bit(PG_lazyfree, &(page)->flags)
+#define SetPageLazyFree(page) set_bit(PG_lazyfree, &(page)->flags)
+#define ClearPageLazyFree(page) clear_bit(PG_lazyfree, &(page)->flags)
+#define __ClearPageLazyFree(page) __clear_bit(PG_lazyfree, &(page)->flags)
+
#define PageCompound(page) test_bit(PG_compound, &(page)->flags)
#define __SetPageCompound(page) __set_bit(PG_compound, &(page)->flags)
#define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
--- linux-2.6.21-rc6-mm1/include/asm-alpha/mman.h.madv_free 2007-04-17 02:17:19.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-alpha/mman.h 2007-04-17 02:22:46.000000000 -0400
@@ -42,6 +42,7 @@
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_SPACEAVAIL 5 /* ensure resources are available */
#define MADV_DONTNEED 6 /* don't need these pages */
+#define MADV_FREE 7 /* don't need the pages or the data */
/* common/generic parameters */
#define MADV_REMOVE 9 /* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/asm-generic/mman.h.madv_free 2007-04-17 02:17:19.000000000 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-generic/mman.h 2007-04-17 02:22:46.000000000 -0400
@@ -29,6 +29,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* don't need the pages or the data */
/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/mm/memory.c.madv_free 2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/mm/memory.c 2007-04-17 02:22:46.000000000 -0400
@@ -432,6 +432,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
unsigned long vm_flags = vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
+ int dirty = 0;
/* pte contains position in swap or file, so copy. */
if (unlikely(!pte_present(pte))) {
@@ -466,6 +467,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
* in the parent and the child
*/
if (is_cow_mapping(vm_flags)) {
+ dirty = pte_dirty(pte);
ptep_set_wrprotect(src_mm, addr, src_pte);
pte = pte_wrprotect(pte);
}
@@ -483,6 +485,8 @@ copy_one_pte(struct mm_struct *dst_mm, s
get_page(page);
page_dup_rmap(page, vma, addr);
rss[!!PageAnon(page)]++;
+ if (dirty && PageLazyFree(page))
+ ClearPageLazyFree(page);
}
out_set_pte:
@@ -661,6 +665,25 @@ static unsigned long zap_pte_range(struc
(page->index < details->first_index ||
page->index > details->last_index))
continue;
+
+ /*
+ * MADV_FREE is used to lazily recycle
+ * anon memory. The process no longer
+ * needs the data and wants to avoid IO.
+ */
+ if (details->madv_free && PageAnon(page)) {
+ if (unlikely(PageSwapCache(page)) &&
+ !TestSetPageLocked(page)) {
+ remove_exclusive_swap_page(page);
+ unlock_page(page);
+ }
+ ptep_clear_flush_dirty(vma, addr, pte);
+ ptep_clear_flush_young(vma, addr, pte);
+ SetPageLazyFree(page);
+ if (PageActive(page))
+ deactivate_tail_page(page);
+ continue;
+ }
}
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
@@ -689,7 +713,8 @@ static unsigned long zap_pte_range(struc
* If details->check_mapping, we leave swap entries;
* if details->nonlinear_vma, we leave file entries.
*/
- if (unlikely(details))
+ if (unlikely(details && (details->check_mapping ||
+ details->nonlinear_vma)))
continue;
if (!pte_file(ptent))
free_swap_and_cache(pte_to_swp_entry(ptent));
@@ -755,7 +780,8 @@ static unsigned long unmap_page_range(st
pgd_t *pgd;
unsigned long next;
- if (details && !details->check_mapping && !details->nonlinear_vma)
+ if (details && !details->check_mapping && !details->nonlinear_vma
+ && !details->madv_free)
details = NULL;
BUG_ON(addr >= end);
--- linux-2.6.21-rc6-mm1/mm/page_alloc.c.madv_free 2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/mm/page_alloc.c 2007-04-17 02:22:46.000000000 -0400
@@ -266,6 +266,7 @@ static void bad_page(struct page *page)
1 << PG_slab |
1 << PG_swapcache |
1 << PG_writeback |
+ 1 << PG_lazyfree |
1 << PG_buddy );
set_page_count(page, 0);
reset_page_mapcount(page);
@@ -514,6 +515,8 @@ static inline int free_pages_check(struc
bad_page(page);
if (PageDirty(page))
__ClearPageDirty(page);
+ if (PageLazyFree(page))
+ __ClearPageLazyFree(page);
/*
* For now, we report if PG_reserved was found set, but do not
* clear it, and do not free the page. But we shall soon need
@@ -661,6 +664,7 @@ static int prep_new_page(struct page *pa
1 << PG_swapcache |
1 << PG_writeback |
1 << PG_reserved |
+ 1 << PG_lazyfree |
1 << PG_buddy ))))
bad_page(page);
--- linux-2.6.21-rc6-mm1/mm/swap.c.madv_free 2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/mm/swap.c 2007-04-17 02:22:46.000000000 -0400
@@ -152,6 +152,20 @@ void fastcall activate_page(struct page
spin_unlock_irq(&zone->lru_lock);
}
+void fastcall deactivate_tail_page(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+
+ spin_lock_irq(&zone->lru_lock);
+ if (PageLRU(page) && PageActive(page)) {
+ del_page_from_active_list(zone, page);
+ ClearPageActive(page);
+ add_page_to_inactive_list_tail(zone, page);
+ __count_vm_event(PGDEACTIVATE);
+ }
+ spin_unlock_irq(&zone->lru_lock);
+}
+
/*
* Mark a page as having seen activity.
*
--- linux-2.6.21-rc6-mm1/mm/vmscan.c.madv_free 2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/mm/vmscan.c 2007-04-17 02:22:46.000000000 -0400
@@ -460,6 +460,24 @@ static unsigned long shrink_page_list(st
sc->nr_scanned++;
+ /*
+ * MADV_DONTNEED pages get reclaimed lazily, unless the
+ * process reuses it before we get to it.
+ */
+ if (PageLazyFree(page)) {
+ switch (try_to_unmap(page, 0)) {
+ case SWAP_FAIL:
+ ClearPageLazyFree(page);
+ goto activate_locked;
+ case SWAP_AGAIN:
+ ClearPageLazyFree(page);
+ goto keep_locked;
+ case SWAP_SUCCESS:
+ ClearPageLazyFree(page);
+ goto free_it;
+ }
+ }
+
if (!sc->may_swap && page_mapped(page))
goto keep_locked;
--- linux-2.6.21-rc6-mm1/mm/madvise.c.madv_free 2007-04-17 02:17:20.000000000 -0400
+++ linux-2.6.21-rc6-mm1/mm/madvise.c 2007-04-17 02:22:46.000000000 -0400
@@ -142,8 +142,12 @@ static long madvise_dontneed(struct vm_a
.last_index = ULONG_MAX,
};
zap_page_range(vma, start, end - start, &details);
- } else
- zap_page_range(vma, start, end - start, NULL);
+ } else {
+ struct zap_details details = {
+ .madv_free = 1,
+ };
+ zap_page_range(vma, start, end - start, &details);
+ }
return 0;
}
@@ -215,7 +219,9 @@ madvise_vma(struct vm_area_struct *vma,
error = madvise_willneed(vma, prev, start, end);
break;
+ /* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
case MADV_DONTNEED:
+ case MADV_FREE:
error = madvise_dontneed(vma, prev, start, end);
break;
--- linux-2.6.21-rc6-mm1/mm/rmap.c.madv_free 2007-04-17 02:17:43.000000000 -0400
+++ linux-2.6.21-rc6-mm1/mm/rmap.c 2007-04-17 02:22:46.000000000 -0400
@@ -707,7 +707,17 @@ static int try_to_unmap_one(struct page
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
- if (PageAnon(page)) {
+ /* MADV_FREE is used to lazily free memory from userspace. */
+ if (PageLazyFree(page) && !migration) {
+ /* There is new data in the page. Reinstate it. */
+ if (unlikely(pte_dirty(pteval))) {
+ set_pte_at(mm, address, pte, pteval);
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+ /* Throw the page away. */
+ dec_mm_counter(mm, anon_rss);
+ } else if (PageAnon(page)) {
swp_entry_t entry = { .val = page_private(page) };
if (PageSwapCache(page)) {
* Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
2007-04-17 7:15 [PATCH] lazy freeing of memory through MADV_FREE Rik van Riel
@ 2007-04-19 21:15 ` Rik van Riel
2007-04-20 21:03 ` Andrew Morton
2007-04-20 20:57 ` [PATCH] lazy freeing of memory through MADV_FREE Andrew Morton
2007-04-22 8:18 ` Andrew Morton
2 siblings, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2007-04-19 21:15 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: Andrew Morton, linux-kernel, linux-mm
[-- Attachment #1: Type: text/plain, Size: 459 bytes --]
Restore MADV_DONTNEED to its original Linux behaviour. This is still
not the same behaviour as POSIX, but applications may be depending on
the Linux behaviour already. Besides, glibc catches POSIX_MADV_DONTNEED
and makes sure nothing is done...
Signed-off-by: Rik van Riel <riel@redhat.com>
---
This is to be applied on top of the original MADV_FREE patch.
It turns out that the current glibc patch already falls back
to MADV_DONTNEED if it gets an -EINVAL.
[-- Attachment #2: linux-2.6-madv-dontneed-restore.patch --]
[-- Type: text/x-patch, Size: 1317 bytes --]
--- linux-2.6.20.x86_64/mm/madvise.c.madv_free 2007-04-19 16:46:22.000000000 -0400
+++ linux-2.6.20.x86_64/mm/madvise.c 2007-04-19 16:52:19.000000000 -0400
@@ -130,7 +130,8 @@ static long madvise_willneed(struct vm_a
*/
static long madvise_dontneed(struct vm_area_struct * vma,
struct vm_area_struct ** prev,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end,
+ int behavior)
{
*prev = vma;
if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
@@ -142,12 +143,14 @@ static long madvise_dontneed(struct vm_a
.last_index = ULONG_MAX,
};
zap_page_range(vma, start, end - start, &details);
- } else {
+ } else if (behavior == MADV_FREE) {
struct zap_details details = {
.madv_free = 1,
};
zap_page_range(vma, start, end - start, &details);
- }
+ } else /* behavior == MADV_DONTNEED */
+ zap_page_range(vma, start, end - start, NULL);
+
return 0;
}
@@ -219,10 +222,9 @@ madvise_vma(struct vm_area_struct *vma,
error = madvise_willneed(vma, prev, start, end);
break;
- /* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
case MADV_DONTNEED:
case MADV_FREE:
- error = madvise_dontneed(vma, prev, start, end);
+ error = madvise_dontneed(vma, prev, start, end, behavior);
break;
default:
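As a rough illustration of the fallback mentioned in this message (try MADV_FREE first, use MADV_DONTNEED when the kernel does not accept the new advice), here is a userspace sketch. It is not glibc's code: `release_range()` and `lazy_free_cycle()` are hypothetical names, the sketch falls back on any MADV_FREE failure instead of testing for EINVAL specifically, and the numeric value of MADV_FREE comes from the system headers, which may not match the values in this patch.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* madvise() and MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: tell the kernel the data in [addr, addr + len) is
 * no longer needed. Tries MADV_FREE when the headers define it, then falls
 * back to MADV_DONTNEED. Returns 0 on success, -1 on failure.
 */
static int release_range(void *addr, size_t len)
{
#ifdef MADV_FREE
    if (madvise(addr, len, MADV_FREE) == 0)
        return 0;
    /* Assumption: any failure here is treated like "advice unknown". */
#endif
    return madvise(addr, len, MADV_DONTNEED);
}

/*
 * One free/reuse cycle on a private anonymous mapping: fill it, release it,
 * then store to it again. A store made after the release must be visible,
 * whether or not the kernel reclaimed the page in between.
 * Returns 0 on success, -1 on failure.
 */
static int lazy_free_cycle(size_t len)
{
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int ret = -1;

    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0xaa, len);           /* memory in use by the application */
    if (release_range(buf, len) == 0) {
        buf[0] = 0x55;                /* reuse: the new data must stick */
        ret = (buf[0] == 0x55) ? 0 : -1;
    }
    munmap(buf, len);
    return ret;
}
```

Note that only data written after the release is reliable. Reading a released page without writing to it first may return either the old contents or zeroes, depending on whether the kernel has reclaimed it yet.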
* Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
2007-04-19 21:15 ` [PATCH] lazy freeing of memory through MADV_FREE 2/2 Rik van Riel
@ 2007-04-20 21:03 ` Andrew Morton
2007-04-20 21:24 ` Ulrich Drepper
0 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2007-04-20 21:03 UTC (permalink / raw)
To: Rik van Riel; +Cc: Jakub Jelinek, linux-kernel, linux-mm
On Thu, 19 Apr 2007 17:15:28 -0400 Rik van Riel <riel@redhat.com> wrote:

> Restore MADV_DONTNEED to its original Linux behaviour. This is still
> not the same behaviour as POSIX, but applications may be depending on
> the Linux behaviour already. Besides, glibc catches POSIX_MADV_DONTNEED
> and makes sure nothing is done...

OK, we need to flesh this out a lot please. People often get confused
about what our MADV_DONTNEED behaviour is. I regularly forget, then look at
the code, then get it wrong. That's for mainline, let alone older kernels
whose behaviour is gawd-knows-what.

So... For the changelog (and the manpage) could we please have a full
description of the 2.6.21 behaviour and the 2.6.21-post-rik behaviour (and
the 2.4 behaviour, if it differs at all)?

Also some code comments to demystify all of this once and for all?

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
2007-04-20 21:03 ` Andrew Morton
@ 2007-04-20 21:24 ` Ulrich Drepper
2007-04-21 7:37 ` Hugh Dickins
0 siblings, 1 reply; 43+ messages in thread
From: Ulrich Drepper @ 2007-04-20 21:24 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, Jakub Jelinek, linux-kernel, linux-mm
On 4/20/07, Andrew Morton <akpm@linux-foundation.org> wrote:

> OK, we need to flesh this out a lot please. People often get confused
> about what our MADV_DONTNEED behaviour is.

Well, there's not really much to flesh out. The current MADV_DONTNEED
is useful in some situations. The behavior cannot be changed; even
glibc will rely on it for the case when MADV_FREE is not supported.

What might be nice to have is a POSIX-compliant POSIX_MADV_DONTNEED
implementation. We currently do nothing, which is OK since no test
suite can detect that. But some code might want to use the real
behavior and we're missing an optimization possibility.

Just for reference: the current MADV_DONTNEED behavior is to throw away
data in the range. The POSIX_MADV_DONTNEED behavior is to never lose data.
I.e., file backed data is written back, anon data is at most swapped
out.
* Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
2007-04-20 21:24 ` Ulrich Drepper
@ 2007-04-21 7:37 ` Hugh Dickins
2007-04-21 16:32 ` Ulrich Drepper
0 siblings, 1 reply; 43+ messages in thread
From: Hugh Dickins @ 2007-04-21 7:37 UTC (permalink / raw)
To: Ulrich Drepper
Cc: Andrew Morton, Rik van Riel, Jakub Jelinek, linux-kernel, linux-mm
On Fri, 20 Apr 2007, Ulrich Drepper wrote:

>
> Just for reference: the current MADV_DONTNEED behavior is to throw away
> data in the range.

Not exactly. The Linux MADV_DONTNEED never throws away data from a
PROT_WRITE,MAP_SHARED mapping (or shm) - it propagates the dirty bit,
the page will eventually get written out to file, and can be retrieved
later by subsequent access.

But the Linux MADV_DONTNEED does throw away data from a
PROT_WRITE,MAP_PRIVATE mapping (or brk or stack) - those changes are
discarded, and a subsequent access will revert to zeroes or the
underlying mapped file.

Been like that since before 2.4.0.

> The POSIX_MADV_DONTNEED behavior is to never lose data.
> I.e., file backed data is written back, anon data is at most swapped
> out.
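The private-mapping case described in this message can be observed from userspace. The following is a minimal sketch, assuming a Linux system; `byte_after_dontneed()` is a hypothetical name. It dirties a private anonymous page, drops it with MADV_DONTNEED, and reads the first byte back.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* madvise() and MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns the first byte of a private anonymous page after the page was
 * dirtied and then dropped with MADV_DONTNEED, or -1 on setup failure.
 */
static int byte_after_dontneed(void)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *p;
    int value;

    if (page <= 0)
        return -1;

    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0x7f, (size_t)page);    /* private changes to the page */
    if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) {
        munmap(p, (size_t)page);
        return -1;
    }

    value = p[0];                     /* next access faults in a fresh page */
    munmap(p, (size_t)page);
    return value;
}
```

On Linux the function is expected to return 0 rather than 0x7f, because the private changes are discarded and the next access sees a zero-filled page. A MAP_SHARED file mapping would behave differently, as the message explains.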
* Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
2007-04-21 7:37 ` Hugh Dickins
@ 2007-04-21 16:32 ` Ulrich Drepper
0 siblings, 0 replies; 43+ messages in thread
From: Ulrich Drepper @ 2007-04-21 16:32 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Rik van Riel, Jakub Jelinek, linux-kernel, linux-mm
On 4/21/07, Hugh Dickins <hugh@veritas.com> wrote:

> But the Linux MADV_DONTNEED does throw away
> data from a PROT_WRITE,MAP_PRIVATE mapping (or brk or stack) - those
> changes are discarded, and a subsequent access will revert to zeroes
> or the underlying mapped file. Been like that since before 2.4.0.

I didn't say it changed. I am just saying that there is a hole in the
current implementation, as it does not allow implementing
POSIX_MADV_DONTNEED as anything but a no-op. The POSIX_MADV_DONTNEED
behavior is useful, and IMO something should be added to allow
implementing it.
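For contrast with the Linux MADV_DONTNEED behaviour discussed above, here is a small sketch of the POSIX interface. `byte_after_posix_dontneed()` is a hypothetical name. The assumption, taken from this thread, is that the C library treats POSIX_MADV_DONTNEED as advice that never loses data (a no-op in glibc), so the stored byte should survive.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns the first byte of a private anonymous page after the page was
 * dirtied and then passed to posix_madvise(POSIX_MADV_DONTNEED), or -1 on
 * setup failure. posix_madvise() returns an error number, not -1/errno.
 */
static int byte_after_posix_dontneed(void)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *p;
    int value;

    if (page <= 0)
        return -1;

    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0x7f, (size_t)page);
    if (posix_madvise(p, (size_t)page, POSIX_MADV_DONTNEED) != 0) {
        munmap(p, (size_t)page);
        return -1;
    }

    value = p[0];                     /* advice only: the data is still there */
    munmap(p, (size_t)page);
    return value;
}
```

Here the function should return 0x7f, which is the difference the thread is pointing at: the POSIX advice may let the system page the data out, but it never discards it.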
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-17 7:15 [PATCH] lazy freeing of memory through MADV_FREE Rik van Riel
2007-04-19 21:15 ` [PATCH] lazy freeing of memory through MADV_FREE 2/2 Rik van Riel
@ 2007-04-20 20:57 ` Andrew Morton
2007-04-20 21:38 ` Rik van Riel
2007-04-22 8:18 ` Andrew Morton
2 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2007-04-20 20:57 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, linux-mm
On Tue, 17 Apr 2007 03:15:51 -0400 Rik van Riel <riel@redhat.com> wrote:

> Make it possible for applications to have the kernel free memory
> lazily. This reduces a repeated free/malloc cycle from freeing
> pages and allocating them, to just marking them freeable. If the
> application wants to reuse them before the kernel needs the memory,
> not even a page fault will happen.
>
> This patch, together with Ulrich's glibc change, increases
> MySQL sysbench performance by a factor of 2 on my quad core
> test system.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
>
> ---
> Ulrich Drepper has test glibc RPMS for this functionality at:
>
> http://people.redhat.com/drepper/rpms
>
> Andrew, I have stress tested this patch for a few days now and
> have not been able to find any more bugs. I believe it is ready
> to be merged in -mm, and upstream at the next merge window.
>
> When the patch goes upstream, I will submit a small follow-up
> patch to revert MADV_DONTNEED behaviour to what it did previously
> and have the new behaviour trigger only on MADV_FREE: at that
> point people will have to get new test RPMs of glibc.
>

I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".

- Nick's patch also will help this problem. It could be that your patch
  no longer offers a 2x speedup when combined with Nick's patch.

  It could well be that the combination of the two is even better, but it
  would be nice to firm that up a bit.

Chewing a page flag is an expensive thing to do. I do go on about that.
But we're adding page flags at about one per year, and when we run out
we're screwed - we'll need to grow the pageframe.

- I need to update your patch for Nick's patch. Please confirm that
  down_read(mmap_sem) is sufficient for MADV_FREE.

Stylistic nit:

> + if (PageLazyFree(page) && !migration) {
> + /* There is new data in the page. Reinstate it. */
> + if (unlikely(pte_dirty(pteval))) {
> + set_pte_at(mm, address, pte, pteval);
> + ret = SWAP_FAIL;
> + goto out_unmap;
> + }

The comment should be inside the second `if' statement. As it is, it
looks like we reinstate the page if (PageLazyFree(page) && !migration).
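To make the stylistic nit concrete, the decision in the quoted hunk can be modelled in a few lines of userspace C. This is only a toy model and not kernel code: `struct toy_mapping`, `toy_try_to_unmap_one()` and the enum values are stand-ins invented for illustration, and everything other than the lazy-free path is collapsed into a single "unmap succeeded" result. The comment placement follows the suggestion in this message.

```c
#include <stdbool.h>

/* Stand-ins for the kernel's return codes; the values are arbitrary here. */
enum toy_unmap_result { TOY_SWAP_SUCCESS, TOY_SWAP_FAIL };

/* Toy stand-in for the state the hunk inspects; this is not struct page. */
struct toy_mapping {
    bool lazy_free;   /* models PageLazyFree(page)          */
    bool pte_dirty;   /* models pte_dirty(pteval)           */
    bool migration;   /* models the migration argument      */
    bool mapped;      /* true while the pte is still in use */
};

static enum toy_unmap_result toy_try_to_unmap_one(struct toy_mapping *m)
{
    /* MADV_FREE is used to lazily free memory from userspace. */
    if (m->lazy_free && !m->migration) {
        if (m->pte_dirty) {
            /* There is new data in the page. Reinstate it. */
            m->mapped = true;
            return TOY_SWAP_FAIL;
        }
        /* Throw the page away. */
        m->mapped = false;
        return TOY_SWAP_SUCCESS;
    }

    /* Swap and file-backed handling is outside this toy model. */
    m->mapped = false;
    return TOY_SWAP_SUCCESS;
}
```

With the comment inside the inner `if`, it is clear that the page is reinstated only when it was written to after being marked lazily freeable. A clean page is simply discarded.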
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-20 20:57 ` [PATCH] lazy freeing of memory through MADV_FREE Andrew Morton
@ 2007-04-20 21:38 ` Rik van Riel
2007-04-20 22:06 ` Andrew Morton
2007-04-21 7:24 ` Hugh Dickins
0 siblings, 2 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-20 21:38 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-mm
Andrew Morton wrote:

> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>
> - Nick's patch also will help this problem. It could be that your patch
>   no longer offers a 2x speedup when combined with Nick's patch.
>
>   It could well be that the combination of the two is even better, but it
>   would be nice to firm that up a bit.

I'll test that.

> I do go on about that. But we're adding page flags at about one per
> year, and when we run out we're screwed - we'll need to grow the
> pageframe.

If you want, I can take a look at folding this into the
->mapping pointer. I can guarantee you it won't be
pretty, though :)

> - I need to update your patch for Nick's patch. Please confirm that
>   down_read(mmap_sem) is sufficient for MADV_FREE.

It is. MADV_FREE needs no more protection than MADV_DONTNEED.

> Stylistic nit:
>
>> + if (PageLazyFree(page) && !migration) {
>> + /* There is new data in the page. Reinstate it. */
>> + if (unlikely(pte_dirty(pteval))) {
>> + set_pte_at(mm, address, pte, pteval);
>> + ret = SWAP_FAIL;
>> + goto out_unmap;
>> + }
>
> The comment should be inside the second `if' statement. As it is, it
> looks like we reinstate the page if (PageLazyFree(page) && !migration).

Want me to move it?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-20 21:38 ` Rik van Riel
@ 2007-04-20 22:06 ` Andrew Morton
2007-04-20 23:52 ` Rik van Riel
2007-04-21 7:24 ` Hugh Dickins
1 sibling, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2007-04-20 22:06 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, linux-mm
On Fri, 20 Apr 2007 17:38:06 -0400 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
>
> > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> >
> > - Nick's patch also will help this problem. It could be that your patch
> >   no longer offers a 2x speedup when combined with Nick's patch.
> >
> >   It could well be that the combination of the two is even better, but it
> >   would be nice to firm that up a bit.
>
> I'll test that.

Thanks.

> > I do go on about that. But we're adding page flags at about one per
> > year, and when we run out we're screwed - we'll need to grow the
> > pageframe.
>
> If you want, I can take a look at folding this into the
> ->mapping pointer. I can guarantee you it won't be
> pretty, though :)

Well, let's see how fugly it ends up looking?

> > - I need to update your patch for Nick's patch. Please confirm that
> >   down_read(mmap_sem) is sufficient for MADV_FREE.
>
> It is. MADV_FREE needs no more protection than MADV_DONTNEED.
>
> > Stylistic nit:
> >
> >> + if (PageLazyFree(page) && !migration) {
> >> + /* There is new data in the page. Reinstate it. */
> >> + if (unlikely(pte_dirty(pteval))) {
> >> + set_pte_at(mm, address, pte, pteval);
> >> + ret = SWAP_FAIL;
> >> + goto out_unmap;
> >> + }
> >
> > The comment should be inside the second `if' statement. As it is, it
> > looks like we reinstate the page if (PageLazyFree(page) && !migration).
>
> Want me to move it?

I did that, thanks.
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-20 22:06 ` Andrew Morton
@ 2007-04-20 23:52 ` Rik van Riel
2007-04-21 0:48 ` Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-20 23:52 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-mm, shak
Andrew Morton wrote:

> On Fri, 20 Apr 2007 17:38:06 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
>> Andrew Morton wrote:
>>
>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>
>>> - Nick's patch also will help this problem. It could be that your patch
>>>   no longer offers a 2x speedup when combined with Nick's patch.
>>>
>>>   It could well be that the combination of the two is even better, but it
>>>   would be nice to firm that up a bit.
>> I'll test that.
>
> Thanks.

Well, good news.

It turns out that Nick's patch does not improve peak
performance much, but it does prevent the decline when
running with 16 threads on my quad core CPU!

We _definitely_ want both patches, there's a huge benefit
in having them both.

Here are the transactions/second for each combination:

threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
      1      610        609               596                   545
      2     1032       1136              1196                  1200
      4     1070       1128              2014                  2024
      8     1000       1088              1665                  2087
     16      779       1073              1310                  1999
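For readers who want the ratios behind "a huge benefit", the arithmetic can be sketched as follows. The numbers are copied from the table in this message. The struct, array and function names are made up for illustration, and the column index convention (0 = vanilla, 3 = madv_free + mmap_sem) is an assumption of this sketch.

```c
#include <stddef.h>

/* One row of the sysbench table: thread count and transactions/second. */
struct bench_row {
    int threads;
    int tps[4]; /* vanilla, new glibc, madv_free kernel, madv_free + mmap_sem */
};

static const struct bench_row sysbench_rows[] = {
    {  1, {  610,  609,  596,  545 } },
    {  2, { 1032, 1136, 1196, 1200 } },
    {  4, { 1070, 1128, 2014, 2024 } },
    {  8, { 1000, 1088, 1665, 2087 } },
    { 16, {  779, 1073, 1310, 1999 } },
};

/* Throughput of the given column relative to the vanilla column. */
static double speedup_vs_vanilla(const struct bench_row *row, size_t column)
{
    return (double)row->tps[column] / (double)row->tps[0];
}
```

With both patches applied, the 16-thread row works out to 1999 / 779, roughly 2.6 times vanilla, and the 4-thread row to 2024 / 1070, roughly 1.9 times. The single-thread row is 545 / 610, about 0.89, which is the regression questioned in the follow-up.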
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-20 23:52 ` Rik van Riel
@ 2007-04-21 0:48 ` Eric Dumazet
2007-04-21 3:58 ` Rik van Riel
2007-04-21 7:12 ` Jakub Jelinek
2007-04-22 2:36 ` Nick Piggin
2 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2007-04-21 0:48 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak
Rik van Riel wrote:

> Andrew Morton wrote:
>> On Fri, 20 Apr 2007 17:38:06 -0400
>> Rik van Riel <riel@redhat.com> wrote:
>>
>>> Andrew Morton wrote:
>>>
>>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>>
>>>> - Nick's patch also will help this problem. It could be that your patch
>>>>   no longer offers a 2x speedup when combined with Nick's patch.
>>>>
>>>>   It could well be that the combination of the two is even better, but it
>>>>   would be nice to firm that up a bit.
>>> I'll test that.
>>
>> Thanks.
>
> Well, good news.
>
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
>
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
>
> Here are the transactions/second for each combination:
>
> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>       1      610        609               596                   545

545 tps versus 610 tps for one thread? It seems quite bad, no?

Could you please find an explanation for this?

>       2     1032       1136              1196                  1200
>       4     1070       1128              2014                  2024
>       8     1000       1088              1665                  2087
>      16      779       1073              1310                  1999
>
>

Thank you
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-21 0:48 ` Eric Dumazet
@ 2007-04-21 3:58 ` Rik van Riel
0 siblings, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-21 3:58 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Andrew Morton, linux-kernel, linux-mm, shak

Eric Dumazet wrote:
> Rik van Riel wrote:
>> Andrew Morton wrote:
>>> On Fri, 20 Apr 2007 17:38:06 -0400
>>> Rik van Riel <riel@redhat.com> wrote:
>>>
>>>> Andrew Morton wrote:
>>>>
>>>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>>>
>>>>> - Nick's patch also will help this problem. It could be that your
>>>>> patch no longer offers a 2x speedup when combined with Nick's patch.
>>>>>
>>>>> It could well be that the combination of the two is even better,
>>>>> but it would be nice to firm that up a bit.
>>>> I'll test that.
>>>
>>> Thanks.
>>
>> Well, good news.
>>
>> It turns out that Nick's patch does not improve peak
>> performance much, but it does prevent the decline when
>> running with 16 threads on my quad core CPU!
>>
>> We _definitely_ want both patches, there's a huge benefit
>> in having them both.
>>
>> Here are the transactions/seconds for each combination:
>>
>> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>>
>> 1        610      609        596               545
>
> 545 tps versus 610 tps for one thread? It seems quite bad, no?
>
> Could you please find an explanation for this?

I have no idea why this happens.

Especially the last one, going from a write lock to a read lock
on the mmap_sem should not make ANY difference whatsoever since
we're running single threaded!

>> 2        1032     1136       1196              1200
>> 4        1070     1128       2014              2024
>> 8        1000     1088       1665              2087
>> 16       779      1073       1310              1999

Performance with 2 database threads is way better though, and
performance with 4 or more threads more than doubles...

If you have an explanation on why single threaded performance
went down a little on my quad core system, please let me know.

Does performance suffer at all on a real UP system?

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
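As a worked reading of the table above, dividing the "madv_free + mmap_sem" column by the "vanilla" column gives roughly 0.89x at 1 thread, 1.16x at 2, 1.89x at 4, 2.09x at 8 and 2.57x at 16 threads. The sketch below only restates that arithmetic; the function and array names are made up for illustration and are not part of sysbench or of any patch in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* tps figures copied from the table in this thread.
 * Rows correspond to 1, 2, 4, 8 and 16 sysbench threads. */
static const double vanilla_tps[] = { 610, 1032, 1070, 1000, 779 };
static const double both_tps[]    = { 545, 1200, 2024, 2087, 1999 };

/* Ratio of patched to baseline throughput; a value above 1.0 means faster. */
double speedup(double patched_tps, double baseline_tps)
{
    assert(baseline_tps > 0.0);
    return patched_tps / baseline_tps;
}

/* Speedup of "madv_free + mmap_sem" over "vanilla" for one table row
 * (hypothetical helper; row 0 is 1 thread, row 4 is 16 threads). */
double table_speedup(size_t row)
{
    assert(row < sizeof vanilla_tps / sizeof vanilla_tps[0]);
    return speedup(both_tps[row], vanilla_tps[row]);
}
```

Calling `table_speedup(0)` returns a value below 1.0, which is the single-threaded regression being asked about, while rows 3 and 4 are above 2.0.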
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-20 23:52 ` Rik van Riel
2007-04-21 0:48 ` Eric Dumazet
@ 2007-04-21 7:12 ` Jakub Jelinek
2007-04-23 4:36 ` Nick Piggin
2007-04-22 2:36 ` Nick Piggin
2 siblings, 1 reply; 43+ messages in thread
From: Jakub Jelinek @ 2007-04-21 7:12 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak

On Fri, Apr 20, 2007 at 07:52:44PM -0400, Rik van Riel wrote:
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
>
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
>
> Here are the transactions/seconds for each combination:
>
> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>
> 1        610      609        596               545
> 2        1032     1136       1196              1200
> 4        1070     1128       2014              2024
> 8        1000     1088       1665              2087
> 16       779      1073       1310              1999

FYI, I have uploaded a testing glibc that uses MADV_FREE and falls back
to MADV_DONTNEED if MADV_FREE is not available, to
http://people.redhat.com/jakub/glibc/2.5.90-21.1/

and I'm also attaching the glibc patch for those who want to build it
themselves:

2007-04-19  Ulrich Drepper  <drepper@redhat.com>
            Jakub Jelinek  <jakub@redhat.com>

        * malloc/arena.c (heap_info): Add mprotect_size field, adjust pad.
        (new_heap): Initialize mprotect_size.
        (no_madv_free): New variable.
        (grow_heap): When growing, only mprotect from mprotect_size till
        new_size if mprotect_size is smaller.  When shrinking, use PROT_NONE
        MMAP for __libc_enable_secure only, otherwise if MADV_FREE is
        available use it and fall back to MADV_DONTNEED.
        * sysdeps/unix/sysv/linux/alpha/bits/mman.h (MADV_FREE): Define.
        * sysdeps/unix/sysv/linux/ia64/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/i386/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/s390/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/powerpc/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/x86_64/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/sparc/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/sh/bits/mman.h (MADV_FREE): Likewise.

--- libc/malloc/arena.c.jj 2006-10-31 23:05:31.000000000 +0100
+++ libc/malloc/arena.c 2007-04-19 18:54:20.000000000 +0200
@@ -1,5 +1,6 @@
 /* Malloc implementation for multiple threads without lock contention.
-   Copyright (C) 2001,2002,2003,2004,2005,2006 Free Software Foundation, Inc.
+   Copyright (C) 2001,2002,2003,2004,2005,2006,2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
    Contributed by Wolfram Gloger <wg@malloc.de>, 2001.
 
@@ -59,10 +60,12 @@ typedef struct _heap_info {
   mstate ar_ptr; /* Arena for this heap. */
   struct _heap_info *prev; /* Previous heap. */
   size_t size;   /* Current size in bytes. */
+  size_t mprotect_size; /* Size in bytes that has been mprotected
+                           PROT_READ|PROT_WRITE.  */
   /* Make sure the following data is properly aligned, particularly
      that sizeof (heap_info) + 2 * SIZE_SZ is a multiple of
-     MALLOG_ALIGNMENT. */
-  char pad[-5 * SIZE_SZ & MALLOC_ALIGN_MASK];
+     MALLOC_ALIGNMENT. */
+  char pad[-6 * SIZE_SZ & MALLOC_ALIGN_MASK];
 } heap_info;
 
 /* Get a compile-time error if the heap_info padding is not correct
@@ -692,10 +695,15 @@ new_heap(size, top_pad) size_t size, top
   }
   h = (heap_info *)p2;
   h->size = size;
+  h->mprotect_size = size;
   THREAD_STAT(stat_n_heaps++);
   return h;
 }
 
+#if defined _LIBC && defined MADV_FREE
+static int no_madv_free;
+#endif
+
 /* Grow or shrink a heap.  size is automatically rounded up to a
    multiple of the page size if it is positive. */
 
@@ -714,17 +722,49 @@ grow_heap(h, diff) heap_info *h; long di
     new_size = (long)h->size + diff;
     if((unsigned long) new_size > (unsigned long) HEAP_MAX_SIZE)
       return -1;
-    if(mprotect((char *)h + h->size, diff, PROT_READ|PROT_WRITE) != 0)
-      return -2;
+    if((unsigned long) new_size > h->mprotect_size) {
+      if (mprotect((char *)h + h->mprotect_size,
+                   (unsigned long) new_size - h->mprotect_size,
+                   PROT_READ|PROT_WRITE) != 0)
+        return -2;
+      h->mprotect_size = new_size;
+    }
   } else {
     new_size = (long)h->size + diff;
     if(new_size < (long)sizeof(*h))
       return -1;
     /* Try to re-map the extra heap space freshly to save memory, and
        make it inaccessible. */
-    if((char *)MMAP((char *)h + new_size, -diff, PROT_NONE,
-                    MAP_PRIVATE|MAP_FIXED) == (char *) MAP_FAILED)
-      return -2;
+#ifdef _LIBC
+    if (__builtin_expect (__libc_enable_secure, 0))
+#else
+    if (1)
+#endif
+      {
+        if((char *)MMAP((char *)h + new_size, -diff, PROT_NONE,
+                        MAP_PRIVATE|MAP_FIXED) == (char *) MAP_FAILED)
+          return -2;
+        h->mprotect_size = new_size;
+      }
+#ifdef _LIBC
+    else
+      {
+# ifdef MADV_FREE
+        if (!__builtin_expect (no_madv_free, 0))
+          {
+            if (__builtin_expect (madvise ((char *)h + new_size,
+                                           -diff, MADV_FREE), 0) == -1
+                && errno == EINVAL)
+              {
+                no_madv_free = 1;
+                madvise ((char *)h + new_size, -diff, MADV_DONTNEED);
+              }
+          }
+        else
+# endif
+          madvise ((char *)h + new_size, -diff, MADV_DONTNEED);
+      }
+#endif
     /*fprintf(stderr, "shrink %p %08lx\n", h, new_size);*/
   }
   h->size = new_size;
--- libc/sysdeps/unix/sysv/linux/alpha/bits/mman.h.jj 2006-05-02 16:33:44.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/alpha/bits/mman.h 2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,6 @@
 /* Definitions for POSIX memory map interface.  Linux/Alpha version.
-   Copyright (C) 1997, 1998, 2000, 2003, 2006 Free Software Foundation, Inc.
+   Copyright (C) 1997, 1998, 2000, 2003, 2006, 2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -96,6 +97,7 @@
 # define MADV_SEQUENTIAL 2 /* Expect sequential page references. */
 # define MADV_WILLNEED 3 /* Will need these pages. */
 # define MADV_DONTNEED 6 /* Don't need these pages. */
+# define MADV_FREE 7 /* Content can be freed. */
 # define MADV_REMOVE 9 /* Remove these pages and resources. */
 # define MADV_DONTFORK 10 /* Do not inherit across fork. */
 # define MADV_DOFORK 11 /* Do inherit across fork. */
--- libc/sysdeps/unix/sysv/linux/ia64/bits/mman.h.jj 2006-05-02 16:33:44.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/ia64/bits/mman.h 2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,6 @@
 /* Definitions for POSIX memory map interface.  Linux/ia64 version.
-   Copyright (C) 1997,1998,2000,2003,2005,2006 Free Software Foundation, Inc.
+   Copyright (C) 1997,1998,2000,2003,2005,2006,2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -89,6 +90,7 @@
 # define MADV_SEQUENTIAL 2 /* Expect sequential page references. */
 # define MADV_WILLNEED 3 /* Will need these pages. */
 # define MADV_DONTNEED 4 /* Don't need these pages. */
+# define MADV_FREE 5 /* Content can be freed. */
 # define MADV_REMOVE 9 /* Remove these pages and resources. */
 # define MADV_DONTFORK 10 /* Do not inherit across fork. */
 # define MADV_DOFORK 11 /* Do inherit across fork. */
--- libc/sysdeps/unix/sysv/linux/i386/bits/mman.h.jj 2006-05-02 16:33:44.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/i386/bits/mman.h 2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,6 @@
 /* Definitions for POSIX memory map interface.  Linux/i386 version.
-   Copyright (C) 1997, 2000, 2003, 2005, 2006 Free Software Foundation, Inc.
+   Copyright (C) 1997, 2000, 2003, 2005, 2006, 2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -88,6 +89,7 @@
 # define MADV_SEQUENTIAL 2 /* Expect sequential page references. */
 # define MADV_WILLNEED 3 /* Will need these pages. */
 # define MADV_DONTNEED 4 /* Don't need these pages. */
+# define MADV_FREE 5 /* Content can be freed. */
 # define MADV_REMOVE 9 /* Remove these pages and resources. */
 # define MADV_DONTFORK 10 /* Do not inherit across fork. */
 # define MADV_DOFORK 11 /* Do inherit across fork. */
--- libc/sysdeps/unix/sysv/linux/s390/bits/mman.h.jj 2006-05-02 16:33:44.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/s390/bits/mman.h 2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,6 @@
 /* Definitions for POSIX memory map interface.  Linux/s390 version.
-   Copyright (C) 2000,2001,2002,2003,2005,2006 Free Software Foundation, Inc.
+   Copyright (C) 2000,2001,2002,2003,2005,2006,2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -89,6 +90,7 @@
 # define MADV_SEQUENTIAL 2 /* Expect sequential page references. */
 # define MADV_WILLNEED 3 /* Will need these pages. */
 # define MADV_DONTNEED 4 /* Don't need these pages. */
+# define MADV_FREE 5 /* Content can be freed. */
 # define MADV_REMOVE 9 /* Remove these pages and resources. */
 # define MADV_DONTFORK 10 /* Do not inherit across fork. */
 # define MADV_DOFORK 11 /* Do inherit across fork. */
--- libc/sysdeps/unix/sysv/linux/powerpc/bits/mman.h.jj 2006-05-02 16:33:44.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/powerpc/bits/mman.h 2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,6 @@
 /* Definitions for POSIX memory map interface.  Linux/PowerPC version.
-   Copyright (C) 1997, 2000, 2003, 2005, 2006 Free Software Foundation, Inc.
+   Copyright (C) 1997, 2000, 2003, 2005, 2006, 2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -89,6 +90,7 @@
 # define MADV_SEQUENTIAL 2 /* Expect sequential page references. */
 # define MADV_WILLNEED 3 /* Will need these pages. */
 # define MADV_DONTNEED 4 /* Don't need these pages. */
+# define MADV_FREE 5 /* Content can be freed. */
 # define MADV_REMOVE 9 /* Remove these pages and resources. */
 # define MADV_DONTFORK 10 /* Do not inherit across fork. */
 # define MADV_DOFORK 11 /* Do inherit across fork. */
--- libc/sysdeps/unix/sysv/linux/x86_64/bits/mman.h.jj 2006-05-02 16:33:46.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/x86_64/bits/mman.h 2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,5 @@
 /* Definitions for POSIX memory map interface.  Linux/x86_64 version.
-   Copyright (C) 2001, 2003, 2005, 2006 Free Software Foundation, Inc.
+   Copyright (C) 2001, 2003, 2005, 2006, 2007 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -89,6 +89,7 @@
 # define MADV_SEQUENTIAL 2 /* Expect sequential page references. */
 # define MADV_WILLNEED 3 /* Will need these pages. */
 # define MADV_DONTNEED 4 /* Don't need these pages. */
+# define MADV_FREE 5 /* Content can be freed. */
 # define MADV_REMOVE 9 /* Remove these pages and resources. */
 # define MADV_DONTFORK 10 /* Do not inherit across fork. */
 # define MADV_DOFORK 11 /* Do inherit across fork. */
--- libc/sysdeps/unix/sysv/linux/sparc/bits/mman.h.jj 2006-05-02 16:33:44.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/sparc/bits/mman.h 2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,6 @@
 /* Definitions for POSIX memory map interface.  Linux/SPARC version.
-   Copyright (C) 1997,1999,2000,2003,2005,2006 Free Software Foundation, Inc.
+   Copyright (C) 1997,1999,2000,2003,2005,2006,2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -90,7 +91,7 @@
 # define MADV_SEQUENTIAL 2 /* Expect sequential page references. */
 # define MADV_WILLNEED 3 /* Will need these pages. */
 # define MADV_DONTNEED 4 /* Don't need these pages. */
-# define MADV_FREE 5 /* Content can be freed (Solaris). */
+# define MADV_FREE 5 /* Content can be freed. */
 # define MADV_REMOVE 9 /* Remove these pages and resources. */
 # define MADV_DONTFORK 10 /* Do not inherit across fork. */
 # define MADV_DOFORK 11 /* Do inherit across fork. */
--- libc/sysdeps/unix/sysv/linux/sh/bits/mman.h.jj 2006-05-02 16:33:44.000000000 +0200
+++ libc/sysdeps/unix/sysv/linux/sh/bits/mman.h 2007-04-19 18:37:43.000000000 +0200
@@ -1,5 +1,6 @@
 /* Definitions for POSIX memory map interface.  Linux/SH version.
-   Copyright (C) 1997,1999,2000,2003,2005,2006 Free Software Foundation, Inc.
+   Copyright (C) 1997,1999,2000,2003,2005,2006,2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -88,6 +89,7 @@
 # define MADV_SEQUENTIAL 2 /* Expect sequential page references. */
 # define MADV_WILLNEED 3 /* Will need these pages. */
 # define MADV_DONTNEED 4 /* Don't need these pages. */
+# define MADV_FREE 5 /* Content can be freed. */
 # define MADV_REMOVE 9 /* Remove these pages and resources. */
 # define MADV_DONTFORK 10 /* Do not inherit across fork. */
 # define MADV_DOFORK 11 /* Do inherit across fork. */

	Jakub
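The fallback that the ChangeLog describes for the shrinking path of grow_heap can be sketched outside glibc roughly as follows. This is a minimal illustration under stated assumptions, not the glibc code: `lazy_free` is a hypothetical helper name, it assumes a Linux system whose headers may or may not define MADV_FREE, and like the patch it treats EINVAL from the first MADV_FREE call as "kernel does not support it" and remembers that in a flag.

```c
#define _DEFAULT_SOURCE        /* for madvise() and MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifdef MADV_FREE
/* Set once the kernel has rejected MADV_FREE, mirroring no_madv_free
 * in the patch, so later calls go straight to MADV_DONTNEED. */
static int no_madv_free;
#endif

/* Hypothetical helper: tell the kernel that [addr, addr + len) is no
 * longer needed.  Returns 0 on success, -1 on failure (errno is set). */
int lazy_free(void *addr, size_t len)
{
    assert(addr != NULL);
#ifdef MADV_FREE
    if (!no_madv_free) {
        if (madvise(addr, len, MADV_FREE) == 0)
            return 0;
        if (errno != EINVAL)
            return -1;
        /* Assumption carried over from the patch: EINVAL here means
         * the running kernel does not know MADV_FREE. */
        no_madv_free = 1;
    }
#endif
    return madvise(addr, len, MADV_DONTNEED);
}
```

A caller would pass a page-aligned range from a private anonymous mapping, for example the tail of a heap that has just shrunk. The range stays mapped and writable afterwards, so it can be reused without another mmap or mprotect call.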
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-21 7:12 ` Jakub Jelinek
@ 2007-04-23 4:36 ` Nick Piggin
0 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-23 4:36 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: Rik van Riel, Andrew Morton, linux-kernel, linux-mm, shak

Jakub Jelinek wrote:
> On Fri, Apr 20, 2007 at 07:52:44PM -0400, Rik van Riel wrote:
>
>>It turns out that Nick's patch does not improve peak
>>performance much, but it does prevent the decline when
>>running with 16 threads on my quad core CPU!
>>
>>We _definitely_ want both patches, there's a huge benefit
>>in having them both.
>>
>>Here are the transactions/seconds for each combination:
>>
>>threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>>
>>1        610      609        596               545
>>2        1032     1136       1196              1200
>>4        1070     1128       2014              2024
>>8        1000     1088       1665              2087
>>16       779      1073       1310              1999
>
>
> FYI, I have uploaded a testing glibc that uses MADV_FREE and falls back
> to MADV_DONTNEED if MADV_FREE is not available, to
> http://people.redhat.com/jakub/glibc/2.5.90-21.1/

Hmm, I wonder how glibc malloc stacks up to tcmalloc on this test
(after the mmap_sem patch as well).

I'll try running that as well!

-- 
SUSE Labs, Novell Inc.
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-20 23:52 ` Rik van Riel
2007-04-21 0:48 ` Eric Dumazet
2007-04-21 7:12 ` Jakub Jelinek
@ 2007-04-22 2:36 ` Nick Piggin
2007-04-22 2:50 ` Nick Piggin
` (2 more replies)
2 siblings, 3 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-22 2:36 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak

Rik van Riel wrote:
> Andrew Morton wrote:
>
>> On Fri, 20 Apr 2007 17:38:06 -0400
>> Rik van Riel <riel@redhat.com> wrote:
>>
>>> Andrew Morton wrote:
>>>
>>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>>
>>>> - Nick's patch also will help this problem. It could be that your
>>>> patch no longer offers a 2x speedup when combined with Nick's patch.
>>>>
>>>> It could well be that the combination of the two is even better,
>>>> but it would be nice to firm that up a bit.
>>>
>>> I'll test that.
>>
>>
>> Thanks.
>
>
> Well, good news.
>
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
>
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
>
> Here are the transactions/seconds for each combination:
>
> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>
> 1        610      609        596               545
> 2        1032     1136       1196              1200
> 4        1070     1128       2014              2024
> 8        1000     1088       1665              2087
> 16       779      1073       1310              1999

Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?

The strange thing with your madv_free kernel is that it doesn't
help single-threaded performance at all. So that work to avoid
zeroing the new page is not a win at all there (maybe due to the
cache effects I was worried about?).

However MADV_FREE does improve scalability, which is interesting.
The most likely reason I can see why that may be the case is that
it avoids mmap_sem when faulting pages back in (I doubt it is due
to avoiding the page allocator, but maybe?).

So where is the down_write coming from in this workload, I wonder?
Heap management? What syscalls?

x86_64's rwsems are crap under heavy parallelism (even read-only),
as I fixed in my recent generic rwsems patch. I don't expect MySQL
to be such a mmap_sem microbenchmark, but I wonder how much this
would help?

What if we ran the private futexes patch to further cut down
mmap_sem contention?

-- 
SUSE Labs, Novell Inc.
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-22 2:36 ` Nick Piggin
@ 2007-04-22 2:50 ` Nick Piggin
2007-04-22 6:31 ` Rik van Riel
2007-04-23 4:28 ` Rik van Riel
2 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-22 2:50 UTC (permalink / raw)
To: Nick Piggin; +Cc: Rik van Riel, Andrew Morton, linux-kernel, linux-mm, shak

Nick Piggin wrote:
> Rik van Riel wrote:
>
>> Andrew Morton wrote:
>>
>>> On Fri, 20 Apr 2007 17:38:06 -0400
>>> Rik van Riel <riel@redhat.com> wrote:
>>>
>>>> Andrew Morton wrote:
>>>>
>>>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>>>
>>>>> - Nick's patch also will help this problem. It could be that your
>>>>> patch no longer offers a 2x speedup when combined with Nick's patch.
>>>>>
>>>>> It could well be that the combination of the two is even better,
>>>>> but it would be nice to firm that up a bit.
>>>>
>>>>
>>>> I'll test that.
>>>
>>>
>>>
>>> Thanks.
>>
>>
>>
>> Well, good news.
>>
>> It turns out that Nick's patch does not improve peak
>> performance much, but it does prevent the decline when
>> running with 16 threads on my quad core CPU!
>>
>> We _definitely_ want both patches, there's a huge benefit
>> in having them both.
>>
>> Here are the transactions/seconds for each combination:
>>
>> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>>
>> 1        610      609        596               545
>> 2        1032     1136       1196              1200
>> 4        1070     1128       2014              2024
>> 8        1000     1088       1665              2087
>> 16       779      1073       1310              1999
>
>
>
> Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?
>
> The strange thing with your madv_free kernel is that it doesn't
> help single-threaded performance at all. So that work to avoid
> zeroing the new page is not a win at all there (maybe due to the
> cache effects I was worried about?).
>
> However MADV_FREE does improve scalability, which is interesting.
> The most likely reason I can see why that may be the case is that
> it avoids mmap_sem when faulting pages back in (I doubt it is due
> to avoiding the page allocator, but maybe?).
>
> So where is the down_write coming from in this workload, I wonder?
> Heap management? What syscalls?
>
> x86_64's rwsems are crap under heavy parallelism (even read-only),
> as I fixed in my recent generic rwsems patch. I don't expect MySQL
> to be such a mmap_sem microbenchmark, but I wonder how much this
> would help?
>
> What if we ran the private futexes patch to further cut down
> mmap_sem contention?

Hmm, without the MADV_FREE patch, I wonder if it isn't doing something
silly like read-faulting in a ZERO_PAGE then write faulting a new page
straight afterwards.. I'll have to try a few tests.

-- 
SUSE Labs, Novell Inc.
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-22 2:36 ` Nick Piggin
2007-04-22 2:50 ` Nick Piggin
@ 2007-04-22 6:31 ` Rik van Riel
2007-04-23 0:16 ` Nick Piggin
2007-04-23 4:28 ` Rik van Riel
2 siblings, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2007-04-22 6:31 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak

Nick Piggin wrote:
> Rik van Riel wrote:
>> Andrew Morton wrote:
>>
>>> On Fri, 20 Apr 2007 17:38:06 -0400
>>> Rik van Riel <riel@redhat.com> wrote:
>>>
>>>> Andrew Morton wrote:
>>>>
>>>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>>>
>>>>> - Nick's patch also will help this problem. It could be that your
>>>>> patch no longer offers a 2x speedup when combined with Nick's patch.
>>>>>
>>>>> It could well be that the combination of the two is even better,
>>>>> but it would be nice to firm that up a bit.
>>>>
>>>> I'll test that.
>>>
>>>
>>> Thanks.
>>
>>
>> Well, good news.
>>
>> It turns out that Nick's patch does not improve peak
>> performance much, but it does prevent the decline when
>> running with 16 threads on my quad core CPU!
>>
>> We _definitely_ want both patches, there's a huge benefit
>> in having them both.
>>
>> Here are the transactions/seconds for each combination:
>>
>> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>>
>> 1        610      609        596               545
>> 2        1032     1136       1196              1200
>> 4        1070     1128       2014              2024
>> 8        1000     1088       1665              2087
>> 16       779      1073       1310              1999
>
>
> Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?

No, that's just the glibc change, with a vanilla kernel.

The third column is glibc change + mmap_sem patch.

The fourth column has your patch in it, too.

> The strange thing with your madv_free kernel is that it doesn't
> help single-threaded performance at all. So that work to avoid
> zeroing the new page is not a win at all there (maybe due to the
> cache effects I was worried about?).

Well, your patch causes the performance to drop from
596 transactions/second to 545. Your patch is the only
difference between the third and the fourth column.

> However MADV_FREE does improve scalability, which is interesting.
> The most likely reason I can see why that may be the case is that
> it avoids mmap_sem when faulting pages back in (I doubt it is due
> to avoiding the page allocator, but maybe?).
>
> So where is the down_write coming from in this workload, I wonder?
> Heap management? What syscalls?

I wonder if the increased parallelism simply caused
more cache line bouncing, with bounces happening in
some inner loop instead of an outer loop.

Btw, it is quite possible that the MySQL sysbench
thing gives different results on your system. It
would be good to know what it does on a real SMP
system, vs. a single quad-core chip :)

Other architectures would be interesting to know,
too.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-22 6:31 ` Rik van Riel
@ 2007-04-23 0:16 ` Nick Piggin
2007-04-23 3:53 ` Rik van Riel
0 siblings, 1 reply; 43+ messages in thread
From: Nick Piggin @ 2007-04-23 0:16 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak

Rik van Riel wrote:
> Nick Piggin wrote:
>
>> Rik van Riel wrote:

>>> Here are the transactions/seconds for each combination:
>>>
>>> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>>>
>>> 1        610      609        596               545
>>> 2        1032     1136       1196              1200
>>> 4        1070     1128       2014              2024
>>> 8        1000     1088       1665              2087
>>> 16       779      1073       1310              1999
>>
>>
>>
>> Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?
>
>
> No, that's just the glibc change, with a vanilla kernel.

OK. That would be interesting to see with the mmap_sem change,
because that should increase scalability.

> The third column is glibc change + mmap_sem patch.
>
> The fourth column has your patch in it, too.
>
>> The strange thing with your madv_free kernel is that it doesn't
>> help single-threaded performance at all. So that work to avoid
>> zeroing the new page is not a win at all there (maybe due to the
>> cache effects I was worried about?).
>
>
> Well, your patch causes the performance to drop from
> 596 transactions/second to 545. Your patch is the only
> difference between the third and the fourth column.

Yeah. That's funny, because it means either there is some
contention on the mmap_sem (or ptl) at 1 thread, or that my
patch alters the uncontended performance.

>> However MADV_FREE does improve scalability, which is interesting.
>> The most likely reason I can see why that may be the case is that
>> it avoids mmap_sem when faulting pages back in (I doubt it is due
>> to avoiding the page allocator, but maybe?).
>>
>> So where is the down_write coming from in this workload, I wonder?
>> Heap management? What syscalls?
>
>
> I wonder if the increased parallelism simply caused
> more cache line bouncing, with bounces happening in
> some inner loop instead of an outer loop.
>
> Btw, it is quite possible that the MySQL sysbench
> thing gives different results on your system. It
> would be good to know what it does on a real SMP
> system, vs. a single quad-core chip :)
>
> Other architectures would be interesting to know,
> too.

I don't see why parallelism should come into it at 1 thread, unless
MySQL is parallelising individual transactions. Anyway, I'll try to
do some more digging.

-- 
SUSE Labs, Novell Inc.
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-23 0:16 ` Nick Piggin
@ 2007-04-23 3:53 ` Rik van Riel
2007-04-23 3:58 ` Nick Piggin
2007-04-23 3:59 ` Rik van Riel
0 siblings, 2 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-23 3:53 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Nick Piggin wrote:
> Rik van Riel wrote:
>> Nick Piggin wrote:
>>
>>> Rik van Riel wrote:
>
>>>> Here are the transactions/seconds for each combination:

I've added a 5th column, with just your mmap_sem patch and
without my madv_free patch. It is run with the glibc patch,
which should make it fall back to MADV_DONTNEED after the
first MADV_FREE call fails.

>>>> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem  mmap_sem
>>>>
>>>> 1        610      609        596               545                   534
>>>> 2        1032     1136       1196              1200                  1180
>>>> 4        1070     1128       2014              2024                  2027
>>>> 8        1000     1088       1665              2087                  2089
>>>> 16       779      1073       1310              1999                  2012

Not doing the mprotect calls is the big one I guess, especially
the fact that we don't need to take the mmap_sem for writing.

With both our patches, single and two thread performance with
MySQL sysbench is somewhat better than with just your patch,
4 and 8 thread performance are basically the same and just
your patch gives a slight benefit with 16 threads.

I guess I should benchmark up to 64 or 128 threads tomorrow,
to see if this is just luck or if the cache benefit of doing
the page faults and reusing hot pages is faster than not
having page faults at all.

I should run some benchmarks on other systems, too. Some of
these results could be an artifact of my quad core CPU. The
results could be very different on other systems...

> Yeah. That's funny, because it means either there is some
> contention on the mmap_sem (or ptl) at 1 thread, or that my
> patch alters the uncontended performance.

Maybe MySQL has various different threads to do different
tasks. Something to look into...
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-23 3:53 ` Rik van Riel
@ 2007-04-23 3:58 ` Nick Piggin
2007-04-23 10:07 ` Nick Piggin
2007-04-23 3:59 ` Rik van Riel
1 sibling, 1 reply; 43+ messages in thread
From: Nick Piggin @ 2007-04-23 3:58 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Rik van Riel wrote:

> I've added a 5th column, with just your mmap_sem patch and
> without my madv_free patch. It is run with the glibc patch,
> which should make it fall back to MADV_DONTNEED after the
> first MADV_FREE call fails.

Thanks! (I edited slightly so it doesn't wrap)

> threads  vanilla  new glibc  madv_free  mmap_sem  both
>
> 1        610      609        596        534       545
> 2        1032     1136       1196       1180      1200
> 4        1070     1128       2014       2027      2024
> 8        1000     1088       1665       2089      2087
> 16       779      1073       1310       2012      1999
>
>
> Not doing the mprotect calls is the big one I guess, especially
> the fact that we don't need to take the mmap_sem for writing.

Yes.

> With both our patches, single and two thread performance with
> MySQL sysbench is somewhat better than with just your patch,
> 4 and 8 thread performance are basically the same and just
> your patch gives a slight benefit with 16 threads.
>
> I guess I should benchmark up to 64 or 128 threads tomorrow,
> to see if this is just luck or if the cache benefit of doing
> the page faults and reusing hot pages is faster than not
> having page faults at all.
>
> I should run some benchmarks on other systems, too. Some of
> these results could be an artifact of my quad core CPU. The
> results could be very different on other systems...

I'm getting the 16 core box out of retirement as we speak :)

-- 
SUSE Labs, Novell Inc.
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-23 3:58 ` Nick Piggin
@ 2007-04-23 10:07 ` Nick Piggin
2007-04-23 10:12 ` Rik van Riel
0 siblings, 1 reply; 43+ messages in thread
From: Nick Piggin @ 2007-04-23 10:07 UTC (permalink / raw)
To: Nick Piggin
Cc: Rik van Riel, Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Nick Piggin wrote:
> Rik van Riel wrote:
>
>> I've added a 5th column, with just your mmap_sem patch and
>> without my madv_free patch. It is run with the glibc patch,
>> which should make it fall back to MADV_DONTNEED after the
>> first MADV_FREE call fails.
>
>
> Thanks! (I edited slightly so it doesn't wrap)
>
>
>> threads  vanilla  new glibc  madv_free  mmap_sem  both
>>
>> 1        610      609        596        534       545
>> 2        1032     1136       1196       1180      1200
>> 4        1070     1128       2014       2027      2024
>> 8        1000     1088       1665       2089      2087
>> 16       779      1073       1310       2012      1999
>>
>>
>> Not doing the mprotect calls is the big one I guess, especially
>> the fact that we don't need to take the mmap_sem for writing.
>
>
> Yes.
>
>
>> With both our patches, single and two thread performance with
>> MySQL sysbench is somewhat better than with just your patch,
>> 4 and 8 thread performance are basically the same and just
>> your patch gives a slight benefit with 16 threads.
>>
>> I guess I should benchmark up to 64 or 128 threads tomorrow,
>> to see if this is just luck or if the cache benefit of doing
>> the page faults and reusing hot pages is faster than not
>> having page faults at all.
>>
>> I should run some benchmarks on other systems, too. Some of
>> these results could be an artifact of my quad core CPU. The
>> results could be very different on other systems...
>
>
> I'm getting the 16 core box out of retirement as we speak :)
>

OK, 10 runs at 1 client, 2.6.21-rc6, MySQL version 5.33, and Jakub's
new glibc gives a 99.9% confidence of:

vanilla:  467.2 +/- 7.9 (tps)
mmap_sem: 470.5 +/- 9.3 (tps)

However, it seems those means jump around a bit from boot to boot, so
there could be some memory placement luck for cache and/or NUMA
goodness involved.

So I think it is safe to say that the mmap_sem patch doesn't hurt single
threaded performance (from looking at the numbers and the patch). And
that's the most important thing for that patch.

I'll post some scalability results tomorrow. From my first round of tests,
after new glibc and the mmap_sem patch, it doesn't seem like rwsem
improvements, private futexes, or avoiding zero_page make any significant
differences.

I haven't tested your MADV_FREE patch yet.

-- 
SUSE Labs, Novell Inc.
* Re: [PATCH] lazy freeing of memory through MADV_FREE 2007-04-23 10:07 ` Nick Piggin @ 2007-04-23 10:12 ` Rik van Riel 0 siblings, 0 replies; 43+ messages in thread From: Rik van Riel @ 2007-04-23 10:12 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper Nick Piggin wrote: > I haven't tested your MADV_FREE patch yet. Good. It turned out that one behaved a bit strange without tlb batching anyway. I'm now running ebizzy across the whole set of kernels I tested before, and will post the results in a bit. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-23 3:53 ` Rik van Riel
2007-04-23 3:58 ` Nick Piggin
@ 2007-04-23 3:59 ` Rik van Riel
2007-04-23 9:20 ` Rik van Riel
1 sibling, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2007-04-23 3:59 UTC (permalink / raw)
To: Rik van Riel
Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Rik van Riel wrote:
> Nick Piggin wrote:
>> Rik van Riel wrote:
>>> Nick Piggin wrote:
>>>
>>>> Rik van Riel wrote:
>>
>>>>> Here are the transactions/seconds for each combination:
>
> I've added a 5th column, with just your mmap_sem patch and
> without my madv_free patch. It is run with the glibc patch,
> which should make it fall back to MADV_DONTNEED after the
> first MADV_FREE call fails.
>
>>>>>          vanilla    new glibc  madv_free  madv_free   mmap_sem
>>>>>                                kernel     + mmap_sem
>>>>> threads
>>>>>
>>>>> 1        610        609        596        545         534
>>>>> 2        1032       1136       1196       1200        1180
>>>>> 4        1070       1128       2014       2024        2027
>>>>> 8        1000       1088       1665       2087        2089
>>>>> 16       779        1073       1310       1999        2012

Now that I think about it - this is all with the rawhide kernel
configuration, which has an ungodly number of debug config
options enabled.

I should try this with a more normal kernel, on various different
systems.

It would also be helpful if other people tried this same benchmark,
and others, on their systems.

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-23 3:59 ` Rik van Riel
@ 2007-04-23 9:20 ` Rik van Riel
2007-04-23 10:21 ` Nick Piggin
2007-04-23 11:45 ` Rik van Riel
0 siblings, 2 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-23 9:20 UTC (permalink / raw)
To: Rik van Riel
Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

[-- Attachment #1: Type: text/plain, Size: 1961 bytes --]

Use TLB batching for MADV_FREE. Adds another 10-15% extra performance
to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
Rik van Riel wrote:

>> I've added a 5th column, with just your mmap_sem patch and
>> without my madv_free patch. It is run with the glibc patch,
>> which should make it fall back to MADV_DONTNEED after the
>> first MADV_FREE call fails.

With the attached patch to make MADV_FREE use tlb batching, not
only do we gain an additional 10-15% performance but Nick's
mmap_sem patch also shows the performance increase that we
expected to see.

It looks like the tlb flushes (and IPIs) from zap_pte_range()
could have been the problem. They're gone now.

The second column from the right has Nick's patch and my own
two patches. Performance with 16 threads is almost triple what
it used to be...

         vanilla   glibc     glibc     glibc     glibc     glibc     glibc
                             madv_free madv_free           madv_free madv_free
                                       mmap_sem  mmap_sem  mmap_sem
                                                           tlb batch tlb_batch
threads

1        610       609       596       545       534       547       537
2        1032      1136      1196      1200      1180      1293      1194
4        1070      1128      2014      2024      2027      2248      2040
8        1000      1088      1665      2087      2089      2314      1869
16       779       1073      1310      1999      2012      2214      1557

> Now that I think about it - this is all with the rawhide kernel
> configuration, which has an ungodly number of debug config
> options enabled.
>
> I should try this with a more normal kernel, on various different
> systems.

This is for another day. :)

First some ebizzy runs...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

[-- Attachment #2: linux-2.6-madv_free-lazytlb.patch --]
[-- Type: text/x-patch, Size: 690 bytes --]

--- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.000000000 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.000000000 -0400
@@ -677,11 +677,15 @@ static unsigned long zap_pte_range(struc
 					remove_exclusive_swap_page(page);
 					unlock_page(page);
 				}
-				ptep_clear_flush_dirty(vma, addr, pte);
-				ptep_clear_flush_young(vma, addr, pte);
 				SetPageLazyFree(page);
 				if (PageActive(page))
 					deactivate_tail_page(page);
+				ptent = *pte;
+				set_pte_at(mm, addr, pte,
+					pte_mkclean(pte_mkold(ptent)));
+				/* tlb_remove_page frees it again */
+				get_page(page);
+				tlb_remove_page(tlb, page);
 				continue;
 			}
 		}

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] lazy freeing of memory through MADV_FREE 2007-04-23 9:20 ` Rik van Riel @ 2007-04-23 10:21 ` Nick Piggin 2007-04-23 10:31 ` Rik van Riel 2007-04-23 10:44 ` Jakub Jelinek 2007-04-23 11:45 ` Rik van Riel 1 sibling, 2 replies; 43+ messages in thread From: Nick Piggin @ 2007-04-23 10:21 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper Rik van Riel wrote: > Use TLB batching for MADV_FREE. Adds another 10-15% extra performance > to the MySQL sysbench results on my quad core system. > > Signed-off-by: Rik van Riel <riel@redhat.com> > --- > Rik van Riel wrote: > >>> I've added a 5th column, with just your mmap_sem patch and >>> without my madv_free patch. It is run with the glibc patch, >>> which should make it fall back to MADV_DONTNEED after the >>> first MADV_FREE call fails. > > > With the attached patch to make MADV_FREE use tlb batching, not > only do we gain an additional 10-15% performance but Nick's > mmap_sem patch also shows the performance increase that we > expected to see. > > It looks like the tlb flushes (and IPIs) from zap_pte_range() > could have been the problem. They're gone now. I guess it is a good idea to batch these things. But can you do that on all architectures? What happens if your tlb flush happens after another thread already accesses it again, or after it subsequently gets removed from the address space via another CPU? > > The second column from the right has Nick's patch and my own > two patches. Performance with 16 threads is almost triple what > it used to be... 
> > vanilla glibc glibc glibc glibc glibc glibc > madv_free madv_free madv_free madv_free > mmap_sem mmap_sem mmap_sem > tlb batch tlb_batch > threads > > 1 610 609 596 545 534 547 537 > 2 1032 1136 1196 1200 1180 1293 1194 > 4 1070 1128 2014 2024 2027 2248 2040 > 8 1000 1088 1665 2087 2089 2314 1869 > 16 779 1073 1310 1999 2012 2214 1557 > > >> Now that I think about it - this is all with the rawhide kernel >> configuration, which has an ungodly number of debug config >> options enabled. >> >> I should try this with a more normal kernel, on various different >> systems. > > > This is for another day. :) > > First some ebizzy runs... > > > ------------------------------------------------------------------------ > > --- linux-2.6.20.x86_64/mm/memory.c.orig 2007-04-23 02:48:36.000000000 -0400 > +++ linux-2.6.20.x86_64/mm/memory.c 2007-04-23 02:54:42.000000000 -0400 > @@ -677,11 +677,15 @@ static unsigned long zap_pte_range(struc > remove_exclusive_swap_page(page); > unlock_page(page); > } > - ptep_clear_flush_dirty(vma, addr, pte); > - ptep_clear_flush_young(vma, addr, pte); > SetPageLazyFree(page); > if (PageActive(page)) > deactivate_tail_page(page); > + ptent = *pte; > + set_pte_at(mm, addr, pte, > + pte_mkclean(pte_mkold(ptent))); > + /* tlb_remove_page frees it again */ > + get_page(page); > + tlb_remove_page(tlb, page); > continue; > } > } -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-23 10:21 ` Nick Piggin
@ 2007-04-23 10:31 ` Rik van Riel
2007-04-23 10:35 ` Nick Piggin
2007-04-23 10:44 ` Jakub Jelinek
1 sibling, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2007-04-23 10:31 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Nick Piggin wrote:

>> It looks like the tlb flushes (and IPIs) from zap_pte_range()
>> could have been the problem. They're gone now.
>
> I guess it is a good idea to batch these things. But can you
> do that on all architectures? What happens if your tlb flush
> happens after another thread already accesses it again, or
> after it subsequently gets removed from the address space via
> another CPU?

I have thought about this a lot tonight, and have come to the
conclusion that the batched flushes are ok.

The reason is simple:

1) we do the TLB flush before we return from the
   madvise(MADV_FREE) syscall.

2) anything that accesses the pages between the start
   and end of the MADV_FREE procedure does not know in
   which order we go through the pages, so it could hit
   a page either before or after we get to processing
   it

3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them

4) because we flush the tlb before releasing the page
   table lock, other CPUs cannot remove this page from
   the address space - they will block on the page
   table lock before looking at this pte

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

^ permalink raw reply [flat|nested] 43+ messages in thread
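Points 1-3 above amount to a userspace contract: nothing may touch the range until madvise() has returned. A minimal sketch of how an allocator might honour that, assuming (as Jakub describes elsewhere in this thread) that the free path runs under an arena lock. The demo_arena structure and the demo_* functions are hypothetical and come from neither glibc nor the patch; the EINVAL fallback to MADV_DONTNEED only mirrors the behaviour described for the test glibc, and the value 8 for MADV_FREE is an assumption for headers that do not define it.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* assumed generic Linux value for old headers */
#endif

/* Hypothetical allocator arena: one lock guards one reusable range. */
struct demo_arena {
	pthread_mutex_t lock;
	void *base;
	size_t len;
	int in_use;
};

/* Lazily free a range; fall back to MADV_DONTNEED on kernels that
 * reject MADV_FREE with EINVAL. */
static int demo_lazy_free(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_FREE) == 0)
		return 0;
	if (errno != EINVAL)
		return -1;
	return madvise(addr, len, MADV_DONTNEED);
}

/* free() side: the range is only marked reusable after madvise()
 * has returned, and only while the arena lock is held. */
static int demo_arena_release(struct demo_arena *a)
{
	int ret;

	pthread_mutex_lock(&a->lock);
	ret = demo_lazy_free(a->base, a->len);
	if (ret == 0)
		a->in_use = 0;
	pthread_mutex_unlock(&a->lock);
	return ret;
}

/* malloc() side: takes the same lock, so it can never observe the
 * range while a release is still inside madvise(). */
static void *demo_arena_reuse(struct demo_arena *a)
{
	void *p = NULL;

	pthread_mutex_lock(&a->lock);
	if (!a->in_use) {
		a->in_use = 1;
		p = a->base;
	}
	pthread_mutex_unlock(&a->lock);
	return p;
}
```

Because both paths serialise on the same lock, a thread that reuses the range always does so after the syscall has completed, which is the only case the kernel side needs to get right.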
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-23 10:31 ` Rik van Riel
@ 2007-04-23 10:35 ` Nick Piggin
2007-04-23 10:44 ` Rik van Riel
2007-04-24 2:53 ` Rik van Riel
0 siblings, 2 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-23 10:35 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Rik van Riel wrote:
> Nick Piggin wrote:
>
>>> It looks like the tlb flushes (and IPIs) from zap_pte_range()
>>> could have been the problem. They're gone now.
>>
>> I guess it is a good idea to batch these things. But can you
>> do that on all architectures? What happens if your tlb flush
>> happens after another thread already accesses it again, or
>> after it subsequently gets removed from the address space via
>> another CPU?
>
> I have thought about this a lot tonight, and have come to the conclusion
> that they are ok.
>
> The reason is simple:
>
> 1) we do the TLB flush before we return from the
>    madvise(MADV_FREE) syscall.
>
> 2) anything that accessess the pages between the start
>    and end of the MADV_FREE procedure does not know in
>    which order we go through the pages, so it could hit
>    a page either before or after we get to processing
>    it
>
> 3) because of this, we can treat any such accesses as
>    happening simultaneously with the MADV_FREE and
>    as illegal, aka undefined behaviour territory and
>    we do not need to worry about them

Yes, but I'm wondering if it is legal in all architectures.

> 4) because we flush the tlb before releasing the page
>    table lock, other CPUs cannot remove this page from
>    the address space - they will block on the page
>    table lock before looking at this pte

We don't when the ptl is split.

What the tlb flush used to be able to assume is that the page
has been removed from the pagetables when they are put in the
tlb flush batch.

I'm not saying there are any bugs, but just suggesting there
might be.

--
SUSE Labs, Novell Inc.

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-23 10:35 ` Nick Piggin
@ 2007-04-23 10:44 ` Rik van Riel
2007-04-24 1:15 ` Nick Piggin
2007-04-24 2:53 ` Rik van Riel
1 sibling, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2007-04-23 10:44 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

[-- Attachment #1: Type: text/plain, Size: 1847 bytes --]

Use TLB batching for MADV_FREE. Adds another 10-15% extra performance
to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel <riel@redhat.com>
---

Nick Piggin wrote:

>> 3) because of this, we can treat any such accesses as
>>    happening simultaneously with the MADV_FREE and
>>    as illegal, aka undefined behaviour territory and
>>    we do not need to worry about them
>
> Yes, but I'm wondering if it is legal in all architectures.

It's similar to trying to access memory during an munmap.

You may be able to for a short time, but it'll come back to
haunt you.

>> 4) because we flush the tlb before releasing the page
>>    table lock, other CPUs cannot remove this page from
>>    the address space - they will block on the page
>>    table lock before looking at this pte
>
> We don't when the ptl is split.

Even then we do. Each invocation of zap_pte_range() only touches
one page table page, and it flushes the TLB before releasing the
page table lock.

> What the tlb flush used to be able to assume is that the page
> has been removed from the pagetables when they are put in the
> tlb flush batch.

All the tlb flush code seems to assume is that the tlb entries
should be invalidated.

> I'm not saying there is any bugs, but just suggesting there
> might be.

Jakub found a potential bug, in that I did not use an atomic
operation to clear the page table entries.

I've attached a new patch which simply uses
ptep_test_and_clear_dirty/young to get rid of the dirty and
accessed bits. It uses the same atomic accesses we use elsewhere
in the VM and the code is a line shorter than before.

Andrew, please use this one.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

[-- Attachment #2: linux-2.6-madv_free-lazytlb.patch --]
[-- Type: text/x-patch, Size: 697 bytes --]

--- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.000000000 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.000000000 -0400
@@ -677,11 +677,14 @@ static unsigned long zap_pte_range(struc
 					remove_exclusive_swap_page(page);
 					unlock_page(page);
 				}
-				ptep_clear_flush_dirty(vma, addr, pte);
-				ptep_clear_flush_young(vma, addr, pte);
+				ptep_test_and_clear_dirty(vma, addr, pte);
+				ptep_test_and_clear_young(vma, addr, pte);
 				SetPageLazyFree(page);
 				if (PageActive(page))
 					deactivate_tail_page(page);
+				/* tlb_remove_page frees it again */
+				get_page(page);
+				tlb_remove_page(tlb, page);
 				continue;
 			}
 		}

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] lazy freeing of memory through MADV_FREE 2007-04-23 10:44 ` Rik van Riel @ 2007-04-24 1:15 ` Nick Piggin 2007-04-24 1:58 ` Rik van Riel 0 siblings, 1 reply; 43+ messages in thread From: Nick Piggin @ 2007-04-24 1:15 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper Rik van Riel wrote: > Use TLB batching for MADV_FREE. Adds another 10-15% extra performance > to the MySQL sysbench results on my quad core system. > > Signed-off-by: Rik van Riel <riel@redhat.com> > --- > > Nick Piggin wrote: > >>> 3) because of this, we can treat any such accesses as >>> happening simultaneously with the MADV_FREE and >>> as illegal, aka undefined behaviour territory and >>> we do not need to worry about them >> >> >> Yes, but I'm wondering if it is legal in all architectures. > > > It's similar to trying to access memory during an munmap. > > You may be able to for a short time, but it'll come back to > haunt you. The question is whether the architecture specific tlb flushing code will break or not. >>> 4) because we flush the tlb before releasing the page >>> table lock, other CPUs cannot remove this page from >>> the address space - they will block on the page >>> table lock before looking at this pte >> >> >> We don't when the ptl is split. > > > Even then we do. Each invocation of zap_pte_range() only touches > one page table page, and it flushes the TLB before releasing the > page table lock. What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS. -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-24 1:15 ` Nick Piggin
@ 2007-04-24 1:58 ` Rik van Riel
2007-04-24 2:16 ` Nick Piggin
2007-04-24 4:42 ` Paul Mackerras
0 siblings, 2 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-24 1:58 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

[-- Attachment #1: Type: text/plain, Size: 1458 bytes --]

This should fix the MADV_FREE code for PPC's hashed tlb.

Signed-off-by: Rik van Riel <riel@redhat.com>
---

Nick Piggin wrote:
>> Nick Piggin wrote:
>>
>>>> 3) because of this, we can treat any such accesses as
>>>>    happening simultaneously with the MADV_FREE and
>>>>    as illegal, aka undefined behaviour territory and
>>>>    we do not need to worry about them
>>>
>>> Yes, but I'm wondering if it is legal in all architectures.
>>
>> It's similar to trying to access memory during an munmap.
>>
>> You may be able to for a short time, but it'll come back to
>> haunt you.
>
> The question is whether the architecture specific tlb
> flushing code will break or not.

I guess we'll need to call tlb_remove_tlb_entry() inside the
MADV_FREE code to keep powerpc happy.

Thanks for pointing this one out.

>> Even then we do. Each invocation of zap_pte_range() only touches
>> one page table page, and it flushes the TLB before releasing the
>> page table lock.
>
> What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.

Oh dear. I see it now...

The tlb end things inside zap_pte_range() are actually
noops and the actual tlb flush only happens inside
zap_page_range().

I guess the fact that munmap gets the mmap_sem for
writing should save us, though...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

[-- Attachment #2: linux-2.6-madv-ppcfix.patch --]
[-- Type: text/x-patch, Size: 453 bytes --]

--- linux-2.6.20.x86_64/mm/memory.c.noppc	2007-04-23 21:50:09.000000000 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 21:48:59.000000000 -0400
@@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc
 				}
 				ptep_test_and_clear_dirty(vma, addr, pte);
 				ptep_test_and_clear_young(vma, addr, pte);
+				tlb_remove_tlb_entry(tlb, pte, addr);
 				SetPageLazyFree(page);
 				if (PageActive(page))
 					deactivate_tail_page(page);

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] lazy freeing of memory through MADV_FREE 2007-04-24 1:58 ` Rik van Riel @ 2007-04-24 2:16 ` Nick Piggin 2007-04-24 4:42 ` Paul Mackerras 1 sibling, 0 replies; 43+ messages in thread From: Nick Piggin @ 2007-04-24 2:16 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper Rik van Riel wrote: > This should fix the MADV_FREE code for PPC's hashed tlb. > > Signed-off-by: Rik van Riel <riel@redhat.com> > --- > > Nick Piggin wrote: > >>> Nick Piggin wrote: >>> >>>>> 3) because of this, we can treat any such accesses as >>>>> happening simultaneously with the MADV_FREE and >>>>> as illegal, aka undefined behaviour territory and >>>>> we do not need to worry about them >>>> >>>> >>>> >>>> Yes, but I'm wondering if it is legal in all architectures. >>> >>> >>> >>> It's similar to trying to access memory during an munmap. >>> >>> You may be able to for a short time, but it'll come back to >>> haunt you. >> >> >> The question is whether the architecture specific tlb >> flushing code will break or not. > > > I guess we'll need to call tlb_remove_tlb_entry() inside the > MADV_FREE code to keep powerpc happy. > > Thanks for pointing this one out. > >>> Even then we do. Each invocation of zap_pte_range() only touches >>> one page table page, and it flushes the TLB before releasing the >>> page table lock. >> >> >> What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS. > > > Oh dear. I see it now... > > The tlb end things inside zap_pte_range() are actually > noops and the actual tlb flush only happens inside > zap_page_range(). > > I guess the fact that munmap gets the mmap_sem for > writing should save us, though... What about an unmap_mapping_range, or another MADV_FREE or MADV_DONTNEED? 
> > > ------------------------------------------------------------------------ > > --- linux-2.6.20.x86_64/mm/memory.c.noppc 2007-04-23 21:50:09.000000000 -0400 > +++ linux-2.6.20.x86_64/mm/memory.c 2007-04-23 21:48:59.000000000 -0400 > @@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc > } > ptep_test_and_clear_dirty(vma, addr, pte); > ptep_test_and_clear_young(vma, addr, pte); > + tlb_remove_tlb_entry(tlb, pte, addr); > SetPageLazyFree(page); > if (PageActive(page)) > deactivate_tail_page(page); -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] lazy freeing of memory through MADV_FREE 2007-04-24 1:58 ` Rik van Riel 2007-04-24 2:16 ` Nick Piggin @ 2007-04-24 4:42 ` Paul Mackerras 2007-04-24 5:13 ` Rik van Riel 1 sibling, 1 reply; 43+ messages in thread From: Paul Mackerras @ 2007-04-24 4:42 UTC (permalink / raw) To: Rik van Riel Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper Rik van Riel writes: > I guess we'll need to call tlb_remove_tlb_entry() inside the > MADV_FREE code to keep powerpc happy. I don't see why; once ptep_test_and_clear_young has returned, the entry in the hash table has already been removed. Adding the tlb_remove_tlb_entry call certainly won't do anything on 64-bit powerpc, since it expands to do {} while (0) there, and in fact it won't do anything on 32-bit powerpc either. Paul. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
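For readers unfamiliar with the idiom Paul mentions, the sketch below shows why a hook with nothing to do is usually written as "do {} while (0)" rather than left empty: the call site still parses as exactly one statement. The demo_remove_tlb_entry macro and demo_zap_one function are hypothetical stand-ins for illustration, not the powerpc definitions.

```c
/* Hypothetical stand-in for an arch hook that has nothing to do. */
#define demo_remove_tlb_entry(tlb, ptep, addr) do { } while (0)

/* The no-op macro can sit in an unbraced if/else like any other
 * single statement, and the compiler emits no code for it. */
static int demo_zap_one(int present)
{
	if (present)
		demo_remove_tlb_entry(0, 0, 0);
	else
		return -1;	/* still pairs with the 'if' above */

	return 1;
}
```

This is why adding such a call can be harmless but also pointless on an architecture where the hook is defined this way.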
* Re: [PATCH] lazy freeing of memory through MADV_FREE 2007-04-24 4:42 ` Paul Mackerras @ 2007-04-24 5:13 ` Rik van Riel 0 siblings, 0 replies; 43+ messages in thread From: Rik van Riel @ 2007-04-24 5:13 UTC (permalink / raw) To: Paul Mackerras Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper Paul Mackerras wrote: > Rik van Riel writes: > >> I guess we'll need to call tlb_remove_tlb_entry() inside the >> MADV_FREE code to keep powerpc happy. > > I don't see why; once ptep_test_and_clear_young has returned, the > entry in the hash table has already been removed. OK, so this one won't be necessary. Good to know that. Andrew, it looks like things won't be that bad :) -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-23 10:35 ` Nick Piggin
2007-04-23 10:44 ` Rik van Riel
@ 2007-04-24 2:53 ` Rik van Riel
2007-04-24 3:08 ` Andrew Morton
1 sibling, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2007-04-24 2:53 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

[-- Attachment #1: Type: text/plain, Size: 838 bytes --]

Nick Piggin wrote:

> What the tlb flush used to be able to assume is that the page
> has been removed from the pagetables when they are put in the
> tlb flush batch.

I think this is still the case, to a degree. There should be
no harm in removing the TLB entries after the page table has
been unlocked, right?

Or is something like the attached really needed?

From what I can see, the page table lock should be enough
synchronization between unmap_mapping_range, MADV_FREE and
MADV_DONTNEED.

I don't see why we need the attached, but in case you find
a good reason, here's my signed-off-by line for Andrew :)

Signed-off-by: Rik van Riel <riel@redhat.com>

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

[-- Attachment #2: linux-2.6-madv_free-flushme.patch --]
[-- Type: text/x-patch, Size: 750 bytes --]

--- linux-2.6.20.x86_64/mm/memory.c.flushme	2007-04-23 22:26:06.000000000 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 22:42:06.000000000 -0400
@@ -628,6 +628,7 @@ static unsigned long zap_pte_range(struc
 				long *zap_work, struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
+	unsigned long start_addr = addr;
 	pte_t *pte;
 	spinlock_t *ptl;
 	int file_rss = 0;
@@ -726,6 +727,11 @@ static unsigned long zap_pte_range(struc
 
 	add_mm_rss(mm, file_rss, anon_rss);
 	arch_leave_lazy_mmu_mode();
+	if (details && details->madv_free) {
+		/* Protect against MADV_DONTNEED or unmap_mapping_range */
+		tlb_finish_mmu(tlb, start_addr, addr);
+		tlb = tlb_gather_mmu(mm, 0);
+	}
 	pte_unmap_unlock(pte - 1, ptl);
 
 	return addr;

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-24 2:53 ` Rik van Riel
@ 2007-04-24 3:08 ` Andrew Morton
0 siblings, 0 replies; 43+ messages in thread
From: Andrew Morton @ 2007-04-24 3:08 UTC (permalink / raw)
To: Rik van Riel; +Cc: Nick Piggin, linux-kernel, linux-mm, shak, jakub, drepper

On Mon, 23 Apr 2007 22:53:49 -0400 Rik van Riel <riel@redhat.com> wrote:

> I don't see why we need the attached, but in case you find
> a good reason, here's my signed-off-by line for Andrew :)

Andrew is in a defensive crouch trying to work his way through all the bugs
he's been sent.

After I've managed to release 2.6.21-rc7-mm1 (say, December) I expect I'll
drop the MADV_FREE stuff and give you a run at creating a new patch series.

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-23 10:21 ` Nick Piggin
2007-04-23 10:31 ` Rik van Riel
@ 2007-04-23 10:44 ` Jakub Jelinek
1 sibling, 0 replies; 43+ messages in thread
From: Jakub Jelinek @ 2007-04-23 10:44 UTC (permalink / raw)
To: Nick Piggin
Cc: Rik van Riel, Andrew Morton, linux-kernel, linux-mm, shak, drepper

On Mon, Apr 23, 2007 at 08:21:37PM +1000, Nick Piggin wrote:
> I guess it is a good idea to batch these things. But can you
> do that on all architectures? What happens if your tlb flush
> happens after another thread already accesses it again, or
> after it subsequently gets removed from the address space via
> another CPU?

Accessing the page by another thread before madvise (MADV_FREE)
returns is undefined behavior; it can act as if that access happened
right before the madvise (MADV_FREE) call or right after it.
That's ok for glibc and supposedly any other malloc implementation:
madvise (MADV_FREE) is called while holding the containing arena's
lock, and for whatever malloc implementation, madvise (MADV_FREE)
would be part of free operations and you definitely need some
synchronization between one thread freeing some memory and another
thread deciding to reuse that memory and return it from
malloc/realloc/calloc/etc.

My only concern is whether using a non-atomic update of the pte is ok
or not.
The ptep_test_and_clear_young/ptep_test_and_clear_dirty calls Rik's
patch was doing before are done using atomic instructions, at least on
x86_64.
The operation we want for MADV_FREE is: clear the young/dirty bits if
they have been set on entry to the MADV_FREE madvise call, undefined
values for these 2 bits if some other task modifies the young/dirty
bits concurrently with this MADV_FREE zap_page_range, but I'd say the
other bits need to be unmodified.
Now, is there some kernel code which, while either not holding the
corresponding mmap_sem at all or holding it just for reading
(down_read), modifies other bits in the pte?
If yes, we need to do this clearing atomically, basically do a cmpxchg
loop until we succeed in clearing the 2 bits and then flush the tlb if
any of them was set before (ptep_test_and_clear_dirty_and_young?); if
not, set_pte_at is ok and faster than a lock prefixed insn.

	Jakub

^ permalink raw reply [flat|nested] 43+ messages in thread
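The cmpxchg loop Jakub describes can be sketched in userspace with C11 atomics. This is only an illustration of the technique under stated assumptions, not kernel code: the DEMO_PTE_* bit positions and the function name demo_test_and_clear_dirty_and_young are made up for the example, and real PTE layouts and accessors are architecture-specific.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define DEMO_PTE_YOUNG	(UINT64_C(1) << 5)
#define DEMO_PTE_DIRTY	(UINT64_C(1) << 6)

/*
 * Atomically clear the young and dirty bits while preserving whatever
 * a concurrent writer last stored in the other bits.  Returns true if
 * either bit was set, i.e. if the caller would need a TLB flush.
 */
static bool demo_test_and_clear_dirty_and_young(_Atomic uint64_t *pte)
{
	const uint64_t mask = DEMO_PTE_YOUNG | DEMO_PTE_DIRTY;
	uint64_t old = atomic_load(pte);
	uint64_t new;

	do {
		new = old & ~mask;
		/* On failure 'old' is refreshed with the current value,
		 * so the next iteration recomputes 'new' from it. */
	} while (!atomic_compare_exchange_weak(pte, &old, new));

	return (old & mask) != 0;
}
```

For a pure bit-clear like this one, a single atomic_fetch_and(pte, ~mask) would give the same result without a loop; the compare-exchange form is shown because it is the shape named in the message and generalises to updates that are not a simple mask.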
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-23 9:20 ` Rik van Riel
2007-04-23 10:21 ` Nick Piggin
@ 2007-04-23 11:45 ` Rik van Riel
1 sibling, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-23 11:45 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak, jakub, drepper

Rik van Riel wrote:

> First some ebizzy runs...

This is interesting. Ginormous speedups in ebizzy[1] on my quad core
test system. The following numbers are the average of 10 runs, since
ebizzy shows some variability.

You can see a big influence from the tlb batching and from Nick's
mmap_sem patch. The reduction in system time from 100 seconds to 3
seconds is way more than I had expected, but I'm not complaining.
The 4 fold reduction in wall clock time is a nice bonus.

According to Val, ebizzy shows the weaknesses of Linux with a real
workload, so this could be a useful result.

kernel             user    system  wall clock  %CPU

vanilla            186s    101s    123s        230%
madv_free (madv)   175s    96s     120s        230%
mmap_sem (sem)     100s    40s     40s         370%
madv+sem           200s    140s    100s        393%
madv+sem+tlb       118s    3s      30s         395%
madv+tlb           150s    10s     50s         310%

[1] http://www.ussg.iu.edu/hypermail/linux/kernel/0604.2/1699.html

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-22 2:36 ` Nick Piggin
2007-04-22 2:50 ` Nick Piggin
2007-04-22 6:31 ` Rik van Riel
@ 2007-04-23 4:28 ` Rik van Riel
2 siblings, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-23 4:28 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, shak

Nick Piggin wrote:
> So where is the down_write coming from in this workload, I wonder?
> Heap management? What syscalls?

Trying to answer this question, I straced the mysql threads that
showed up in top when running a single threaded sysbench workload.
There were no mmap, munmap, brk, mprotect or madvise system calls
in the trace.

MySQL has me puzzled, but it seems to have some other people
interested too.

I think I'll go play a bit with ebizzy now, to see how other
workloads are affected by our kernel changes.
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-20 21:38 ` Rik van Riel
2007-04-20 22:06 ` Andrew Morton
@ 2007-04-21 7:24 ` Hugh Dickins
2007-04-21 18:06 ` Rik van Riel
1 sibling, 1 reply; 43+ messages in thread
From: Hugh Dickins @ 2007-04-21 7:24 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm

On Fri, 20 Apr 2007, Rik van Riel wrote:
> Andrew Morton wrote:
>
> > I do go on about that. But we're adding page flags at about one per
> > year, and when we run out we're screwed - we'll need to grow the
> > pageframe.
>
> If you want, I can take a look at folding this into the
> ->mapping pointer. I can guarantee you it won't be
> pretty, though :)

Please don't. If we're going to stuff another pageflag into there,
let it be PageSwapCache the natural partner of PageAnon, rather than
whatever our latest pageflag happens to be.

I'll look into it - but do keep an eye on me, I've developed a
dubious track record of obstructing other people's attempts to
save pageflags.

Hugh
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-21 7:24 ` Hugh Dickins
@ 2007-04-21 18:06 ` Rik van Riel
0 siblings, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2007-04-21 18:06 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm

Hugh Dickins wrote:
> On Fri, 20 Apr 2007, Rik van Riel wrote:
>> Andrew Morton wrote:
>>
>>> I do go on about that. But we're adding page flags at about one per
>>> year, and when we run out we're screwed - we'll need to grow the
>>> pageframe.
>> If you want, I can take a look at folding this into the
>> ->mapping pointer. I can guarantee you it won't be
>> pretty, though :)
>
> Please don't. If we're going to stuff another pageflag into there,
> let it be PageSwapCache the natural partner of PageAnon, rather than
> whatever our latest pageflag happens to be.

I looked at doing what Andrew wanted, and it did indeed not look
like the right thing to do. The locking on page->mapping is the
kind of locking we want to avoid during zap_page_range and in the
pageout code.

I like your suggestion better.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
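For readers unfamiliar with what "folding a flag into the ->mapping pointer" means, here is a minimal userspace sketch of the general technique: storing flag bits in the low bits of an aligned pointer, in the way PageAnon is kept in the low bit of page->mapping. Everything named demo_* or DEMO_* is hypothetical and is not kernel code; in particular, the second flag bit only illustrates Hugh's PageSwapCache suggestion and is an assumption, not something the thread implements. The sketch says nothing about the locking concerns Rik raises.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits kept in the two low bits of the pointer. */
#define DEMO_MAPPING_ANON      ((uintptr_t)1)
#define DEMO_MAPPING_SWAPCACHE ((uintptr_t)2)
#define DEMO_MAPPING_FLAGS     (DEMO_MAPPING_ANON | DEMO_MAPPING_SWAPCACHE)

/* Stand-in for struct page; only the field needed for the illustration. */
struct demo_page {
    uintptr_t mapping;   /* pointer value with flag bits OR-ed into the low bits */
};

/* Store a pointer plus flags; the pointer must be at least 4-byte aligned. */
void demo_set_mapping(struct demo_page *page, void *ptr, uintptr_t flags)
{
    assert(((uintptr_t)ptr & DEMO_MAPPING_FLAGS) == 0);
    assert((flags & ~DEMO_MAPPING_FLAGS) == 0);
    page->mapping = (uintptr_t)ptr | flags;
}

/* Recover the original pointer by masking the flag bits off again. */
void *demo_mapping_ptr(const struct demo_page *page)
{
    return (void *)(page->mapping & ~DEMO_MAPPING_FLAGS);
}

bool demo_page_anon(const struct demo_page *page)
{
    return (page->mapping & DEMO_MAPPING_ANON) != 0;
}

bool demo_page_swapcache(const struct demo_page *page)
{
    return (page->mapping & DEMO_MAPPING_SWAPCACHE) != 0;
}
```

The trick saves a bit in the flags word, at the price that every reader of the field has to mask before dereferencing, and that changing a flag means modifying the pointer field itself.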
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-17 7:15 [PATCH] lazy freeing of memory through MADV_FREE Rik van Riel
2007-04-19 21:15 ` [PATCH] lazy freeing of memory through MADV_FREE 2/2 Rik van Riel
2007-04-20 20:57 ` [PATCH] lazy freeing of memory through MADV_FREE Andrew Morton
@ 2007-04-22 8:18 ` Andrew Morton
2007-04-22 9:16 ` Christoph Hellwig
2 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2007-04-22 8:18 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, linux-mm, David S. Miller

On Tue, 17 Apr 2007 03:15:51 -0400 Rik van Riel <riel@redhat.com> wrote:

> Make it possible for applications to have the kernel free memory
> lazily. This reduces a repeated free/malloc cycle from freeing
> pages and allocating them, to just marking them freeable. If the
> application wants to reuse them before the kernel needs the memory,
> not even a page fault will happen.
>
> This patch, together with Ulrich's glibc change, increases
> MySQL sysbench performance by a factor of 2 on my quad core
> test system.
>

In file included from include/linux/mman.h:4,
                 from arch/sparc64/kernel/sys_sparc.c:19:
include/asm/mman.h:36:1: "MADV_FREE" redefined
In file included from include/asm/mman.h:5,
                 from include/linux/mman.h:4,
                 from arch/sparc64/kernel/sys_sparc.c:19:
include/asm-generic/mman.h:32:1: this is the location of the previous definition

sparc32 and sparc64 already defined MADV_FREE:

#define MADV_FREE 0x5 /* (Solaris) contents can be freed */

I'll remove the sparc definitions for now, but we need to work out what
we're going to do here. Your patch changes the values of MADV_FREE on
sparc.

Perhaps this should be renamed to MADV_FREE_LINUX and given a different
number. It depends on how close your proposed behaviour is to Solaris's.
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-22 8:18 ` Andrew Morton
@ 2007-04-22 9:16 ` Christoph Hellwig
2007-04-22 16:55 ` Ulrich Drepper
0 siblings, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2007-04-22 9:16 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, linux-kernel, linux-mm, David S. Miller

On Sun, Apr 22, 2007 at 01:18:10AM -0700, Andrew Morton wrote:
> On Tue, 17 Apr 2007 03:15:51 -0400 Rik van Riel <riel@redhat.com> wrote:
>
> > Make it possible for applications to have the kernel free memory
> > lazily. This reduces a repeated free/malloc cycle from freeing
> > pages and allocating them, to just marking them freeable. If the
> > application wants to reuse them before the kernel needs the memory,
> > not even a page fault will happen.
> >
> > This patch, together with Ulrich's glibc change, increases
> > MySQL sysbench performance by a factor of 2 on my quad core
> > test system.
> >
>
> In file included from include/linux/mman.h:4,
>                  from arch/sparc64/kernel/sys_sparc.c:19:
> include/asm/mman.h:36:1: "MADV_FREE" redefined
> In file included from include/asm/mman.h:5,
>                  from include/linux/mman.h:4,
>                  from arch/sparc64/kernel/sys_sparc.c:19:
> include/asm-generic/mman.h:32:1: this is the location of the previous definition
>
> sparc32 and sparc64 already defined MADV_FREE:
>
>
> #define MADV_FREE 0x5 /* (Solaris) contents can be freed */
>
> I'll remove the sparc definitions for now, but we need to work out what
> we're going to do here. Your patch changes the values of MADV_FREE on
> sparc.
>
> Perhaps this should be renamed to MADV_FREE_LINUX and given a different
> number. It depends on how close your proposed behaviour is to Solaris's.

Why isn't MADV_FREE defined to 5 for linux? It's our first free madv
value? Also the behaviour should better match the one in solaris or BSD,
the last thing we need is slightly different behaviour from operating
systems supporting this for ages.
* Re: [PATCH] lazy freeing of memory through MADV_FREE
2007-04-22 9:16 ` Christoph Hellwig
@ 2007-04-22 16:55 ` Ulrich Drepper
0 siblings, 0 replies; 43+ messages in thread
From: Ulrich Drepper @ 2007-04-22 16:55 UTC (permalink / raw)
To: Christoph Hellwig, Andrew Morton, Rik van Riel, linux-kernel, linux-mm, David S. Miller

On 4/22/07, Christoph Hellwig <hch@infradead.org> wrote:
> Why isn't MADV_FREE defined to 5 for linux? It's our first free madv
> value? Also the behaviour should better match the one in solaris or BSD,
> the last thing we need is slightly different behaviour from operating
> systems supporting this for ages.

The behavior should indeed be identical. Both implementations
restrict MADV_FREE to work on anonymous memory and it is unspecified
whether a renewed access yields a zeroed page being created or
whether the old content is still there.

So, just use 0x5 for both the Linux and Solaris version on sparc.
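The semantics Ulrich describes can be exercised from userspace roughly as follows. This is a sketch under assumptions, not the glibc change referred to in the thread: the helper names lazy_free and lazy_free_demo are hypothetical, MADV_FREE is only used when the installed headers define it, and the code falls back to MADV_DONTNEED when the running kernel rejects MADV_FREE with EINVAL. After the call, the only things the program may rely on are that a read returns either the old byte or zero, and that a fresh write sticks.

```c
#define _DEFAULT_SOURCE   /* glibc: expose MAP_ANONYMOUS and madvise() */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: mark [buf, buf + len) as lazily freeable.
 * Returns 0 on success, -1 on failure (errno set by madvise).
 */
int lazy_free(void *buf, size_t len)
{
    assert(buf != NULL);
#ifdef MADV_FREE
    if (madvise(buf, len, MADV_FREE) == 0)
        return 0;
    if (errno != EINVAL)
        return -1;
    /* Kernel does not know MADV_FREE: fall through to the eager variant. */
#endif
    return madvise(buf, len, MADV_DONTNEED);
}

/* Returns 0 if the observed behaviour matches the description, -1 otherwise. */
int lazy_free_demo(void)
{
    size_t len = 16 * 4096;
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0xAB, len);              /* touch every page */

    if (lazy_free(buf, len) != 0) {
        munmap(buf, len);
        return -1;
    }

    /* Unspecified which one we see: the old content or a zeroed page. */
    unsigned char seen = buf[0];
    int ok = (seen == 0xAB || seen == 0x00);

    /* Writing again makes the page live; the new value must be kept. */
    buf[0] = 0x5A;
    ok = ok && (buf[0] == 0x5A);

    munmap(buf, len);
    return ok ? 0 : -1;
}
```

The region is anonymous private memory because, as stated above, that is the only kind of mapping both implementations accept for MADV_FREE.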
end of thread, other threads:[~2007-04-24 5:13 UTC | newest]

Thread overview: 43+ messages
2007-04-17 7:15 [PATCH] lazy freeing of memory through MADV_FREE Rik van Riel
2007-04-19 21:15 ` [PATCH] lazy freeing of memory through MADV_FREE 2/2 Rik van Riel
2007-04-20 21:03 ` Andrew Morton
2007-04-20 21:24 ` Ulrich Drepper
2007-04-21 7:37 ` Hugh Dickins
2007-04-21 16:32 ` Ulrich Drepper
2007-04-20 20:57 ` [PATCH] lazy freeing of memory through MADV_FREE Andrew Morton
2007-04-20 21:38 ` Rik van Riel
2007-04-20 22:06 ` Andrew Morton
2007-04-20 23:52 ` Rik van Riel
2007-04-21 0:48 ` Eric Dumazet
2007-04-21 3:58 ` Rik van Riel
2007-04-21 7:12 ` Jakub Jelinek
2007-04-23 4:36 ` Nick Piggin
2007-04-22 2:36 ` Nick Piggin
2007-04-22 2:50 ` Nick Piggin
2007-04-22 6:31 ` Rik van Riel
2007-04-23 0:16 ` Nick Piggin
2007-04-23 3:53 ` Rik van Riel
2007-04-23 3:58 ` Nick Piggin
2007-04-23 10:07 ` Nick Piggin
2007-04-23 10:12 ` Rik van Riel
2007-04-23 3:59 ` Rik van Riel
2007-04-23 9:20 ` Rik van Riel
2007-04-23 10:21 ` Nick Piggin
2007-04-23 10:31 ` Rik van Riel
2007-04-23 10:35 ` Nick Piggin
2007-04-23 10:44 ` Rik van Riel
2007-04-24 1:15 ` Nick Piggin
2007-04-24 1:58 ` Rik van Riel
2007-04-24 2:16 ` Nick Piggin
2007-04-24 4:42 ` Paul Mackerras
2007-04-24 5:13 ` Rik van Riel
2007-04-24 2:53 ` Rik van Riel
2007-04-24 3:08 ` Andrew Morton
2007-04-23 10:44 ` Jakub Jelinek
2007-04-23 11:45 ` Rik van Riel
2007-04-23 4:28 ` Rik van Riel
2007-04-21 7:24 ` Hugh Dickins
2007-04-21 18:06 ` Rik van Riel
2007-04-22 8:18 ` Andrew Morton
2007-04-22 9:16 ` Christoph Hellwig
2007-04-22 16:55 ` Ulrich Drepper