linux-mm.kvack.org archive mirror
* [PATCH] make MADV_FREE lazily free memory
@ 2007-04-11  4:30 Rik van Riel
  2007-04-11 22:41 ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2007-04-11  4:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Ulrich Drepper

[-- Attachment #1: Type: text/plain, Size: 934 bytes --]

Make it possible for applications to have the kernel free memory
lazily.  This reduces a repeated free/malloc cycle from freeing
pages and allocating them, to just marking them freeable.  If the
application wants to reuse them before the kernel needs the memory,
not even a page fault will happen.
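
As a rough illustration (not part of the patch), the usage pattern this
is aimed at looks like the sketch below.  MADV_FREE may not be in the
userspace headers yet, so the sketch falls back to MADV_DONTNEED, which
this patch gives the same behaviour for the moment.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <string.h>

#ifndef MADV_FREE
#define MADV_FREE MADV_DONTNEED		/* same behaviour with this patch */
#endif

#define LEN	(1024 * 1024)

int main(void)
{
	/* anonymous memory that a malloc arena might hand out */
	char *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 0xaa, LEN);		/* application uses the memory */

	/*
	 * "free": the contents are disposable.  The mapping stays in
	 * place and the pages are only taken back if the kernel needs
	 * the memory.
	 */
	madvise(buf, LEN, MADV_FREE);

	/*
	 * "malloc" again: just reuse the range.  If the kernel never
	 * reclaimed the pages, no page fault happens; if it did, the
	 * faults hand back fresh zeroed pages.
	 */
	memset(buf, 0x55, LEN);

	munmap(buf, LEN);
	return 0;
}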

This version has one bugfix over the last one: if a PG_lazyfree
page was found dirty at fork time, we clear the flag in
copy_one_pte().

Ulrich Drepper has test glibc RPMs for this functionality at:

	http://people.redhat.com/drepper/rpms

Because MADV_FREE has not been defined as a fixed number yet,
for the moment MADV_DONTNEED is defined to have the same
functionality.

Any test results of this patch in combination with Ulrich's
test glibc are welcome.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

[-- Attachment #2: linux-2.6-madv_free.patch --]
[-- Type: text/x-patch, Size: 12444 bytes --]

--- linux-2.6.20.noarch/include/asm-alpha/mman.h.madvise	2007-04-04 16:44:50.000000000 -0400
+++ linux-2.6.20.noarch/include/asm-alpha/mman.h	2007-04-04 16:56:24.000000000 -0400
@@ -42,6 +42,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
 #define MADV_DONTNEED	6		/* don't need these pages */
+#define MADV_FREE	7		/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-generic/mman.h.madvise	2007-04-04 16:44:50.000000000 -0400
+++ linux-2.6.20.noarch/include/asm-generic/mman.h	2007-04-04 16:56:53.000000000 -0400
@@ -29,6 +29,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-mips/mman.h.madvise	2007-04-04 16:44:50.000000000 -0400
+++ linux-2.6.20.noarch/include/asm-mips/mman.h	2007-04-04 16:58:02.000000000 -0400
@@ -65,6 +65,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-parisc/mman.h.madvise	2007-04-04 16:44:50.000000000 -0400
+++ linux-2.6.20.noarch/include/asm-parisc/mman.h	2007-04-04 16:58:40.000000000 -0400
@@ -38,6 +38,7 @@
 #define MADV_SPACEAVAIL 5               /* insure that resources are reserved */
 #define MADV_VPS_PURGE  6               /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7              /* Inherit parents page size */
+#define MADV_FREE	8		/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-xtensa/mman.h.madvise	2007-04-04 16:44:51.000000000 -0400
+++ linux-2.6.20.noarch/include/asm-xtensa/mman.h	2007-04-04 16:59:14.000000000 -0400
@@ -72,6 +72,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/linux/mm_inline.h.madvise	2007-04-03 22:53:25.000000000 -0400
+++ linux-2.6.20.noarch/include/linux/mm_inline.h	2007-04-04 22:19:24.000000000 -0400
@@ -13,6 +13,13 @@ add_page_to_inactive_list(struct zone *z
 }
 
 static inline void
+add_page_to_inactive_list_tail(struct zone *zone, struct page *page)
+{
+	list_add_tail(&page->lru, &zone->inactive_list);
+	__inc_zone_state(zone, NR_INACTIVE);
+}
+
+static inline void
 del_page_from_active_list(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);
--- linux-2.6.20.noarch/include/linux/mm.h.madvise	2007-04-03 22:53:25.000000000 -0400
+++ linux-2.6.20.noarch/include/linux/mm.h	2007-04-04 22:06:45.000000000 -0400
@@ -716,6 +716,7 @@ struct zap_details {
 	pgoff_t last_index;			/* Highest page->index to unmap */
 	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
+	short madv_free;			/* MADV_FREE anonymous memory */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
--- linux-2.6.20.noarch/include/linux/page-flags.h.madvise	2007-04-03 22:54:58.000000000 -0400
+++ linux-2.6.20.noarch/include/linux/page-flags.h	2007-04-05 01:27:38.000000000 -0400
@@ -91,6 +91,8 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define PG_lazyfree		20	/* MADV_FREE potential throwaway */
+
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
 
@@ -237,6 +239,11 @@ static inline void SetPageUptodate(struc
 #define ClearPageReclaim(page)	clear_bit(PG_reclaim, &(page)->flags)
 #define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags)
 
+#define PageLazyFree(page)	test_bit(PG_lazyfree, &(page)->flags)
+#define SetPageLazyFree(page)	set_bit(PG_lazyfree, &(page)->flags)
+#define ClearPageLazyFree(page)	clear_bit(PG_lazyfree, &(page)->flags)
+#define __ClearPageLazyFree(page) __clear_bit(PG_lazyfree, &(page)->flags)
+
 #define PageCompound(page)	test_bit(PG_compound, &(page)->flags)
 #define __SetPageCompound(page)	__set_bit(PG_compound, &(page)->flags)
 #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
--- linux-2.6.20.noarch/include/linux/swap.h.madvise	2007-04-05 00:29:40.000000000 -0400
+++ linux-2.6.20.noarch/include/linux/swap.h	2007-04-06 17:19:20.000000000 -0400
@@ -181,6 +181,7 @@ extern unsigned int nr_free_pagecache_pa
 extern void FASTCALL(lru_cache_add(struct page *));
 extern void FASTCALL(lru_cache_add_active(struct page *));
 extern void FASTCALL(activate_page(struct page *));
+extern void FASTCALL(deactivate_tail_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
 extern int lru_add_drain_all(void);
@@ -232,7 +233,6 @@ extern void delete_from_swap_cache(struc
 extern int move_to_swap_cache(struct page *, swp_entry_t);
 extern int move_from_swap_cache(struct page *, unsigned long,
 		struct address_space *);
-extern void free_swap_cache(struct page *page);
 extern void free_page_and_swap_cache(struct page *);
 extern void free_pages_and_swap_cache(struct page **, int);
 extern struct page * lookup_swap_cache(swp_entry_t);
--- linux-2.6.20.noarch/mm/madvise.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/madvise.c	2007-04-04 23:48:34.000000000 -0400
@@ -142,8 +142,12 @@ static long madvise_dontneed(struct vm_a
 			.last_index = ULONG_MAX,
 		};
 		zap_page_range(vma, start, end - start, &details);
-	} else
-		zap_page_range(vma, start, end - start, NULL);
+	} else {
+		struct zap_details details = {
+			.madv_free = 1,
+		};
+		zap_page_range(vma, start, end - start, &details);
+	}
 	return 0;
 }
 
@@ -209,7 +213,9 @@ madvise_vma(struct vm_area_struct *vma, 
 		error = madvise_willneed(vma, prev, start, end);
 		break;
 
+	/* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
 	case MADV_DONTNEED:
+	case MADV_FREE:
 		error = madvise_dontneed(vma, prev, start, end);
 		break;
 
--- linux-2.6.20.noarch/mm/memory.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/memory.c	2007-04-06 17:18:23.000000000 -0400
@@ -432,6 +432,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	int dirty = 0;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -466,6 +467,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags)) {
+		dirty = pte_dirty(pte);
 		ptep_set_wrprotect(src_mm, addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
@@ -483,6 +485,8 @@ copy_one_pte(struct mm_struct *dst_mm, s
 		get_page(page);
 		page_dup_rmap(page);
 		rss[!!PageAnon(page)]++;
+		if (dirty && PageLazyFree(page))
+			ClearPageLazyFree(page);
 	}
 
 out_set_pte:
@@ -661,6 +665,26 @@ static unsigned long zap_pte_range(struc
 				    (page->index < details->first_index ||
 				     page->index > details->last_index))
 					continue;
+
+				/*
+				 * MADV_FREE is used to lazily recycle
+				 * anon memory.  The process no longer
+				 * needs the data and wants to avoid IO.
+				 */
+				if (details->madv_free && PageAnon(page)) {
+					if (unlikely(PageSwapCache(page)) &&
+					    !TestSetPageLocked(page)) {
+						remove_exclusive_swap_page(page);
+						unlock_page(page);
+					}
+					/* Optimize this... */
+					ptep_clear_flush_dirty(vma, addr, pte);
+					ptep_clear_flush_young(vma, addr, pte);
+					SetPageLazyFree(page);
+					if (PageActive(page))
+						deactivate_tail_page(page);
+					continue;
+				}
 			}
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
@@ -689,7 +713,8 @@ static unsigned long zap_pte_range(struc
 		 * If details->check_mapping, we leave swap entries;
 		 * if details->nonlinear_vma, we leave file entries.
 		 */
-		if (unlikely(details))
+		if (unlikely(!details->check_mapping &&
+				!details->nonlinear_vma))
 			continue;
 		if (!pte_file(ptent))
 			free_swap_and_cache(pte_to_swp_entry(ptent));
@@ -755,7 +780,8 @@ static unsigned long unmap_page_range(st
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping && !details->nonlinear_vma)
+	if (details && !details->check_mapping && !details->nonlinear_vma
+			&& !details->madv_free)
 		details = NULL;
 
 	BUG_ON(addr >= end);
--- linux-2.6.20.noarch/mm/page_alloc.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/page_alloc.c	2007-04-05 01:27:55.000000000 -0400
@@ -203,6 +203,7 @@ static void bad_page(struct page *page)
 			1 << PG_slab    |
 			1 << PG_swapcache |
 			1 << PG_writeback |
+			1 << PG_lazyfree |
 			1 << PG_buddy );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
@@ -442,6 +443,8 @@ static inline int free_pages_check(struc
 		bad_page(page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
+	if (PageLazyFree(page))
+		__ClearPageLazyFree(page);
 	/*
 	 * For now, we report if PG_reserved was found set, but do not
 	 * clear it, and do not free the page.  But we shall soon need
@@ -588,6 +591,7 @@ static int prep_new_page(struct page *pa
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
+			1 << PG_lazyfree |
 			1 << PG_buddy ))))
 		bad_page(page);
 
--- linux-2.6.20.noarch/mm/rmap.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/rmap.c	2007-04-04 23:53:29.000000000 -0400
@@ -656,7 +656,17 @@ static int try_to_unmap_one(struct page 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
 
-	if (PageAnon(page)) {
+	/* MADV_FREE is used to lazily free memory from userspace. */
+	if (PageLazyFree(page) && !migration) {
+		/* There is new data in the page.  Reinstate it. */
+		if (unlikely(pte_dirty(pteval))) {
+			set_pte_at(mm, address, pte, pteval);
+			ret = SWAP_FAIL;
+			goto out_unmap;
+		}
+		/* Throw the page away. */
+		dec_mm_counter(mm, anon_rss);
+	} else if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
 
 		if (PageSwapCache(page)) {
--- linux-2.6.20.noarch/mm/swap.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/swap.c	2007-04-04 23:33:27.000000000 -0400
@@ -151,6 +151,20 @@ void fastcall activate_page(struct page 
 	spin_unlock_irq(&zone->lru_lock);
 }
 
+void fastcall deactivate_tail_page(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock_irq(&zone->lru_lock);
+	if (PageLRU(page) && PageActive(page)) {
+		del_page_from_active_list(zone, page);
+		ClearPageActive(page);
+		add_page_to_inactive_list_tail(zone, page);
+		__count_vm_event(PGDEACTIVATE);
+	}
+	spin_unlock_irq(&zone->lru_lock);
+}
+
 /*
  * Mark a page as having seen activity.
  *
--- linux-2.6.20.noarch/mm/vmscan.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/vmscan.c	2007-04-04 03:34:56.000000000 -0400
@@ -473,6 +473,24 @@ static unsigned long shrink_page_list(st
 
 		sc->nr_scanned++;
 
+		/* 
+		 * MADV_DONTNEED pages get reclaimed lazily, unless the
+		 * process reuses it before we get to it.
+		 */
+		if (PageLazyFree(page)) {
+			switch (try_to_unmap(page, 0)) {
+			case SWAP_FAIL:
+				ClearPageLazyFree(page);
+				goto activate_locked;
+			case SWAP_AGAIN:
+				ClearPageLazyFree(page);
+				goto keep_locked;
+			case SWAP_SUCCESS:
+				ClearPageLazyFree(page);
+				goto free_it;
+			}
+		}
+
 		if (!sc->may_swap && page_mapped(page))
 			goto keep_locked;
 


* Re: [PATCH] make MADV_FREE lazily free memory
  2007-04-11  4:30 [PATCH] make MADV_FREE lazily free memory Rik van Riel
@ 2007-04-11 22:41 ` Eric Dumazet
  2007-04-11 22:56   ` Rik van Riel
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2007-04-11 22:41 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, Ulrich Drepper

Rik van Riel wrote:
> Make it possible for applications to have the kernel free memory
> lazily.  This reduces a repeated free/malloc cycle from freeing
> pages and allocating them, to just marking them freeable.  If the
> application wants to reuse them before the kernel needs the memory,
> not even a page fault will happen.
> 

Hi Rik

I don't understand this last sentence. If not even a page fault happens, how
does the kernel know that the page was eventually reused by the application
and should not be freed in case of memory pressure?

ptr = mmap(some space);
madvise(ptr, length, MADV_FREE);
/* kernel may free the pages */
sleep(10);

/* what must the application do now before reusing the space? */
memset(ptr, data, 10000);
/* kernel should not free ptr[0..10000] now */

Thank you


* Re: [PATCH] make MADV_FREE lazily free memory
  2007-04-11 22:41 ` Eric Dumazet
@ 2007-04-11 22:56   ` Rik van Riel
  2007-04-12  5:44     ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2007-04-11 22:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, linux-mm, Ulrich Drepper

Eric Dumazet wrote:
> Rik van Riel wrote:
>> Make it possible for applications to have the kernel free memory
>> lazily.  This reduces a repeated free/malloc cycle from freeing
>> pages and allocating them, to just marking them freeable.  If the
>> application wants to reuse them before the kernel needs the memory,
>> not even a page fault will happen.

> I don't understand this last sentence. If not even a page fault happens,
> how does the kernel know that the page was eventually reused by the
> application and should not be freed in case of memory pressure?

Before maybe freeing the page, the kernel checks the referenced
and dirty bits of the page table entries mapping that page.

> ptr = mmap(some space);
> madvise(ptr, length, MADV_FREE);
> /* kernel may free the pages */

All this call does is:
- clear the accessed and dirty bits
- move the page to the far end of the inactive list,
   where it will be the first to be reclaimed

> sleep(10);
> 
> /* what must the application do now before reusing the space? */
> memset(ptr, data, 10000);
> /* kernel should not free ptr[0..10000] now */

Two things can happen here.

If this program used the pages before the kernel needed
them, the program will be reusing its old pages.

If the kernel got there first, you will get page faults
and the kernel will fill in the memory with new pages.

Both of these alternatives are transparent to userspace.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: [PATCH] make MADV_FREE lazily free memory
  2007-04-11 22:56   ` Rik van Riel
@ 2007-04-12  5:44     ` Eric Dumazet
  2007-04-12  6:08       ` Nick Piggin
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2007-04-12  5:44 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, Ulrich Drepper

Rik van Riel wrote:
> Eric Dumazet wrote:
>> Rik van Riel wrote:
>>> Make it possible for applications to have the kernel free memory
>>> lazily.  This reduces a repeated free/malloc cycle from freeing
>>> pages and allocating them, to just marking them freeable.  If the
>>> application wants to reuse them before the kernel needs the memory,
>>> not even a page fault will happen.
> 
>> I don't understand this last sentence. If not even a page fault
>> happens, how does the kernel know that the page was eventually reused by
>> the application and should not be freed in case of memory pressure?
> 
> Before maybe freeing the page, the kernel checks the referenced
> and dirty bits of the page table entries mapping that page.
> 
>> ptr = mmap(some space);
>> madvise(ptr, length, MADV_FREE);
>> /* kernel may free the pages */
> 
> All this call does is:
> - clear the accessed and dirty bits
> - move the page to the far end of the inactive list,
>   where it will be the first to be reclaimed
> 
>> sleep(10);
>>
>> /* what must the application do now before reusing the space? */
>> memset(ptr, data, 10000);
>> /* kernel should not free ptr[0..10000] now */
> 
> Two things can happen here.
> 
> If this program used the pages before the kernel needed
> them, the program will be reusing its old pages.

ah ok, this is because accessed/dirty bits are set by hardware and not a page 
fault. Is it true for all architectures?

> 
> If the kernel got there first, you will get page faults
> and the kernel will fill in the memory with new pages.

perfect

> 
> Both of these alternatives are transparent to userspace.
> 

Thanks a lot for these clarifications. This will fly :)


* Re: [PATCH] make MADV_FREE lazily free memory
  2007-04-12  5:44     ` Eric Dumazet
@ 2007-04-12  6:08       ` Nick Piggin
  2007-04-12  6:12         ` Nick Piggin
  0 siblings, 1 reply; 13+ messages in thread
From: Nick Piggin @ 2007-04-12  6:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Rik van Riel, linux-kernel, linux-mm, Ulrich Drepper

Eric Dumazet wrote:
> Rik van Riel wrote:
> 
>> Eric Dumazet wrote:
>>
>>> Rik van Riel wrote:
>>>
>>>> Make it possible for applications to have the kernel free memory
>>>> lazily.  This reduces a repeated free/malloc cycle from freeing
>>>> pages and allocating them, to just marking them freeable.  If the
>>>> application wants to reuse them before the kernel needs the memory,
>>>> not even a page fault will happen.
>>
>>
>>> I don't understand this last sentence. If not even a page fault
>>> happens, how does the kernel know that the page was eventually reused by
>>> the application and should not be freed in case of memory pressure?
>>
>>
>> Before maybe freeing the page, the kernel checks the referenced
>> and dirty bits of the page table entries mapping that page.
>>
>>> ptr = mmap(some space);
>>> madvise(ptr, length, MADV_FREE);
>>> /* kernel may free the pages */
>>
>>
>> All this call does is:
>> - clear the accessed and dirty bits
>> - move the page to the far end of the inactive list,
>>   where it will be the first to be reclaimed
>>
>>> sleep(10);
>>>
>>> /* what must the application do now before reusing the space? */
>>> memset(ptr, data, 10000);
>>> /* kernel should not free ptr[0..10000] now */
>>
>>
>> Two things can happen here.
>>
>> If this program used the pages before the kernel needed
>> them, the program will be reusing its old pages.
> 
> 
> ah ok, this is because accessed/dirty bits are set by hardware and not a 
> page fault.

No it isn't.

> Is it true for all architectures?

No.

-- 
SUSE Labs, Novell Inc.


* Re: [PATCH] make MADV_FREE lazily free memory
  2007-04-12  6:08       ` Nick Piggin
@ 2007-04-12  6:12         ` Nick Piggin
  2007-04-12  7:22           ` Rik van Riel
  0 siblings, 1 reply; 13+ messages in thread
From: Nick Piggin @ 2007-04-12  6:12 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Rik van Riel, linux-kernel, linux-mm, Ulrich Drepper

Nick Piggin wrote:
> Eric Dumazet wrote:
> 

>>> Two things can happen here.
>>>
>>> If this program used the pages before the kernel needed
>>> them, the program will be reusing its old pages.
>>
>>
>>
>> ah ok, this is because accessed/dirty bits are set by hardware and not 
>> a page fault.
> 
> 
> No it isn't.

That is to say, it isn't required for correctness. But if the
question was about avoiding a fault, then yes ;)

But as Linus recently said, even hardware handled faults still
take expensive microarchitectural traps.

> 
>> Is it true for all architectures?
> 
> 
> No.
> 


-- 
SUSE Labs, Novell Inc.


* Re: [PATCH] make MADV_FREE lazily free memory
  2007-04-12  6:12         ` Nick Piggin
@ 2007-04-12  7:22           ` Rik van Riel
  2007-04-12 13:14             ` Nick Piggin
  2007-04-16 16:10             ` Anton Blanchard
  0 siblings, 2 replies; 13+ messages in thread
From: Rik van Riel @ 2007-04-12  7:22 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Eric Dumazet, linux-kernel, linux-mm, Ulrich Drepper

Nick Piggin wrote:
> Nick Piggin wrote:
>> Eric Dumazet wrote:
>>
> 
>>>> Two things can happen here.
>>>>
>>>> If this program used the pages before the kernel needed
>>>> them, the program will be reusing its old pages.
>>>
>>>
>>>
>>> ah ok, this is because accessed/dirty bits are set by hardware and 
>>> not a page fault.
>>
>>
>> No it isn't.
> 
> That is to say, it isn't required for correctness. But if the
> question was about avoiding a fault, then yes ;)

Making the pte clean also needs to clear the hardware writable
bit on architectures where we do pte dirtying in software.

If we don't, we would have corruption problems all over the VM,
for example in the code around pte_clean_one :)
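
As a toy user-space model of what pte dirtying in software means here
(illustration only, not kernel code, and it ignores locking and TLB
flushing): stores are only allowed while the hardware write bit is set,
so cleaning the software dirty bit must also revoke that write bit, or
later stores would go unnoticed.

#include <stdio.h>

#define HW_WRITE	0x1	/* hardware write permission */
#define SW_DIRTY	0x2	/* dirty bit maintained by software */

static unsigned int pte;

static void write_fault(void)
{
	pte |= SW_DIRTY | HW_WRITE;	/* note the dirtying, allow the store */
}

static void cpu_store(void)
{
	if (!(pte & HW_WRITE))
		write_fault();		/* trap: software sees the write */
	/* the store itself happens here */
}

static void make_pte_clean(void)
{
	/* clearing SW_DIRTY alone would miss all future stores */
	pte &= ~(SW_DIRTY | HW_WRITE);
}

int main(void)
{
	cpu_store();			/* page becomes dirty */
	make_pte_clean();		/* what MADV_FREE has to do */
	cpu_store();			/* reuse: traps, marked dirty again */
	printf("dirty after reuse: %s\n", (pte & SW_DIRTY) ? "yes" : "no");
	return 0;
}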

> But as Linus recently said, even hardware handled faults still
> take expensive microarchitectural traps.

Nowhere near as expensive as a full page fault, though...

The lazy freeing is aimed at avoiding page faults on memory
that is freed and later realloced, which is quite a common
thing in many workloads.
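
A rough sketch of that kind of workload, counting minor faults with
getrusage() (buffer size and loop count are arbitrary, and whether
free() actually ends up calling madvise() is up to the malloc
implementation, e.g. Ulrich's test glibc):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

#define SZ	(8 * 1024 * 1024)
#define LOOPS	100

int main(void)
{
	struct rusage before, after;
	int i;

	getrusage(RUSAGE_SELF, &before);
	for (i = 0; i < LOOPS; i++) {
		char *p = malloc(SZ);	/* large chunk, freed right away */
		if (!p)
			return 1;
		memset(p, i, SZ);	/* dirty every page */
		free(p);		/* the allocator may MADV_FREE here */
	}
	getrusage(RUSAGE_SELF, &after);

	/* with lazy freeing, most iterations should add no faults */
	printf("minor faults: %ld\n", after.ru_minflt - before.ru_minflt);
	return 0;
}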

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: [PATCH] make MADV_FREE lazily free memory
  2007-04-12  7:22           ` Rik van Riel
@ 2007-04-12 13:14             ` Nick Piggin
  2007-04-12 20:58               ` Rik van Riel
  2007-04-16 16:10             ` Anton Blanchard
  1 sibling, 1 reply; 13+ messages in thread
From: Nick Piggin @ 2007-04-12 13:14 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Eric Dumazet, linux-kernel, linux-mm, Ulrich Drepper

Rik van Riel wrote:
> Nick Piggin wrote:
> 
>> Nick Piggin wrote:
>>
>>> Eric Dumazet wrote:

>>>> ah ok, this is because accessed/dirty bits are set by hardware and 
>>>> not a page fault.
>>>
>>>
>>>
>>> No it isn't.
>>
>>
>> That is to say, it isn't required for correctness. But if the
>> question was about avoiding a fault, then yes ;)
> 
> 
> Making the pte clean also needs to clear the hardware writable
> bit on architectures where we do pte dirtying in software.
> 
> If we don't, we would have corruption problems all over the VM,
> for example in the code around pte_clean_one :)

Sure. Hence why I say that having hardware set the a/d bits is not
required for correctness ;)

>> But as Linus recently said, even hardware handled faults still
>> take expensive microarchitectural traps.
> 
> 
> Nowhere near as expensive as a full page fault, though...

I don't doubt that. Do you know rough numbers?


> The lazy freeing is aimed at avoiding page faults on memory
> that is freed and later realloced, which is quite a common
> thing in many workloads.

I would be interested to see how it performs and what these
workloads look like, although we do need to fix the basic glibc and
madvise locking problems first.

The obvious concerns I have with the patch are complexity (versus
payoff), behaviour under reclaim, and behaviour when freed memory
isn't reallocated very quickly (e.g. degrading cache performance).

We'll see, I guess...

-- 
SUSE Labs, Novell Inc.


* Re: [PATCH] make MADV_FREE lazily free memory
  2007-04-12 13:14             ` Nick Piggin
@ 2007-04-12 20:58               ` Rik van Riel
  2007-04-13  0:34                 ` Nick Piggin
  0 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2007-04-12 20:58 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Eric Dumazet, linux-kernel, linux-mm, Ulrich Drepper

[-- Attachment #1: Type: text/plain, Size: 1042 bytes --]

Nick Piggin wrote:

>> The lazy freeing is aimed at avoiding page faults on memory
>> that is freed and later realloced, which is quite a common
>> thing in many workloads.
> 
> I would be interested to see how it performs and what these
> workloads look like, although we do need to fix the basic glibc and
> madvise locking problems first.

The attached graph shows the results of running the MySQL sysbench
workload on my quad-core system.  As you can see, performance
with #threads == #cpus (4) almost doubles, from 1070 transactions
per second to 2014 transactions/second.

On the high end (16 threads on 4 cpus), performance increases
from 778 transactions/second on vanilla to 1310 transactions/second.

I have also benchmarked running Ulrich's changed glibc on a vanilla
kernel, which gives results somewhere in-between, but much closer to
just the vanilla kernel.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

[-- Attachment #2: linux-2.6-madv_free.patch --]
[-- Type: text/x-patch, Size: 12014 bytes --]

--- linux-2.6.20.noarch/include/asm-alpha/mman.h.madvise	2007-04-04 16:44:50.000000000 -0400
+++ linux-2.6.20.noarch/include/asm-alpha/mman.h	2007-04-04 16:56:24.000000000 -0400
@@ -42,6 +42,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
 #define MADV_DONTNEED	6		/* don't need these pages */
+#define MADV_FREE	7		/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-generic/mman.h.madvise	2007-04-04 16:44:50.000000000 -0400
+++ linux-2.6.20.noarch/include/asm-generic/mman.h	2007-04-04 16:56:53.000000000 -0400
@@ -29,6 +29,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-mips/mman.h.madvise	2007-04-04 16:44:50.000000000 -0400
+++ linux-2.6.20.noarch/include/asm-mips/mman.h	2007-04-04 16:58:02.000000000 -0400
@@ -65,6 +65,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-parisc/mman.h.madvise	2007-04-04 16:44:50.000000000 -0400
+++ linux-2.6.20.noarch/include/asm-parisc/mman.h	2007-04-04 16:58:40.000000000 -0400
@@ -38,6 +38,7 @@
 #define MADV_SPACEAVAIL 5               /* insure that resources are reserved */
 #define MADV_VPS_PURGE  6               /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7              /* Inherit parents page size */
+#define MADV_FREE	8		/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-xtensa/mman.h.madvise	2007-04-04 16:44:51.000000000 -0400
+++ linux-2.6.20.noarch/include/asm-xtensa/mman.h	2007-04-04 16:59:14.000000000 -0400
@@ -72,6 +72,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/linux/mm_inline.h.madvise	2007-04-03 22:53:25.000000000 -0400
+++ linux-2.6.20.noarch/include/linux/mm_inline.h	2007-04-04 22:19:24.000000000 -0400
@@ -13,6 +13,13 @@ add_page_to_inactive_list(struct zone *z
 }
 
 static inline void
+add_page_to_inactive_list_tail(struct zone *zone, struct page *page)
+{
+	list_add_tail(&page->lru, &zone->inactive_list);
+	__inc_zone_state(zone, NR_INACTIVE);
+}
+
+static inline void
 del_page_from_active_list(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);
--- linux-2.6.20.noarch/include/linux/mm.h.madvise	2007-04-03 22:53:25.000000000 -0400
+++ linux-2.6.20.noarch/include/linux/mm.h	2007-04-04 22:06:45.000000000 -0400
@@ -716,6 +716,7 @@ struct zap_details {
 	pgoff_t last_index;			/* Highest page->index to unmap */
 	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
+	short madv_free;			/* MADV_FREE anonymous memory */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
--- linux-2.6.20.noarch/include/linux/page-flags.h.madvise	2007-04-03 22:54:58.000000000 -0400
+++ linux-2.6.20.noarch/include/linux/page-flags.h	2007-04-05 01:27:38.000000000 -0400
@@ -91,6 +91,8 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define PG_lazyfree		20	/* MADV_FREE potential throwaway */
+
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
 
@@ -237,6 +239,11 @@ static inline void SetPageUptodate(struc
 #define ClearPageReclaim(page)	clear_bit(PG_reclaim, &(page)->flags)
 #define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags)
 
+#define PageLazyFree(page)	test_bit(PG_lazyfree, &(page)->flags)
+#define SetPageLazyFree(page)	set_bit(PG_lazyfree, &(page)->flags)
+#define ClearPageLazyFree(page)	clear_bit(PG_lazyfree, &(page)->flags)
+#define __ClearPageLazyFree(page) __clear_bit(PG_lazyfree, &(page)->flags)
+
 #define PageCompound(page)	test_bit(PG_compound, &(page)->flags)
 #define __SetPageCompound(page)	__set_bit(PG_compound, &(page)->flags)
 #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
--- linux-2.6.20.noarch/include/linux/swap.h.madvise	2007-04-05 00:29:40.000000000 -0400
+++ linux-2.6.20.noarch/include/linux/swap.h	2007-04-06 17:19:20.000000000 -0400
@@ -181,6 +181,7 @@ extern unsigned int nr_free_pagecache_pa
 extern void FASTCALL(lru_cache_add(struct page *));
 extern void FASTCALL(lru_cache_add_active(struct page *));
 extern void FASTCALL(activate_page(struct page *));
+extern void FASTCALL(deactivate_tail_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
 extern int lru_add_drain_all(void);
--- linux-2.6.20.noarch/mm/madvise.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/madvise.c	2007-04-04 23:48:34.000000000 -0400
@@ -142,8 +142,12 @@ static long madvise_dontneed(struct vm_a
 			.last_index = ULONG_MAX,
 		};
 		zap_page_range(vma, start, end - start, &details);
-	} else
-		zap_page_range(vma, start, end - start, NULL);
+	} else {
+		struct zap_details details = {
+			.madv_free = 1,
+		};
+		zap_page_range(vma, start, end - start, &details);
+	}
 	return 0;
 }
 
@@ -209,7 +213,9 @@ madvise_vma(struct vm_area_struct *vma, 
 		error = madvise_willneed(vma, prev, start, end);
 		break;
 
+	/* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
 	case MADV_DONTNEED:
+	case MADV_FREE:
 		error = madvise_dontneed(vma, prev, start, end);
 		break;
 
--- linux-2.6.20.noarch/mm/memory.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/memory.c	2007-04-06 17:18:23.000000000 -0400
@@ -432,6 +432,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	int dirty = 0;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -466,6 +467,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags)) {
+		dirty = pte_dirty(pte);
 		ptep_set_wrprotect(src_mm, addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
@@ -483,6 +485,8 @@ copy_one_pte(struct mm_struct *dst_mm, s
 		get_page(page);
 		page_dup_rmap(page);
 		rss[!!PageAnon(page)]++;
+		if (dirty && PageLazyFree(page))
+			ClearPageLazyFree(page);
 	}
 
 out_set_pte:
@@ -661,6 +665,26 @@ static unsigned long zap_pte_range(struc
 				    (page->index < details->first_index ||
 				     page->index > details->last_index))
 					continue;
+
+				/*
+				 * MADV_FREE is used to lazily recycle
+				 * anon memory.  The process no longer
+				 * needs the data and wants to avoid IO.
+				 */
+				if (details->madv_free && PageAnon(page)) {
+					if (unlikely(PageSwapCache(page)) &&
+					    !TestSetPageLocked(page)) {
+						remove_exclusive_swap_page(page);
+						unlock_page(page);
+					}
+					/* Optimize this... */
+					ptep_clear_flush_dirty(vma, addr, pte);
+					ptep_clear_flush_young(vma, addr, pte);
+					SetPageLazyFree(page);
+					if (PageActive(page))
+						deactivate_tail_page(page);
+					continue;
+				}
 			}
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
@@ -689,7 +713,8 @@ static unsigned long zap_pte_range(struc
 		 * If details->check_mapping, we leave swap entries;
 		 * if details->nonlinear_vma, we leave file entries.
 		 */
-		if (unlikely(details))
+		if (unlikely(!details->check_mapping &&
+				!details->nonlinear_vma))
 			continue;
 		if (!pte_file(ptent))
 			free_swap_and_cache(pte_to_swp_entry(ptent));
@@ -755,7 +780,8 @@ static unsigned long unmap_page_range(st
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping && !details->nonlinear_vma)
+	if (details && !details->check_mapping && !details->nonlinear_vma
+			&& !details->madv_free)
 		details = NULL;
 
 	BUG_ON(addr >= end);
--- linux-2.6.20.noarch/mm/page_alloc.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/page_alloc.c	2007-04-05 01:27:55.000000000 -0400
@@ -203,6 +203,7 @@ static void bad_page(struct page *page)
 			1 << PG_slab    |
 			1 << PG_swapcache |
 			1 << PG_writeback |
+			1 << PG_lazyfree |
 			1 << PG_buddy );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
@@ -442,6 +443,8 @@ static inline int free_pages_check(struc
 		bad_page(page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
+	if (PageLazyFree(page))
+		__ClearPageLazyFree(page);
 	/*
 	 * For now, we report if PG_reserved was found set, but do not
 	 * clear it, and do not free the page.  But we shall soon need
@@ -588,6 +591,7 @@ static int prep_new_page(struct page *pa
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
+			1 << PG_lazyfree |
 			1 << PG_buddy ))))
 		bad_page(page);
 
--- linux-2.6.20.noarch/mm/rmap.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/rmap.c	2007-04-04 23:53:29.000000000 -0400
@@ -656,7 +656,17 @@ static int try_to_unmap_one(struct page 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
 
-	if (PageAnon(page)) {
+	/* MADV_FREE is used to lazily free memory from userspace. */
+	if (PageLazyFree(page) && !migration) {
+		/* There is new data in the page.  Reinstate it. */
+		if (unlikely(pte_dirty(pteval))) {
+			set_pte_at(mm, address, pte, pteval);
+			ret = SWAP_FAIL;
+			goto out_unmap;
+		}
+		/* Throw the page away. */
+		dec_mm_counter(mm, anon_rss);
+	} else if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
 
 		if (PageSwapCache(page)) {
--- linux-2.6.20.noarch/mm/swap.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/swap.c	2007-04-04 23:33:27.000000000 -0400
@@ -151,6 +151,20 @@ void fastcall activate_page(struct page 
 	spin_unlock_irq(&zone->lru_lock);
 }
 
+void fastcall deactivate_tail_page(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock_irq(&zone->lru_lock);
+	if (PageLRU(page) && PageActive(page)) {
+		del_page_from_active_list(zone, page);
+		ClearPageActive(page);
+		add_page_to_inactive_list_tail(zone, page);
+		__count_vm_event(PGDEACTIVATE);
+	}
+	spin_unlock_irq(&zone->lru_lock);
+}
+
 /*
  * Mark a page as having seen activity.
  *
--- linux-2.6.20.noarch/mm/vmscan.c.madvise	2007-04-03 21:53:47.000000000 -0400
+++ linux-2.6.20.noarch/mm/vmscan.c	2007-04-04 03:34:56.000000000 -0400
@@ -473,6 +473,24 @@ static unsigned long shrink_page_list(st
 
 		sc->nr_scanned++;
 
+		/* 
+		 * MADV_DONTNEED pages get reclaimed lazily, unless the
+		 * process reuses it before we get to it.
+		 */
+		if (PageLazyFree(page)) {
+			switch (try_to_unmap(page, 0)) {
+			case SWAP_FAIL:
+				ClearPageLazyFree(page);
+				goto activate_locked;
+			case SWAP_AGAIN:
+				ClearPageLazyFree(page);
+				goto keep_locked;
+			case SWAP_SUCCESS:
+				ClearPageLazyFree(page);
+				goto free_it;
+			}
+		}
+
 		if (!sc->may_swap && page_mapped(page))
 			goto keep_locked;
 

[-- Attachment #3: mysql.png --]
[-- Type: image/png, Size: 5126 bytes --]


* Re: [PATCH] make MADV_FREE lazily free memory
  2007-04-12 20:58               ` Rik van Riel
@ 2007-04-13  0:34                 ` Nick Piggin
  0 siblings, 0 replies; 13+ messages in thread
From: Nick Piggin @ 2007-04-13  0:34 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Eric Dumazet, linux-kernel, linux-mm, Ulrich Drepper

Rik van Riel wrote:
> Nick Piggin wrote:
> 
>>> The lazy freeing is aimed at avoiding page faults on memory
>>> that is freed and later realloced, which is quite a common
>>> thing in many workloads.
>>
>>
>> I would be interested to see how it performs and what these
>> workloads look like, although we do need to fix the basic glibc and
>> madvise locking problems first.
> 
> 
> The attached graph are results of running the MySQL sysbench
> workload on my quad core system.  As you can see, performance
> with #threads == #cpus (4) almost doubles from 1070 transactions
> per second to 2014 transactions/second.
> 
> On the high end (16 threads on 4 cpus), performance increases
> from 778 transactions/second on vanilla to 1310 transactions/second.
> 
> I have also benchmarked running Ulrich's changed glibc on a vanilla
> kernel, which gives results somewhere in-between, but much closer to
> just the vanilla kernel.

Looks like the idle time issue is still biting for those guys.

Hmm, maybe MySQL is actually _touching_ the memory inside a more
critical lock, so the faults get tangled up on mmap_sem there. I
wonder if making malloc call memset right afterwards would hide
that ;) Or the madvise exclusive mmap_sem avoidance.

Seems like with perfect scaling we should get to the 2400 mark.
It would be nice to be able to not degrade under load. Of course
some of that will be MySQL scaling issues.

-- 
SUSE Labs, Novell Inc.


* Re: [PATCH] make MADV_FREE lazily free memory
  2007-04-12  7:22           ` Rik van Riel
  2007-04-12 13:14             ` Nick Piggin
@ 2007-04-16 16:10             ` Anton Blanchard
  2007-04-16 16:30               ` Jakub Jelinek
  1 sibling, 1 reply; 13+ messages in thread
From: Anton Blanchard @ 2007-04-16 16:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Nick Piggin, Eric Dumazet, linux-kernel, linux-mm, Ulrich Drepper

 
Hi,

> Making the pte clean also needs to clear the hardware writable
> bit on architectures where we do pte dirtying in software.
> 
> If we don't, we would have corruption problems all over the VM,
> for example in the code around pte_clean_one :)
> 
> >But as Linus recently said, even hardware handled faults still
> >take expensive microarchitectural traps.
> 
> Nowhere near as expensive as a full page fault, though...

Unfortunately it will be expensive on architectures that have software-managed
referenced and changed bits. It would be great if we could just leave them
dirty in the pagetables and transition between the clean and dirty states
via madvise calls, but that's just wishful thinking on my part :)

Anton


* Re: [PATCH] make MADV_FREE lazily free memory
  2007-04-16 16:10             ` Anton Blanchard
@ 2007-04-16 16:30               ` Jakub Jelinek
  2007-04-16 18:39                 ` Anton Blanchard
  0 siblings, 1 reply; 13+ messages in thread
From: Jakub Jelinek @ 2007-04-16 16:30 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Rik van Riel, Nick Piggin, Eric Dumazet, linux-kernel, linux-mm,
	Ulrich Drepper

On Mon, Apr 16, 2007 at 11:10:39AM -0500, Anton Blanchard wrote:
> > Making the pte clean also needs to clear the hardware writable
> > bit on architectures where we do pte dirtying in software.
> > 
> > If we don't, we would have corruption problems all over the VM,
> > for example in the code around pte_clean_one :)
> > 
> > >But as Linus recently said, even hardware handled faults still
> > >take expensive microarchitectural traps.
> > 
> > Nowhere near as expensive as a full page fault, though...
> 
> Unfortunately it will be expensive on architectures that have software-managed
> referenced and changed bits. It would be great if we could just leave them
> dirty in the pagetables and transition between the clean and dirty states
> via madvise calls, but that's just wishful thinking on my part :)

That would mean an additional syscall.  Furthermore, if you allocate a big
chunk of memory, dirty it, free it (with madvise(MADV_FREE)) and soon
allocate the same size of memory again, it is better to start with
non-dirty memory: it might be that this time you don't modify a big part
of the chunk, for example.  If all that memory were kept dirty all the
time and just marked/unmarked for lazy reuse with MADV_FREE/MADV_UNDO_FREE,
all of it would need to be saved to disk when paging out, because it was
marked dirty, while with Rik's current MADV_FREE that will happen only for
pages that were actually dirtied after the last malloc.
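
To make that concrete, a rough sketch of the partial-reuse case (sizes
are arbitrary, MADV_UNDO_FREE is purely hypothetical, and MADV_FREE
falls back to MADV_DONTNEED just as the current patch aliases them):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <string.h>

#ifndef MADV_FREE
#define MADV_FREE MADV_DONTNEED		/* aliased for now, see the patch */
#endif

#define CHUNK	(64 * 1024 * 1024)

int main(void)
{
	char *p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	memset(p, 1, CHUNK);		/* first use: all 64MB dirtied */
	madvise(p, CHUNK, MADV_FREE);	/* "free": pages are clean again,
					   reclaimable without swap I/O */
	memset(p, 2, 4 * 1024 * 1024);	/* reuse only the first 4MB */

	/*
	 * Under memory pressure only the re-dirtied 4MB would have to be
	 * written to swap; the other 60MB can simply be dropped.  If the
	 * whole chunk had stayed marked dirty (the MADV_UNDO_FREE idea),
	 * all 64MB would be writeback candidates.
	 */
	munmap(p, CHUNK);
	return 0;
}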

	Jakub


* Re: [PATCH] make MADV_FREE lazily free memory
  2007-04-16 16:30               ` Jakub Jelinek
@ 2007-04-16 18:39                 ` Anton Blanchard
  0 siblings, 0 replies; 13+ messages in thread
From: Anton Blanchard @ 2007-04-16 18:39 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Rik van Riel, Nick Piggin, Eric Dumazet, linux-kernel, linux-mm,
	Ulrich Drepper

 
Hi Jakub,

> That would mean an additional syscall.  Furthermore, if you allocate a big
> chunk of memory, dirty it, then free (with madvise (MADV_FREE)) it and soon
> allocate the same size of memory again, it is better to start that with
> non-dirty memory, it might be that this time you e.g. don't modify a big
> part of the chunk.  If all that memory was kept dirty all the time and
> just marked/unmarked for lazy reuse with MADV_FREE/MADV_UNDO_FREE, all that
> memory would need to be saved to disk when paging out as it was marked
> dirty, while with current Rik's MADV_FREE that will happen only for pages
> that were actually dirtied after the last malloc.

Yep, this all makes sense. I was looking at it from the other angle, where
on some workloads we have to force malloc to use brk for best
performance. I'm sure the MADV_FREE changes will close that gap, but it
would be interesting to see if there is still a gap on the problem
workloads. Maybe I'm worrying about nothing.
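
For what it's worth, one common way of doing that forcing, assuming
glibc's mallopt() tunables are available, is to raise the mmap threshold
so large allocations stay on the brk heap, roughly:

#include <malloc.h>

int main(void)
{
	/*
	 * Serve allocations up to 64MB from the brk heap instead of
	 * individual mmap()s, so freed memory can be reused in place.
	 */
	mallopt(M_MMAP_THRESHOLD, 64 * 1024 * 1024);

	/* ... application workload ... */
	return 0;
}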

Anton
