linux-mm.kvack.org archive mirror
* removing flush_tlb_mm as a generic hook ?
@ 2007-07-09  3:47 Benjamin Herrenschmidt
  2007-07-09  6:36 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-09  3:47 UTC (permalink / raw)
  To: linux-mm; +Cc: Linux Kernel list

Hi folks !

While toying around with various MM callbacks, I found out that
flush_tlb_mm() as a generic hook provided by the archs has been mostly
obsoleted by the mmu_gather stuff.

(I'm not talking about archs internally wanting to implement it and use
it as a tlb_flush(), I'm talking about possibly making that optional :-)
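
For reference, the mmu_gather pattern I'm talking about looks roughly
like this (illustrative sketch only, not an actual call site; the
details vary between unmap_vmas(), zap_page_range() etc):

	struct mmu_gather *tlb;

	tlb = tlb_gather_mmu(mm, 0);	/* start a batch, 0 = not a full-mm teardown */

	/* ... walk the page tables; for each PTE we change or clear: */
	tlb_remove_tlb_entry(tlb, ptep, addr);	/* queue the TLB invalidation */
	tlb_remove_page(tlb, page);		/* optionally queue the page for freeing */

	tlb_finish_mmu(tlb, start, end);	/* flush the whole batch in one go */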

I see two remaining users:

 - fs/proc/task_mmu.c, which I easily converted to use the mmu_gather
(I'll send a patch if people agree it's worth doing)

 - kernel/fork.c uses it to flush the "old" mm. That's the "meat".

I wonder if it's worth pursuing, that is, converting copy_page_range to
use an mmu_gather on the source instead of using flush_tlb_mm. It might
allow archs that can't just "flush all" easily, but have to go through
every PTE individually, to improve things a bit on fork, and it would
allow them to remove the flush_tlb_mm() logic.

There is one reason why it's not a trivial conversion, though:
copy_page_range() calls copy_hugetlb_page_range() for huge pages, and
I'm not sure about mixing up the hugetlb stuff with the mmu_gather
stuff. I need to do a bit more code auditing to figure out whether
that's an ok thing to do.

Nothing very urgent or important, it's just that one less hook seems
like a good idea ;-)

Cheers,
Ben.



* Re: removing flush_tlb_mm as a generic hook ?
  2007-07-09  3:47 removing flush_tlb_mm as a generic hook ? Benjamin Herrenschmidt
@ 2007-07-09  6:36 ` Benjamin Herrenschmidt
  2007-07-09  6:45   ` [RFC/PATCH] Use mmu_gather for fork() instead of flush_tlb_mm() Benjamin Herrenschmidt
  2007-07-09  6:46   ` [RFC/PATCH] Use mmu_gather for /proc stuff " Benjamin Herrenschmidt
  0 siblings, 2 replies; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-09  6:36 UTC (permalink / raw)
  To: linux-mm; +Cc: Linux Kernel list

On Mon, 2007-07-09 at 13:47 +1000, Benjamin Herrenschmidt wrote:
> Hi folks !
> 
> While toying around with various MM callbacks, I found out that
> flush_tlb_mm() as a generic hook provided by the archs has been mostly
> obsoleted by the mmu_gather stuff.

And since life is always better with patches... here are two that
do fork and fs/proc/task_mmu.c. There should be an improvement on archs
like hash-table based ppc32, where flush_tlb_mm() currently has to walk
the page tables, which means an additional walk pass in fork. With this
patch, there will be only one pass, and it will only hit the pages that
have actually been marked RO.

I need to do some proper testing, but in reply to this I'm posting the
patches anyway for review / comments.

Ben.



* [RFC/PATCH] Use mmu_gather for fork() instead of flush_tlb_mm()
  2007-07-09  6:36 ` Benjamin Herrenschmidt
@ 2007-07-09  6:45   ` Benjamin Herrenschmidt
  2007-07-09  7:39     ` Nick Piggin
  2007-07-09  6:46   ` [RFC/PATCH] Use mmu_gather for /proc stuff " Benjamin Herrenschmidt
  1 sibling, 1 reply; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-09  6:45 UTC (permalink / raw)
  To: linux-mm; +Cc: Linux Kernel list

Use mmu_gather for fork() instead of flush_tlb_mm()

This patch uses an mmu_gather for copying page tables instead of
flush_tlb_mm(). This allows archs like ppc32 with hash table to
avoid walking the page tables a second time to invalidate hash
entries, and to only flush PTEs that have actually been changed
from RW to RO.

Note that this contains a small change to the mmu gather stuff:
it must not call free_pages_and_swap_cache() if no pages have been
queued up for freeing (if we are only invalidating PTEs). Calling
it on fork can deadlock (I haven't dug into why, but it looks like a
good idea to test for this anyway if we're going to use the mmu_gather
for more than just removing pages).

If the patch gets accepted, I will split that bit from the rest
of the patch and send it separately.

The main possible issue I see is with huge pages. Arch code might
have relied on flush_tlb_mm() and might not cope with
tlb_remove_tlb_entry() called for huge PTEs.

Other possible issues are if archs make assumptions about
flush_tlb_mm() being called in fork for different unrelated reasons.

Ah, also: we could probably improve the tracking of start/end. In
the case of lock breaking, the outside function will still finish
the batch with the entire range. It doesn't matter on ppc and x86,
I think, though.

Index: linux-work/include/linux/hugetlb.h
===================================================================
--- linux-work.orig/include/linux/hugetlb.h	2007-07-09 16:17:04.000000000 +1000
+++ linux-work/include/linux/hugetlb.h	2007-07-09 16:26:38.000000000 +1000
@@ -15,7 +15,7 @@ static inline int is_vm_hugetlb_page(str
 }
 
 int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
-int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
+int copy_hugetlb_page_range(struct mmu_gather **tlbp, struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
 int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
 void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
 void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
@@ -107,7 +107,7 @@ static inline unsigned long hugetlb_tota
 
 #define follow_hugetlb_page(m,v,p,vs,a,b,i)	({ BUG(); 0; })
 #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
-#define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
+#define copy_hugetlb_page_range(tlbp, src, dst, vma)	({ BUG(); 0; })
 #define hugetlb_prefault(mapping, vma)		({ BUG(); 0; })
 #define unmap_hugepage_range(vma, start, end)	BUG()
 #define hugetlb_report_meminfo(buf)		0
Index: linux-work/include/linux/mm.h
===================================================================
--- linux-work.orig/include/linux/mm.h	2007-07-09 16:17:04.000000000 +1000
+++ linux-work/include/linux/mm.h	2007-07-09 16:26:38.000000000 +1000
@@ -748,8 +748,8 @@ void free_pgd_range(struct mmu_gather **
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
-int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma);
+int copy_page_range(struct mmu_gather **tlbp, struct mm_struct *dst,
+		    struct mm_struct *src, struct vm_area_struct *vma);
 int zeromap_page_range(struct vm_area_struct *vma, unsigned long from,
 			unsigned long size, pgprot_t prot);
 void unmap_mapping_range(struct address_space *mapping,
Index: linux-work/kernel/fork.c
===================================================================
--- linux-work.orig/kernel/fork.c	2007-07-09 16:17:04.000000000 +1000
+++ linux-work/kernel/fork.c	2007-07-09 16:26:38.000000000 +1000
@@ -56,6 +56,7 @@
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
+#include <asm/tlb.h>
 
 /*
  * Protected counters by write_lock_irq(&tasklist_lock)
@@ -199,8 +200,9 @@ static inline int dup_mmap(struct mm_str
 	struct vm_area_struct *mpnt, *tmp, **pprev;
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
-	unsigned long charge;
+	unsigned long charge, tlb_start, tlb_end;
 	struct mempolicy *pol;
+	struct mmu_gather *tlb;
 
 	down_write(&oldmm->mmap_sem);
 	flush_cache_dup_mm(oldmm);
@@ -220,6 +222,10 @@ static inline int dup_mmap(struct mm_str
 	rb_link = &mm->mm_rb.rb_node;
 	rb_parent = NULL;
 	pprev = &mm->mmap;
+	tlb_start = -1;
+	tlb_end = 0;
+
+	tlb = tlb_gather_mmu(oldmm, 1);
 
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
 		struct file *file;
@@ -242,6 +248,11 @@ static inline int dup_mmap(struct mm_str
 		if (!tmp)
 			goto fail_nomem;
 		*tmp = *mpnt;
+
+		if (unlikely(tlb_start == -1))
+			tlb_start = mpnt->vm_start;
+		tlb_end = mpnt->vm_end;
+
 		pol = mpol_copy(vma_policy(mpnt));
 		retval = PTR_ERR(pol);
 		if (IS_ERR(pol))
@@ -278,7 +289,7 @@ static inline int dup_mmap(struct mm_str
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, oldmm, mpnt);
+		retval = copy_page_range(&tlb, mm, oldmm, mpnt);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
@@ -291,12 +302,15 @@ static inline int dup_mmap(struct mm_str
 	retval = 0;
 out:
 	up_write(&mm->mmap_sem);
-	flush_tlb_mm(oldmm);
+	if (tlb && tlb_start < tlb_end)
+		tlb_finish_mmu(tlb, tlb_start, tlb_end);
 	up_write(&oldmm->mmap_sem);
 	return retval;
 fail_nomem_policy:
 	kmem_cache_free(vm_area_cachep, tmp);
 fail_nomem:
+	if (tlb && tlb_start < tlb_end)
+		tlb_finish_mmu(tlb, tlb_start, tlb_end);
 	retval = -ENOMEM;
 	vm_unacct_memory(charge);
 	goto out;
Index: linux-work/mm/hugetlb.c
===================================================================
--- linux-work.orig/mm/hugetlb.c	2007-07-09 16:17:04.000000000 +1000
+++ linux-work/mm/hugetlb.c	2007-07-09 16:26:38.000000000 +1000
@@ -333,8 +333,8 @@ static void set_huge_ptep_writable(struc
 }
 
 
-int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
-			    struct vm_area_struct *vma)
+int copy_hugetlb_page_range(struct mmu_gather **tlbp, struct mm_struct *dst,
+			    struct mm_struct *src, struct vm_area_struct *vma)
 {
 	pte_t *src_pte, *dst_pte, entry;
 	struct page *ptepage;
@@ -353,8 +353,10 @@ int copy_hugetlb_page_range(struct mm_st
 		spin_lock(&dst->page_table_lock);
 		spin_lock(&src->page_table_lock);
 		if (!pte_none(*src_pte)) {
-			if (cow)
+			if (cow) {
 				ptep_set_wrprotect(src, addr, src_pte);
+				tlb_remove_tlb_entry((*tlbp), src_pte, addr);
+			}
 			entry = *src_pte;
 			ptepage = pte_page(entry);
 			get_page(ptepage);
@@ -363,6 +365,7 @@ int copy_hugetlb_page_range(struct mm_st
 		spin_unlock(&src->page_table_lock);
 		spin_unlock(&dst->page_table_lock);
 	}
+
 	return 0;
 
 nomem:
Index: linux-work/mm/memory.c
===================================================================
--- linux-work.orig/mm/memory.c	2007-07-09 16:17:04.000000000 +1000
+++ linux-work/mm/memory.c	2007-07-09 16:34:54.000000000 +1000
@@ -425,9 +425,9 @@ struct page *vm_normal_page(struct vm_ar
  */
 
 static inline void
-copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-		unsigned long addr, int *rss)
+copy_one_pte(struct mmu_gather *tlb, struct mm_struct *dst_mm,
+	     struct mm_struct *src_mm, pte_t *dst_pte, pte_t *src_pte,
+	     struct vm_area_struct *vma, unsigned long addr, int *rss)
 {
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
@@ -466,8 +466,11 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags)) {
+		pte_t old = *src_pte;
 		ptep_set_wrprotect(src_mm, addr, src_pte);
 		pte = pte_wrprotect(pte);
+		if (tlb && !pte_same(old, *src_pte))
+			tlb_remove_tlb_entry(tlb, src_pte, addr);
 	}
 
 	/*
@@ -489,13 +492,15 @@ out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
 }
 
-static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+static int copy_pte_range(struct mmu_gather **tlbp,
+		struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end)
 {
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
-	int progress = 0;
+	unsigned long start_addr = addr;
+	int fullmm, progress = 0;
 	int rss[2];
 
 again:
@@ -524,7 +529,8 @@ again:
 			progress++;
 			continue;
 		}
-		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
+		copy_one_pte(*tlbp, dst_mm, src_mm, dst_pte, src_pte,
+			     vma, addr, rss);
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
@@ -533,13 +539,19 @@ again:
 	pte_unmap_nested(src_pte - 1);
 	add_mm_rss(dst_mm, rss[0], rss[1]);
 	pte_unmap_unlock(dst_pte - 1, dst_ptl);
+	fullmm = (*tlbp)->fullmm;
+	tlb_finish_mmu(*tlbp, start_addr, addr);
 	cond_resched();
-	if (addr != end)
+	if (addr != end) {
+		*tlbp = tlb_gather_mmu(src_mm, fullmm);
+		start_addr = addr;
 		goto again;
+	}
 	return 0;
 }
 
-static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+static inline int copy_pmd_range(struct mmu_gather **tlbp,
+		struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end)
 {
@@ -554,14 +566,15 @@ static inline int copy_pmd_range(struct 
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
-		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
-						vma, addr, next))
+		if (copy_pte_range(tlbp, dst_mm, src_mm, dst_pmd, src_pmd,
+				   vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }
 
-static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+static inline int copy_pud_range(struct mmu_gather **tlbp,
+		struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end)
 {
@@ -576,15 +589,15 @@ static inline int copy_pud_range(struct 
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(src_pud))
 			continue;
-		if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
-						vma, addr, next))
+		if (copy_pmd_range(tlbp, dst_mm, src_mm, dst_pud, src_pud,
+				   vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pud++, src_pud++, addr = next, addr != end);
 	return 0;
 }
 
-int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		struct vm_area_struct *vma)
+int copy_page_range(struct mmu_gather **tlbp, struct mm_struct *dst_mm,
+		    struct mm_struct *src_mm, struct vm_area_struct *vma)
 {
 	pgd_t *src_pgd, *dst_pgd;
 	unsigned long next;
@@ -603,7 +616,7 @@ int copy_page_range(struct mm_struct *ds
 	}
 
 	if (is_vm_hugetlb_page(vma))
-		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+		return copy_hugetlb_page_range(tlbp, dst_mm, src_mm, vma);
 
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
@@ -611,8 +624,8 @@ int copy_page_range(struct mm_struct *ds
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
-		if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
-						vma, addr, next))
+		if (copy_pud_range(tlbp, dst_mm, src_mm, dst_pgd, src_pgd,
+				   vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 	return 0;
Index: linux-work/include/asm-generic/tlb.h
===================================================================
--- linux-work.orig/include/asm-generic/tlb.h	2007-07-09 16:17:04.000000000 +1000
+++ linux-work/include/asm-generic/tlb.h	2007-07-09 16:26:38.000000000 +1000
@@ -72,7 +72,7 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 		return;
 	tlb->need_flush = 0;
 	tlb_flush(tlb);
-	if (!tlb_fast_mode(tlb)) {
+	if (!tlb_fast_mode(tlb) && tlb->nr) {
 		free_pages_and_swap_cache(tlb->pages, tlb->nr);
 		tlb->nr = 0;
 	}



* [RFC/PATCH] Use mmu_gather for /proc stuff instead of flush_tlb_mm()
  2007-07-09  6:36 ` Benjamin Herrenschmidt
  2007-07-09  6:45   ` [RFC/PATCH] Use mmu_gather for fork() instead of flush_tlb_mm() Benjamin Herrenschmidt
@ 2007-07-09  6:46   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-09  6:46 UTC (permalink / raw)
  To: linux-mm; +Cc: Linux Kernel list

Use mmu_gather for fs/proc/task_mmu.c

This removes the use of flush_tlb_mm() from that proc file, using
an mmu_gather instead.

No signed-off yet, not to be merged at this point.


Index: linux-work/fs/proc/task_mmu.c
===================================================================
--- linux-work.orig/fs/proc/task_mmu.c	2007-07-09 16:27:09.000000000 +1000
+++ linux-work/fs/proc/task_mmu.c	2007-07-09 16:27:47.000000000 +1000
@@ -9,7 +9,7 @@
 
 #include <asm/elf.h>
 #include <asm/uaccess.h>
-#include <asm/tlbflush.h>
+#include <asm/tlb.h>
 #include "internal.h"
 
 char *task_mem(struct mm_struct *mm, char *buffer)
@@ -124,11 +124,13 @@ struct mem_size_stats
 	unsigned long referenced;
 };
 
+typedef void (*pmd_action_t)(struct mmu_gather *tlb, struct vm_area_struct *,
+			     pmd_t *, unsigned long, unsigned long, void *);
 struct pmd_walker {
+	struct mmu_gather *tlb;
 	struct vm_area_struct *vma;
 	void *private;
-	void (*action)(struct vm_area_struct *, pmd_t *, unsigned long,
-		       unsigned long, void *);
+	pmd_action_t action;
 };
 
 static int show_map_internal(struct seq_file *m, void *v, struct mem_size_stats *mss)
@@ -218,7 +220,8 @@ static int show_map(struct seq_file *m, 
 	return show_map_internal(m, v, NULL);
 }
 
-static void smaps_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
+static void smaps_pte_range(struct mmu_gather *tlb,
+			    struct vm_area_struct *vma, pmd_t *pmd,
 			    unsigned long addr, unsigned long end,
 			    void *private)
 {
@@ -258,7 +261,8 @@ static void smaps_pte_range(struct vm_ar
 	cond_resched();
 }
 
-static void clear_refs_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
+static void clear_refs_pte_range(struct mmu_gather *tlb,
+				 struct vm_area_struct *vma, pmd_t *pmd,
 				 unsigned long addr, unsigned long end,
 				 void *private)
 {
@@ -279,6 +283,8 @@ static void clear_refs_pte_range(struct 
 		/* Clear accessed and referenced bits. */
 		ptep_test_and_clear_young(vma, addr, pte);
 		ClearPageReferenced(page);
+		if (tlb)
+			tlb_remove_tlb_entry(tlb, pte, addr);
 	}
 	pte_unmap_unlock(pte - 1, ptl);
 	cond_resched();
@@ -295,7 +301,8 @@ static inline void walk_pmd_range(struct
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		walker->action(walker->vma, pmd, addr, next, walker->private);
+		walker->action(walker->tlb, walker->vma, pmd, addr, next,
+			       walker->private);
 	}
 }
 
@@ -323,11 +330,9 @@ static inline void walk_pud_range(struct
  * Recursively walk the page table for the memory area in a VMA, calling
  * a callback for every bottom-level (PTE) page table.
  */
-static inline void walk_page_range(struct vm_area_struct *vma,
-				   void (*action)(struct vm_area_struct *,
-						  pmd_t *, unsigned long,
-						  unsigned long, void *),
-				   void *private)
+static inline void walk_page_range(struct mmu_gather *tlb,
+				   struct vm_area_struct *vma,
+				   pmd_action_t	action, void *private)
 {
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
@@ -335,6 +340,7 @@ static inline void walk_page_range(struc
 		.vma		= vma,
 		.private	= private,
 		.action		= action,
+		.tlb		= tlb,
 	};
 	pgd_t *pgd;
 	unsigned long next;
@@ -355,19 +361,25 @@ static int show_smap(struct seq_file *m,
 
 	memset(&mss, 0, sizeof mss);
 	if (vma->vm_mm && !is_vm_hugetlb_page(vma))
-		walk_page_range(vma, smaps_pte_range, &mss);
+		walk_page_range(NULL, vma, smaps_pte_range, &mss);
 	return show_map_internal(m, v, &mss);
 }
 
 void clear_refs_smap(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
+	struct mmu_gather *tlb;
+	unsigned long end_addr = 0;
 
 	down_read(&mm->mmap_sem);
+	tlb = tlb_gather_mmu(mm, 0);
 	for (vma = mm->mmap; vma; vma = vma->vm_next)
-		if (vma->vm_mm && !is_vm_hugetlb_page(vma))
-			walk_page_range(vma, clear_refs_pte_range, NULL);
-	flush_tlb_mm(mm);
+		if (vma->vm_mm && !is_vm_hugetlb_page(vma)) {
+			end_addr = max(vma->vm_end, end_addr);
+			walk_page_range(tlb, vma, clear_refs_pte_range, NULL);
+		}
+	if (tlb)
+		tlb_finish_mmu(tlb, 0, end_addr);
 	up_read(&mm->mmap_sem);
 }
 



* Re: [RFC/PATCH] Use mmu_gather for fork() instead of flush_tlb_mm()
  2007-07-09  6:45   ` [RFC/PATCH] Use mmu_gather for fork() instead of flush_tlb_mm() Benjamin Herrenschmidt
@ 2007-07-09  7:39     ` Nick Piggin
  2007-07-09  9:12       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 12+ messages in thread
From: Nick Piggin @ 2007-07-09  7:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-mm, Linux Kernel list

Benjamin Herrenschmidt wrote:
> Use mmu_gather for fork() instead of flush_tlb_mm()
> 
> This patch uses an mmu_gather for copying page tables instead of
> flush_tlb_mm(). This allows archs like ppc32 with hash table to
> avoid walking the page tables a second time to invalidate hash
> entries, and to only flush PTEs that have actually been changed
> from RW to RO.
> 
> Note that this contains a small change to the mmu gather stuff:
> it must not call free_pages_and_swap_cache() if no pages have been
> queued up for freeing (if we are only invalidating PTEs). Calling
> it on fork can deadlock (I haven't dug into why, but it looks like a
> good idea to test for this anyway if we're going to use the mmu_gather
> for more than just removing pages).
> 
> If the patch gets accepted, I will split that bit from the rest
> of the patch and send it separately.
> 
> The main possible issue I see is with huge pages. Arch code might
> have relied on flush_tlb_mm() and might not cope with
> tlb_remove_tlb_entry() called for huge PTEs.
> 
> Other possible issues are if archs make assumptions about
> flush_tlb_mm() being called in fork for different unrelated reasons.
> 
> Ah, also: we could probably improve the tracking of start/end. In
> the case of lock breaking, the outside function will still finish
> the batch with the entire range. It doesn't matter on ppc and x86,
> I think, though.

Would it be better off to start off with a new API for this? The
mmu gather I think is traditionally entirely for dealing with
page removal...

-- 
SUSE Labs, Novell Inc.


* Re: [RFC/PATCH] Use mmu_gather for fork() instead of flush_tlb_mm()
  2007-07-09  7:39     ` Nick Piggin
@ 2007-07-09  9:12       ` Benjamin Herrenschmidt
  2007-07-09  9:29         ` Nick Piggin
  0 siblings, 1 reply; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-09  9:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, Linux Kernel list

On Mon, 2007-07-09 at 17:39 +1000, Nick Piggin wrote:

> Would it be better off to start off with a new API for this? The
> mmu gather I think is traditionally entirely for dealing with
> page removal...

It would be weird because the new API would mostly duplicate this one,
and we would end up with duplicated hooks.

I think it's fine to have one mmu_gather construct to gather changes to
PTEs; it doesn't have to contain freed pages, though it can. Apart from
that simple nr test, it's entirely the same code, and the existing
implementation for all archs should just work (well, should, I haven't
actually looked at it in detail yet :-)

Maybe we can use a separate call from tlb_remove_tlb_entry() though,
something like tlb_invalidate_entry(), which by default would do the
same thing but which archs can override if they want to distinguish
page freeing from simple invalidations at that level.

That means adding a suitable default __tlb_invalidate_entry() to all
archs, but that shouldn't be too hard with a bit of help from the
various maintainers.
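
Roughly, the asm-generic side could be as simple as this (sketch only,
mirroring how tlb_remove_tlb_entry() is built today; the names are the
ones proposed above, nothing of this exists yet, and exactly where the
default lives is hand-waved):

/* asm-generic part: mark the batch dirty and hand off to the arch */
#define tlb_invalidate_entry(tlb, ptep, address)		\
	do {							\
		tlb->need_flush = 1;				\
		__tlb_invalidate_entry(tlb, ptep, address);	\
	} while (0)

/* trivial default for archs that don't care about the distinction */
#define __tlb_invalidate_entry(tlb, ptep, address)		\
	__tlb_remove_tlb_entry(tlb, ptep, address)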

But I think the rest of the mmu_gather interface should stay the same.

I would like to add a few more things to it next, mostly:

 - tlb_gather_lockdrop() (or find a better name) called just before we
drop the page table / PTE lock. That would allow me to bring back ppc64
to use the normal mmu_gather API instead of hijacking the lazy mmu stuff
as it's doing now by flushing my batches before the lock is dropped.

 - start moving over pretty much everything that walks page tables to it

So that in the end, we basically go down to:

 - flush_tlb_page() for single page invalidates (protection faults for
example)

 - mmu_gather batches for everything else userland

 - possibly stick to something else for kernel mappings, to be
discussed. I'm fine with doing batches there too :-)

The current situation is just too messy imho, and generalizing batches
will be useful to platforms like hash table ppc's or funky TLBs.

Ben.




* Re: [RFC/PATCH] Use mmu_gather for fork() instead of flush_tlb_mm()
  2007-07-09  9:12       ` Benjamin Herrenschmidt
@ 2007-07-09  9:29         ` Nick Piggin
  2007-07-09  9:47           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 12+ messages in thread
From: Nick Piggin @ 2007-07-09  9:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-mm, Linux Kernel list

Benjamin Herrenschmidt wrote:
> On Mon, 2007-07-09 at 17:39 +1000, Nick Piggin wrote:
> 
> 
>>Would it be better off to start off with a new API for this? The
>>mmu gather I think is traditionally entirely for dealing with
>>page removal...
> 
> 
> It would be weird because the new API would mostly duplicate this one,
> and we would end up with duplicated hooks..

They could just #define one to the other though, there are only a small
number of them. Is there a downside to not making them distinct? i386
for example probably would just keep doing a tlb flush for fork and not
want to worry about touching the tlb gather stuff.

-- 
SUSE Labs, Novell Inc.


* Re: [RFC/PATCH] Use mmu_gather for fork() instead of flush_tlb_mm()
  2007-07-09  9:29         ` Nick Piggin
@ 2007-07-09  9:47           ` Benjamin Herrenschmidt
  2007-07-09 10:12             ` Nick Piggin
  0 siblings, 1 reply; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-09  9:47 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, Linux Kernel list

On Mon, 2007-07-09 at 19:29 +1000, Nick Piggin wrote:
> They could just #define one to the other though, there are only a
> small
> number of them. Is there a downside to not making them distinct? i386
> for example probably would just keep doing a tlb flush for fork and
> not
> want to worry about touching the tlb gather stuff.

But the tlb gather stuff just does ... a flush_tlb_mm() on x86 :-)

I really think it's the right API

Ben.



* Re: [RFC/PATCH] Use mmu_gather for fork() instead of flush_tlb_mm()
  2007-07-09  9:47           ` Benjamin Herrenschmidt
@ 2007-07-09 10:12             ` Nick Piggin
  2007-07-09 10:18               ` Nick Piggin
  2007-07-09 12:32               ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 12+ messages in thread
From: Nick Piggin @ 2007-07-09 10:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-mm, Linux Kernel list

Benjamin Herrenschmidt wrote:
> On Mon, 2007-07-09 at 19:29 +1000, Nick Piggin wrote:
> 
>>They could just #define one to the other though, there are only a
>>small
>>number of them. Is there a downside to not making them distinct? i386
>>for example probably would just keep doing a tlb flush for fork and
>>not
>>want to worry about touching the tlb gather stuff.
> 
> 
> But the tlb gather stuff just does ... a flush_tlb_mm() on x86 :-)

But it still does the get_cpu of the mmu gather data structure and
has to look in there and touch the cacheline. You're also having to
do more work when unlocking/relocking the ptl etc.


> I really think it's the right API



-- 
SUSE Labs, Novell Inc.


* Re: [RFC/PATCH] Use mmu_gather for fork() instead of flush_tlb_mm()
  2007-07-09 10:12             ` Nick Piggin
@ 2007-07-09 10:18               ` Nick Piggin
  2007-07-09 12:37                 ` Benjamin Herrenschmidt
  2007-07-09 12:32               ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 12+ messages in thread
From: Nick Piggin @ 2007-07-09 10:18 UTC (permalink / raw)
  Cc: Benjamin Herrenschmidt, linux-mm, Linux Kernel list

Nick Piggin wrote:
> Benjamin Herrenschmidt wrote:
> 
>> On Mon, 2007-07-09 at 19:29 +1000, Nick Piggin wrote:
>>
>>> They could just #define one to the other though, there are only a
>>> small
>>> number of them. Is there a downside to not making them distinct? i386
>>> for example probably would just keep doing a tlb flush for fork and
>>> not
>>> want to worry about touching the tlb gather stuff.
>>
>>
>>
>> But the tlb gather stuff just does ... a flush_tlb_mm() on x86 :-)
> 
> 
> But it still does the get_cpu of the mmu gather data structure and

To elaborate on this one... I realise that where this is currently
used in the kernel, everything is non-preemptible anyway because of
the ptl. And I also realise that -rt kernel issues don't really have a
bearing on the mainline kernel... but the generic implementation of
this API fundamentally operates on a per-cpu data structure that is
only required when tearing down page tables. That makes this
necessarily non-preemptible.

Which shows that it adds restrictions that may not otherwise be
required.


> has to look in there and touch the cacheline. You're also having to
> do more work when unlocking/relocking the ptl etc.
> 
> 
>> I really think it's the right API

OK, the *form* of the API is fine, I have no arguments. I just don't
know why you have to reuse the same thing. If you provided a new set of
names then you can trivially do a generic implementation which compiles
to exactly the same code for all architectures right now. That seems to
me like the right way to go...

-- 
SUSE Labs, Novell Inc.


* Re: [RFC/PATCH] Use mmu_gather for fork() instead of flush_tlb_mm()
  2007-07-09 10:12             ` Nick Piggin
  2007-07-09 10:18               ` Nick Piggin
@ 2007-07-09 12:32               ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-09 12:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, Linux Kernel list

On Mon, 2007-07-09 at 20:12 +1000, Nick Piggin wrote:
> Benjamin Herrenschmidt wrote:
> > On Mon, 2007-07-09 at 19:29 +1000, Nick Piggin wrote:
> > 
> >>They could just #define one to the other though, there are only a
> >>small
> >>number of them. Is there a downside to not making them distinct? i386
> >>for example probably would just keep doing a tlb flush for fork and
> >>not
> >>want to worry about touching the tlb gather stuff.
> > 
> > 
> > But the tlb gather stuff just does ... a flush_tlb_mm() on x86 :-)
> 
> But it still does the get_cpu of the mmu gather data structure and
> has to look in there and touch the cacheline. You're also having to
> do more work when unlocking/relocking the ptl etc.

Hrm... true. I forgot about the cost of get_cpu. Do you think it will
be measurable at all in practice ? I doubt it, but heh...

The place where I see a possible issue is indeed when dropping the
lock: in things like copy_pte_range, we would want to flush the batch in
order to be able to schedule.

That means we would probably end up doing flush_tlb_mm() once for every
lock drop instead of just once on x86, unless there's a smart way to
deal with that case... After all, when we do such lock dropping, we
don't actually need to dismiss the batch; the only reason we do so is to
re-enable preempt, because we may be migrated to another CPU.

But I wonder if it's worth bothering... we drop the lock when we have
need_resched() or there is contention on the lock. In both of these
cases, I doubt the added flush will matter noticeably...

If you think it will, then we could probably make the implementation a
bit more subtle: allow "putting" the current batch (to unblock
preemption) and only actually complete/flush it if a context switch
happens. It's not totally trivial to do with the current APIs though,
mostly because of the passing of start/end when completing the batch.

Technically, on x86, I believe we don't even need to do anything but the
-last- flush in fact. So we could just add a pair of tlb_pause/resume
for the lock dropping :-)
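
i.e. something like this around the lock break in copy_pte_range()
(purely hypothetical, tlb_gather_pause()/tlb_gather_resume() don't
exist today, the names are just to illustrate the idea):

	add_mm_rss(dst_mm, rss[0], rss[1]);
	pte_unmap_unlock(dst_pte - 1, dst_ptl);
	tlb_gather_pause(*tlbp);	/* flush if the arch needs it, allow preemption */
	cond_resched();
	tlb_gather_resume(*tlbp);	/* pick the batch back up, possibly on another CPU */
	if (addr != end)
		goto again;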

But if we're going to do a separate API, then what would you have it
look like ? It would have all of the same issues, no ?

I suppose the best thing is to do a few tests to see if there's any
measurable performance regression with my patch on x86 (btw, it may not
build with hugetlbfs, I forgot a #include). Do you have some test gear
around ? I lack x86 hardware myself...

I'm also interested in the possible impact on ia64. I wonder if they
can benefit from more targeted flushing in fork().

Ben.



* Re: [RFC/PATCH] Use mmu_gather for fork() instead of flush_tlb_mm()
  2007-07-09 10:18               ` Nick Piggin
@ 2007-07-09 12:37                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-09 12:37 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, Linux Kernel list

> To elaborate on this one... I realise for this one that in the kernel
> where this is currently used everything is non-preemptible anyway
> because of the ptl. And I also realise that -rt kernel issues don't
> really have a bearing on mainline kernel.. but the generic
> implementation of this API is fundamentally used to operate on a
> per-cpu data structure that is only required when tearing down page
> tables. That makes this necessarily non-preemptible.
> 
> Which shows that it adds more restrictions that may not otherwise be
> required.

Yes, it's a bit annoying, but not necessarily that bad. In fact, we
don't have to make it non-preemptible; we did it because it was easier
that way, I strongly suspect. The batch could actually be attached to
the mm rather than the CPU for that matter, no ? Or is there a
fundamental reason I'm not seeing why it -has- to be per-cpu ?

> > has to look in there and touch the cacheline. You're also having to
> > do more work when unlocking/relocking the ptl etc.
> > 
> > 
> >> I really think it's the right API
> 
> OK, the *form* of the API is fine, I have no arguments. I just don't
> know why you have to reuse the same thing. If you provided a new set of
> names then you can trivially do a generic implementation which compiles
> to exactly the same code for all architectures right now. That seems to
> me like the right way to go...

But that means two different APIs for almost the same thing. I'm trying
to clean up the mess, not add more :-) Besides, that "other" API would
have much the same issues overall, no ? Or do you want that "other" API
to not actually provide a per-cpu "mmu_gather" type structure at all in
asm-generic, but to basically just boil down to an empty inline for
creating the "other" batch, flush_tlb_mm() for finishing it, and an
empty inline for "adding" a PTE to the list of invalidation targets ?
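
In other words, on the asm-generic side, something that boils down to
this (made-up names, just to make the question concrete):

static inline void inval_gather_start(struct mm_struct *mm)
{
	/* nothing to do in the generic version */
}

static inline void inval_gather_pte(struct mm_struct *mm, pte_t *ptep,
				    unsigned long addr)
{
	/* nothing to do either, individual PTEs aren't tracked */
}

static inline void inval_gather_finish(struct mm_struct *mm)
{
	flush_tlb_mm(mm);	/* the one and only flush */
}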

Ben.


end of thread, other threads:[~2007-07-09 12:37 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-07-09  3:47 removing flush_tlb_mm as a generic hook ? Benjamin Herrenschmidt
2007-07-09  6:36 ` Benjamin Herrenschmidt
2007-07-09  6:45   ` [RFC/PATCH] Use mmu_gather for fork() instead of flush_tlb_mm() Benjamin Herrenschmidt
2007-07-09  7:39     ` Nick Piggin
2007-07-09  9:12       ` Benjamin Herrenschmidt
2007-07-09  9:29         ` Nick Piggin
2007-07-09  9:47           ` Benjamin Herrenschmidt
2007-07-09 10:12             ` Nick Piggin
2007-07-09 10:18               ` Nick Piggin
2007-07-09 12:37                 ` Benjamin Herrenschmidt
2007-07-09 12:32               ` Benjamin Herrenschmidt
2007-07-09  6:46   ` [RFC/PATCH] Use mmu_gather for /proc stuff " Benjamin Herrenschmidt
