* mmu_gather changes & generalization
  From: Benjamin Herrenschmidt @ 2007-07-10  5:46 UTC
  To: Hugh Dickins; +Cc: linux-mm, Nick Piggin

So to make things simple: I want to generalize the tlb batch interfaces
to all flushing, except single pages and possibly kernel page table
flushing.

I've discussed a bit with Nick today, and came up with this idea as a
first step toward possible bigger changes/cleanups. He told me you have
been working along the same lines, so I'd like your feedback there and
possibly whatever patches you are already cooking :-)

First, the situation/problems:

 - The problem with using the current mmu_gather is that it's per-cpu,
   thus needs to be flushed when we do lock dropping and might schedule.
   That means more work than necessary on things like x86 when using it
   for fork or mprotect for example.

 - Essentially, a simple batch data structure doesn't need to be per-CPU,
   it could just be on the stack. However, the current one is per-cpu
   because of this massive list of struct page's which is too big for a
   stack allocation.

Now the idea is to turn mmu_gather into a small stack based data
structure, with an optional pointer to the list of pages which remains,
for now, per-cpu.

The initializer for it (tlb_gather_init ?) would then take a flag/type
argument saying whether it is to be used for simple invalidations, or
invalidations + page freeing.

If used for page freeing, that pointer points to the per-cpu list of
pages and we do get_cpu (and put_cpu when finishing the batch). If used
for simple invalidations, we set that pointer to NULL and don't do
get_cpu/put_cpu.

That way, we don't have to finish/restart the batch unless we are
freeing pages. Thus users like fork() don't need to finish/restart the
batch, and thus, we have no overhead on x86 compared to the current
implementation (well, other than setting need_flush to 1 but that's
probably not close to measurable).

Thus, as far as unmap_vmas is concerned, the implementation remains
essentially the same. We just make it stack based at the top-level and
change the init call, and we can avoid passing double indirections down
the call chain, which is a nice cleanup.

An additional cleanup that it directly leads to is that rather than
finish/init when doing a lock-break, we can introduce a reinit call that
restarts a batch keeping the existing "settings" (we would still call
finish, it's just that the call pair would be finish/reinit). That way,
we don't have to "remember" things like fullmm like we have to do
currently.

Since it's no longer per-cpu, things like fullmm or mm are still valid
in the batch structure, and so we don't have to carry "fullmm" around
like we do in unmap_vmas (and like we would have to do in other users).
In fact, arch implementations can carry around even more state that they
might need and keep it around across lock breaks that way.

That would provide a good ground for then looking into changing the
per-cpu list of pages to something else, as Nick told me you were
working on.

Any comments, ideas, suggestions? I will have a go at implementing that
sometime this week I hope (I have some urgent stuff to do first) unless
you guys convince me it's worthless :-)

Note that I expect some perf. improvements on things like ppc32 on fork
due to being able to target for shooting only hash entries for PTEs that
have actually been turned into RO. The current ppc32 hash code just
basically re-walks the page tables in flush_tlb_mm() and shoots down all
PTEs that have been hashed.

Cheers,
Ben.
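A rough sketch of the calling pattern being proposed above (the flag
name TLB_GATHER_PAGES and the exact field layout are placeholders, not
from any posted patch; tlb_gather_init is the name Ben floats, and the
simplified tlb_finish_mmu(&tlb) matches the patch Hugh posts later in
this thread):

	struct mmu_gather {
		struct mm_struct *mm;
		unsigned int	need_flush;
		unsigned int	fullmm;
		struct page	**pages;	/* per-cpu page list, or NULL */
		unsigned int	nr;
		/* arch-specific state may be kept here across lock breaks */
	};

	void example_unmap_path(struct mm_struct *mm)
	{
		struct mmu_gather tlb;			/* lives on the stack */

		tlb_gather_init(&tlb, mm, TLB_GATHER_PAGES);
		/* ... clear PTEs, tlb_remove_tlb_entry(), tlb_remove_page() ... */
		tlb_finish_mmu(&tlb);	/* flush, free gathered pages, put_cpu() */
	}

For invalidation-only users such as fork or mprotect, the caller would
pass a flag meaning "no page freeing", tlb.pages would stay NULL, and no
get_cpu()/put_cpu() pair would be needed.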
* Re: mmu_gather changes & generalization
  From: Hugh Dickins @ 2007-07-11 20:45 UTC
  To: Benjamin Herrenschmidt; +Cc: linux-mm, Nick Piggin

On Tue, 10 Jul 2007, Benjamin Herrenschmidt wrote:

> So to make things simple: I want to generalize the tlb batch interfaces
> to all flushing, except single pages and possibly kernel page table
> flushing.
>
> Note that I expect some perf. improvements on things like ppc32 on fork
> due to being able to target for shooting only hash entries for PTEs that
> have actually been turned into RO. The current ppc32 hash code just
> basically re-walks the page tables in flush_tlb_mm() and shoots down all
> PTEs that have been hashed.

I've moved your last paragraph up here: that last sentence makes sense
of the whole thing, and I'm now much happier with what you're intending
than when I first just thought you were trying to complicate
flush_tlb_mm.

> I've discussed a bit with Nick today, and came up with this idea as a
> first step toward possible bigger changes/cleanups. He told me you have
> been working along the same lines, so I'd like your feedback there and
> possibly whatever patches you are already cooking :-)

I worked on it around 2.6.16, but wasn't satisfied with the result, and
then got stalled. What I should do now is update what I had to 2.6.22,
and in doing so remind myself of the limitations, and send the results
off to you - from what you say, I've a few days for that before you get
to work on it.

I think there were two issues that stalled me. One, I was mainly trying
to remove that horrid ZAP_BLOCK_SIZE from unmap_vmas, allowing
preemption more naturally; but failed to solve the truncation case, when
i_mmap_lock is held. Two, I needed to understand the different arches
better: though it's grand if you're coming aboard, because powerpc
(along with the seemingly similar sparc64) was one of the exceptions,
deferring the flush to context switch (I need to remind myself why that
was an issue). The other arches, even if not using asm-generic, seemed
pretty much generic: arm a little simpler than generic, ia64 a little
more baroque but more similar than it looked. Sounds like Martin may be
about to take s390 in its own direction.

The only arches I actually converted over were i386 and x86_64 (knowing
others would keep changing while I worked on the patch).

> First, the situation/problems:
>
> - The problem with using the current mmu_gather is that it's per-cpu,
> thus needs to be flushed when we do lock dropping and might schedule.
> That means more work than necessary on things like x86 when using it
> for fork or mprotect for example.

Yes, it dates from early 2.4, long before preemption latency placed
limits on our use of per-cpu areas.

> - Essentially, a simple batch data structure doesn't need to be
> per-CPU, it could just be on the stack. However, the current one is
> per-cpu because of this massive list of struct page's which is too big
> for a stack allocation.
>
> Now the idea is to turn mmu_gather into a small stack based data
> structure, with an optional pointer to the list of pages which remains,
> for now, per-cpu.

What I had was the small stack based data structure, with a small
fallback array of struct page pointers built in, and attempts to
allocate a full page atomically when this array is not big enough -
just go slower with the small array when that allocation fails.
There may be cleverer approaches, but it seems good enough.

> The initializer for it (tlb_gather_init ?) would then take a flag/type
> argument saying whether it is to be used for simple invalidations, or
> invalidations + page freeing.

Yes, I had some flags too.

> If used for page freeing, that pointer points to the per-cpu list of
> pages and we do get_cpu (and put_cpu when finishing the batch). If used
> for simple invalidations, we set that pointer to NULL and don't do
> get_cpu/put_cpu.

The particularly bad thing about get_cpu/put_cpu there is that the
efficiently big array stores up a lot of work for the future (when
swapcached pages are freed), which still has to be done with preemption
disabled.

Could the migrate_disable now proposed help there? At the time I had
that same idea, but discarded it because of the complication of
different tasks (different mms) needing the same per-cpu buffer; but
perhaps that isn't much of a complication in fact.

> That way, we don't have to finish/restart the batch unless we are
> freeing pages. Thus users like fork() don't need to finish/restart the
> batch, and thus, we have no overhead on x86 compared to the current
> implementation (well, other than setting need_flush to 1 but that's
> probably not close to measurable).

;)

> Thus, as far as unmap_vmas is concerned, the implementation remains
> essentially the same. We just make it stack based at the top-level and
> change the init call, and we can avoid passing double indirections down
> the call chain, which is a nice cleanup.

Yes, that cleanup I did do.

> An additional cleanup that it directly leads to is that rather than
> finish/init when doing a lock-break, we can introduce a reinit call that
> restarts a batch keeping the existing "settings" (we would still call
> finish, it's just that the call pair would be finish/reinit). That way,
> we don't have to "remember" things like fullmm like we have to do
> currently.
>
> Since it's no longer per-cpu, things like fullmm or mm are still valid
> in the batch structure, and so we don't have to carry "fullmm" around
> like we do in unmap_vmas (and like we would have to do in other users).
> In fact, arch implementations can carry around even more state that they
> might need and keep it around across lock breaks that way.

Yes, more good cleanup that fell out naturally.

> That would provide a good ground for then looking into changing the
> per-cpu list of pages to something else, as Nick told me you were
> working on.
>
> Any comments, ideas, suggestions? I will have a go at implementing that
> sometime this week I hope (I have some urgent stuff to do first) unless
> you guys convince me it's worthless :-)

So ignore my initial distrust, it all seems reasonable. But please
remind me, what other than dup_mmap would you be extending this to?

Hugh
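For reference, the patch Hugh posts later in this thread gives this idea
concrete shape: the batch lives on the caller's stack with a small
built-in fallback array, and tlb_is_full() opportunistically trades up
to a whole page with an atomic, no-warn allocation, simply flushing
earlier if that allocation fails (field names as in that patch):

	struct mmu_gather {
		struct mm_struct *mm;
		short nr;
		short max;
		short need_flush;	/* really unmapped some ptes? */
		short mode;		/* TLB_TRUNC / TLB_UNMAP / TLB_EXIT */
		struct page **pages;
		struct page *fallback_pages[TLB_FALLBACK_PAGES];
	};

The pages pointer starts out aimed at fallback_pages[]; when that fills,
tlb_is_full() tries __get_free_pages(GFP_ATOMIC|__GFP_NOWARN, 0), copies
the fallback entries across and raises max to PAGE_SIZE / sizeof(struct
page *) - otherwise the caller just flushes and carries on with the
small array.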
* Re: mmu_gather changes & generalization
  From: Benjamin Herrenschmidt @ 2007-07-11 23:18 UTC
  To: Hugh Dickins; +Cc: linux-mm, Nick Piggin

> I think there were two issues that stalled me. One, I was mainly
> trying to remove that horrid ZAP_BLOCK_SIZE from unmap_vmas, allowing
> preemption more naturally; but failed to solve the truncation case,
> when i_mmap_lock is held. Two, I needed to understand the different
> arches better: though it's grand if you're coming aboard, because
> powerpc (along with the seemingly similar sparc64) was one of the
> exceptions, deferring the flush to context switch (I need to remind
> myself why that was an issue).

Actually, that was broken on ppc64. It really needs to flush before we
drop the PTE lock or you may end up with duplicate entries in the hash
table, which is fatal. I fixed it recently by using a completely
different mechanism there. I now use the lazy mmu hooks to start/stop
batching of invalidations. But that's temporary.

One of the things I want to do with the batches is to add a hook for use
by ppc64 to be called before releasing the PTE lock :-) That or I may do
things a bit differently to make it safe to defer the flush. In any
case, that's orthogonal to the changes I'm thinking about.

> The other arches, even if not using
> asm-generic, seemed pretty much generic: arm a little simpler than
> generic, ia64 a little more baroque but more similar than it looked.
> Sounds like Martin may be about to take s390 in its own direction.

arm and sparc64 have a simpler version, which could be moved to
asm-generic/tlb-simple.h or so, for archs that either don't care much or
use a different batching mechanism (such as sparc64).

> The only arches I actually converted over were i386 and x86_64
> (knowing others would keep changing while I worked on the patch).

That's all right. I can take care of the ppc's and maybe sparc64 too.

.../...

> > - Essentially, a simple batch data structure doesn't need to be
> > per-CPU, it could just be on the stack. However, the current one is
> > per-cpu because of this massive list of struct page's which is too big
> > for a stack allocation.
> >
> > Now the idea is to turn mmu_gather into a small stack based data
> > structure, with an optional pointer to the list of pages which remains,
> > for now, per-cpu.
>
> What I had was the small stack based data structure, with a small
> fallback array of struct page pointers built in, and attempts to
> allocate a full page atomically when this array is not big enough -
> just go slower with the small array when that allocation fails.
> There may be cleverer approaches, but it seems good enough.

Yes, that's what Nick described. I had in mind an incremental approach,
starting with just splitting the batch into the stack based structure
and the page list and keeping the per-cpu page list, and then letting
you change that too separately, but we can do it the other way around.

.../...

> The particularly bad thing about get_cpu/put_cpu there is that the
> efficiently big array stores up a lot of work for the future (when
> swapcached pages are freed), which still has to be done with preemption
> disabled.
>
> Could the migrate_disable now proposed help there? At the time I had
> that same idea, but discarded it because of the complication of
> different tasks (different mms) needing the same per-cpu buffer; but
> perhaps that isn't much of a complication in fact.

I haven't looked at that migrate_disable thing yet. Google time :-)

> So ignore my initial distrust, it all seems reasonable. But please
> remind me, what other than dup_mmap would you be extending this to?

Initially, just that and that gremlin in fs/proc/task_mmu.c... (that is,
users of flush_tlb_mm(), thus removing it as a generic->arch hook).
Though I was thinking of also taking care of flush_tlb_range(), which
would then add mprotect to the list, and some hugetlb stuff.

BTW, talking about MMU interfaces... I've had a quick look yesterday and
there's a load of stuff in the various pgtable.h implementations that
isn't used at all anymore! For example, ptep_test_and_clear_dirty() is
no longer used by rmap, and there's a whole lot of others like that.

Also, there are some archs whose implementation is identical to
asm-generic for some of these.

I was thinking about doing a pass through the whole tree, getting rid of
everything that's not used or a duplicate of asm-generic while at it,
unless you have reasons not to do that or you know somebody already
doing it.

Cheers,
Ben.
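For context, the asm-generic/pgtable.h convention being discussed is the
__HAVE_ARCH_* override pattern; for example, the helper that Hugh's
patch further down this thread removes is defined generically as:

	#ifndef __HAVE_ARCH_PTE_CLEAR_NOT_PRESENT_FULL
	#define pte_clear_not_present_full(__mm, __address, __ptep, __full)	\
	do {									\
		pte_clear((__mm), (__address), (__ptep));			\
	} while (0)
	#endif

An architecture that defines __HAVE_ARCH_PTE_CLEAR_NOT_PRESENT_FULL in
its own pgtable.h supplies its own version instead. The cleanup Ben is
proposing targets arch versions that merely duplicate the generic
fallback, and helpers like ptep_test_and_clear_dirty() that no longer
have any callers (where the subtlety is whether a TLB flush belongs
inside them).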
* Re: mmu_gather changes & generalization
  From: Hugh Dickins @ 2007-07-12 16:42 UTC
  To: Benjamin Herrenschmidt; +Cc: linux-mm, Nick Piggin

On Thu, 12 Jul 2007, Benjamin Herrenschmidt wrote:
>
> > What I had was the small stack based data structure, with a small
> > fallback array of struct page pointers built in, and attempts to
> > allocate a full page atomically when this array is not big enough -
> > just go slower with the small array when that allocation fails.
> > There may be cleverer approaches, but it seems good enough.
>
> Yes, that's what Nick described. I had in mind an incremental approach,
> starting with just splitting the batch into the stack based structure
> and the page list and keeping the per-cpu page list, and then letting
> you change that too separately, but we can do it the other way around.

Oh, whatever I send, you just take it forward if it does look useful to
you, or forget it if it's just getting in your way, or mixing what you'd
prefer to be separate steps, or more trouble to follow someone else's
than to do your own: no problems. There is an overlap of cleanup between
what I have and what you're intending (e.g. **tlb -> *tlb), but it's
hardly beyond your capability to do that without my patch ;)

> BTW, talking about MMU interfaces... I've had a quick look yesterday and
> there's a load of stuff in the various pgtable.h implementations that
> isn't used at all anymore! For example, ptep_test_and_clear_dirty() is
> no longer used by rmap, and there's a whole lot of others like that.
>
> Also, there are some archs whose implementation is identical to
> asm-generic for some of these.
>
> I was thinking about doing a pass through the whole tree, getting rid of
> everything that's not used or a duplicate of asm-generic while at it,
> unless you have reasons not to do that or you know somebody already
> doing it.

If you wait for next -mm, I think you'll find Martin Schwidefsky has
done a little cleanup (including removing ptep_test_and_clear_dirty,
which did indeed pose some problem when it had no examples of use); and
Jan Beulich some other cleanups already in the last -mm (removing some
unused macros like pte_exec). But it sounds like you want to go a lot
further.

Hmm, well, if your cross-building environment is good enough that you
won't waste any of Andrew's time with the results, I guess go ahead.

Personally, I'm not in favour of removing every last unused macro: if
only from a debugging or learning point of view, it can be useful to see
what pte_exec is on each architecture, and it might be needed again
tomorrow. But I am very much in favour of reducing the spread of
unnecessary difference between architectures, the quantity of evidence
you have to wade through when considering them for changes.

Hugh
* Re: mmu_gather changes & generalization
  From: Benjamin Herrenschmidt @ 2007-07-13  0:51 UTC
  To: Hugh Dickins; +Cc: linux-mm, Nick Piggin

> If you wait for next -mm, I think you'll find Martin Schwidefsky has
> done a little cleanup (including removing ptep_test_and_clear_dirty,
> which did indeed pose some problem when it had no examples of use);
> and Jan Beulich some other cleanups already in the last -mm (removing
> some unused macros like pte_exec). But it sounds like you want to go
> a lot further.
>
> Hmm, well, if your cross-building environment is good enough that you
> won't waste any of Andrew's time with the results, I guess go ahead.

I have compilers for x86 (&64) and sparc (&64) at hand (in addition to
ppc flavors of course). I'm not sure I have anything else, but I can
always ask our local toolchain guru to set something up :-) I suppose I
need at least ia64 and possibly mips & arm (though for the latter it
seems to be harder to get the right version of the toolchain).

> Personally, I'm not in favour of removing every last unused macro:
> if only from a debugging or learning point of view, it can be useful
> to see what pte_exec is on each architecture, and it might be needed
> again tomorrow. But I am very much in favour of reducing the spread
> of unnecessary difference between architectures, the quantity of
> evidence you have to wade through when considering them for changes.

I don't care about the small macros that just set/test bits like
pte_exec. I want to remove the ones that do more than that and are
unused (ptep_test_and_clear_dirty() was a good example: there were some
semantic subtleties vs. flushing or not flushing, etc...). Those things
need to go if they aren't used.

I'll have a look after the next -mm to see what's left. There may be
nothing left to clean up :-)

Ben.
* Re: mmu_gather changes & generalization
  From: Hugh Dickins @ 2007-07-13 20:39 UTC
  To: Benjamin Herrenschmidt; +Cc: linux-mm, Nick Piggin

On Fri, 13 Jul 2007, Benjamin Herrenschmidt wrote:
>
> I don't care about the small macros that just set/test bits like
> pte_exec. I want to remove the ones that do more than that and are
> unused (ptep_test_and_clear_dirty() was a good example: there were some
> semantic subtleties vs. flushing or not flushing, etc...). Those things
> need to go if they aren't used.

Yes, David Rientjes and Zach Amsden and I kept going back and forth over
its sister ptep_test_and_clear_young(): it is hard to work out where to
place what kind of flush, particularly when it has no users. Martin
eliminating ptep_test_and_clear_dirty looked like a good answer.

> I'll have a look after the next -mm to see what's left. There may be
> nothing left to clean up :-)

It sounds like I misunderstood how far your cleanup was to reach.
Maybe there isn't such a big multi-arch-build deal as I implied.

Here's the 2.6.22 version of what I worked on just after 2.6.16. As I
said before, if you find it useful to build upon, do so; but if not,
not. From something you said earlier, I've a feeling we'll be fighting
over where to place the TLB flushes, inside or outside the page table
lock.

A few notes:

Keep in mind: hard to have low preemption latency with decent throughput
in zap_pte_range - easier than it once was now the ptl is taken lower
down, but big problem when truncation/invalidation holds i_mmap_lock to
scan the vma prio_tree - drop that lock and it has to restart. Not
satisfactorily solved yet (sometimes I think we should collapse the
prio_tree into a list for the duration of the unmapping: no problem
putting a marker in the list).

The mmu_gather of pages to be freed after TLB flush represents a
significant quantity of deferred work, particularly when those pages are
in swapcache: we do want preemption enabled while freeing them, but we
don't want to lose our place in the prio_tree very often.

Don't be misled by inclusion of patches to ia64 and powerpc
hugetlbpage.c, that's just to replace **tlb by *tlb in one function: the
real mmu_gather conversion is yet to be done there.

Only i386 and x86_64 have been converted, built and (inadequately)
tested so far: but most arches shouldn't need more than removing their
DEFINE_PER_CPU, with arm and arm26 probably just wanting to use more of
the generic code.

sparc64 uses a flush_tlb_pending technique which defers a lot of work
until context switch, when it cannot be preempted: I've given little
thought to it. powerpc appeared similar to sparc64, but you've changed
it since 2.6.16.

I've removed the start,end args to tlb_finish_mmu, and several levels
above it: the tlb_start_valid business in unmap_vmas always seemed ugly
to me, only ia64 has made use of them, and I cannot see why it shouldn't
just record first and last addr when its tlb_remove_tlb_entry is called.
But since ia64 isn't done yet, that end of it isn't seen in the patch.
Hugh --- arch/i386/mm/init.c | 1 arch/ia64/mm/hugetlbpage.c | 2 arch/powerpc/mm/hugetlbpage.c | 8 - arch/x86_64/mm/init.c | 2 include/asm-generic/pgtable.h | 12 -- include/asm-generic/tlb.h | 109 +++++++++++---------- include/asm-x86_64/tlbflush.h | 4 include/linux/hugetlb.h | 2 include/linux/mm.h | 11 -- include/linux/swap.h | 5 - mm/fremap.c | 2 mm/memory.c | 209 ++++++++++++++++-------------------------- mm/mmap.c | 34 ++---- mm/swap_state.c | 12 -- 14 files changed, 163 insertions(+), 250 deletions(-) --- 2.6.22/arch/i386/mm/init.c 2007-07-09 00:32:17.000000000 +0100 +++ linux/arch/i386/mm/init.c 2007-07-12 19:47:28.000000000 +0100 @@ -47,7 +47,6 @@ unsigned int __VMALLOC_RESERVE = 128 << 20; -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); unsigned long highstart_pfn, highend_pfn; static int noinline do_test_wp_bit(void); --- 2.6.22/arch/ia64/mm/hugetlbpage.c 2007-07-09 00:32:17.000000000 +0100 +++ linux/arch/ia64/mm/hugetlbpage.c 2007-07-12 19:47:28.000000000 +0100 @@ -114,7 +114,7 @@ follow_huge_pmd(struct mm_struct *mm, un return NULL; } -void hugetlb_free_pgd_range(struct mmu_gather **tlb, +void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling) { --- 2.6.22/arch/powerpc/mm/hugetlbpage.c 2007-07-09 00:32:17.000000000 +0100 +++ linux/arch/powerpc/mm/hugetlbpage.c 2007-07-12 19:47:28.000000000 +0100 @@ -240,7 +240,7 @@ static void hugetlb_free_pud_range(struc * * Must be called with pagetable lock held. */ -void hugetlb_free_pgd_range(struct mmu_gather **tlb, +void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling) { @@ -300,13 +300,13 @@ void hugetlb_free_pgd_range(struct mmu_g return; start = addr; - pgd = pgd_offset((*tlb)->mm, addr); + pgd = pgd_offset(tlb->mm, addr); do { - BUG_ON(get_slice_psize((*tlb)->mm, addr) != mmu_huge_psize); + BUG_ON(get_slice_psize(tlb->mm, addr) != mmu_huge_psize); next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) continue; - hugetlb_free_pud_range(*tlb, pgd, addr, next, floor, ceiling); + hugetlb_free_pud_range(tlb, pgd, addr, next, floor, ceiling); } while (pgd++, addr = next, addr != end); } --- 2.6.22/arch/x86_64/mm/init.c 2007-07-09 00:32:17.000000000 +0100 +++ linux/arch/x86_64/mm/init.c 2007-07-12 19:47:28.000000000 +0100 @@ -53,8 +53,6 @@ EXPORT_SYMBOL(dma_ops); static unsigned long dma_reserve __initdata; -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); - /* * NOTE: pagetable_init alloc all the fixmap pagetables contiguous on the * physical space so we can cache the place of the first one and move --- 2.6.22/include/asm-generic/pgtable.h 2007-07-09 00:32:17.000000000 +0100 +++ linux/include/asm-generic/pgtable.h 2007-07-12 19:47:28.000000000 +0100 @@ -111,18 +111,6 @@ do { \ }) #endif -/* - * Some architectures may be able to avoid expensive synchronization - * primitives when modifications are made to PTE's which are already - * not present, or in the process of an address space destruction. 
- */ -#ifndef __HAVE_ARCH_PTE_CLEAR_NOT_PRESENT_FULL -#define pte_clear_not_present_full(__mm, __address, __ptep, __full) \ -do { \ - pte_clear((__mm), (__address), (__ptep)); \ -} while (0) -#endif - #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH #define ptep_clear_flush(__vma, __address, __ptep) \ ({ \ --- 2.6.22/include/asm-generic/tlb.h 2006-11-29 21:57:37.000000000 +0000 +++ linux/include/asm-generic/tlb.h 2007-07-12 19:47:28.000000000 +0100 @@ -17,65 +17,77 @@ #include <asm/pgalloc.h> #include <asm/tlbflush.h> -/* - * For UP we don't need to worry about TLB flush - * and page free order so much.. - */ -#ifdef CONFIG_SMP - #ifdef ARCH_FREE_PTR_NR - #define FREE_PTR_NR ARCH_FREE_PTR_NR - #else - #define FREE_PTE_NR 506 - #endif - #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U) -#else - #define FREE_PTE_NR 1 - #define tlb_fast_mode(tlb) 1 -#endif +#define TLB_TRUNC 0 /* i_mmap_lock is held */ +#define TLB_UNMAP 1 /* normal munmap or zap */ +#define TLB_EXIT 2 /* tearing down whole mm */ + +#define TLB_FALLBACK_PAGES 8 /* a few entries on the stack */ /* struct mmu_gather is an opaque type used by the mm code for passing around * any data needed by arch specific code for tlb_remove_page. */ struct mmu_gather { - struct mm_struct *mm; - unsigned int nr; /* set to ~0U means fast mode */ - unsigned int need_flush;/* Really unmapped some ptes? */ - unsigned int fullmm; /* non-zero means full mm flush */ - struct page * pages[FREE_PTE_NR]; + struct mm_struct *mm; + short nr; + short max; + short need_flush; /* Really unmapped some ptes? */ + short mode; + struct page ** pages; + struct page * fallback_pages[TLB_FALLBACK_PAGES]; }; -/* Users of the generic TLB shootdown code must declare this storage space. */ -DECLARE_PER_CPU(struct mmu_gather, mmu_gathers); - /* tlb_gather_mmu - * Return a pointer to an initialized struct mmu_gather. + * Initialize struct mmu_gather. */ -static inline struct mmu_gather * -tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush) +static inline void +tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, int mode) { - struct mmu_gather *tlb = &get_cpu_var(mmu_gathers); - tlb->mm = mm; - - /* Use fast mode if only one CPU is online */ - tlb->nr = num_online_cpus() > 1 ? 
0U : ~0U; - - tlb->fullmm = full_mm_flush; - - return tlb; + tlb->nr = 0; + tlb->max = TLB_FALLBACK_PAGES; + tlb->need_flush = 0; + tlb->mode = mode; + tlb->pages = tlb->fallback_pages; + /* temporarily erase fallback_pages for clearer debug traces */ + memset(tlb->fallback_pages, 0, sizeof(tlb->fallback_pages)); } static inline void -tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) +tlb_flush_mmu(struct mmu_gather *tlb) { if (!tlb->need_flush) return; tlb->need_flush = 0; tlb_flush(tlb); - if (!tlb_fast_mode(tlb)) { - free_pages_and_swap_cache(tlb->pages, tlb->nr); - tlb->nr = 0; + free_pages_and_swap_cache(tlb->pages, tlb->nr); + tlb->nr = 0; +} + +static inline int +tlb_is_extensible(struct mmu_gather *tlb) +{ +#ifdef CONFIG_PREEMPT + return tlb->mode != TLB_TRUNC; +#else + return 1; +#endif +} + +static inline int +tlb_is_full(struct mmu_gather *tlb) +{ + if (tlb->nr < tlb->max) + return 0; + if (tlb->pages == tlb->fallback_pages && tlb_is_extensible(tlb)) { + struct page **pages = (void *)__get_free_pages(GFP_ATOMIC|__GFP_NOWARN, 0); + if (pages) { + memcpy(pages, tlb->pages, sizeof(tlb->fallback_pages)); + tlb->pages = pages; + tlb->max = PAGE_SIZE / sizeof(struct page *); + return 0; + } } + return 1; } /* tlb_finish_mmu @@ -83,14 +95,11 @@ tlb_flush_mmu(struct mmu_gather *tlb, un * that were required. */ static inline void -tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) +tlb_finish_mmu(struct mmu_gather *tlb) { - tlb_flush_mmu(tlb, start, end); - - /* keep the page table cache within bounds */ - check_pgt_cache(); - - put_cpu_var(mmu_gathers); + tlb_flush_mmu(tlb); + if (tlb->pages != tlb->fallback_pages) + free_pages((unsigned long)tlb->pages, 0); } /* tlb_remove_page @@ -100,14 +109,10 @@ tlb_finish_mmu(struct mmu_gather *tlb, u */ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page) { + if (tlb->nr >= tlb->max) + tlb_flush_mmu(tlb); tlb->need_flush = 1; - if (tlb_fast_mode(tlb)) { - free_page_and_swap_cache(page); - return; - } tlb->pages[tlb->nr++] = page; - if (tlb->nr >= FREE_PTE_NR) - tlb_flush_mmu(tlb, 0, 0); } /** --- 2.6.22/include/asm-x86_64/tlbflush.h 2007-07-09 00:32:17.000000000 +0100 +++ linux/include/asm-x86_64/tlbflush.h 2007-07-12 19:47:28.000000000 +0100 @@ -86,10 +86,6 @@ static inline void flush_tlb_range(struc #define TLBSTATE_OK 1 #define TLBSTATE_LAZY 2 -/* Roughly an IPI every 20MB with 4k pages for freeing page table - ranges. Cost is about 42k of memory for each CPU. 
*/ -#define ARCH_FREE_PTE_NR 5350 - #endif #define flush_tlb_kernel_range(start, end) flush_tlb_all() --- 2.6.22/include/linux/hugetlb.h 2007-07-09 00:32:17.000000000 +0100 +++ linux/include/linux/hugetlb.h 2007-07-12 19:47:28.000000000 +0100 @@ -52,7 +52,7 @@ void hugetlb_change_protection(struct vm #ifndef ARCH_HAS_HUGETLB_FREE_PGD_RANGE #define hugetlb_free_pgd_range free_pgd_range #else -void hugetlb_free_pgd_range(struct mmu_gather **tlb, unsigned long addr, +void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); #endif --- 2.6.22/include/linux/mm.h 2007-07-09 00:32:17.000000000 +0100 +++ linux/include/linux/mm.h 2007-07-12 19:47:28.000000000 +0100 @@ -738,15 +738,12 @@ struct zap_details { }; struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t); -unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, +void zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *); -unsigned long unmap_vmas(struct mmu_gather **tlb, - struct vm_area_struct *start_vma, unsigned long start_addr, - unsigned long end_addr, unsigned long *nr_accounted, - struct zap_details *); -void free_pgd_range(struct mmu_gather **tlb, unsigned long addr, +void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma); +void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); -void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma, +void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, unsigned long floor, unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); --- 2.6.22/include/linux/swap.h 2007-04-26 04:08:32.000000000 +0100 +++ linux/include/linux/swap.h 2007-07-12 19:47:28.000000000 +0100 @@ -232,7 +232,6 @@ extern void delete_from_swap_cache(struc extern int move_to_swap_cache(struct page *, swp_entry_t); extern int move_from_swap_cache(struct page *, unsigned long, struct address_space *); -extern void free_page_and_swap_cache(struct page *); extern void free_pages_and_swap_cache(struct page **, int); extern struct page * lookup_swap_cache(swp_entry_t); extern struct page * read_swap_cache_async(swp_entry_t, struct vm_area_struct *vma, @@ -287,9 +286,7 @@ static inline void disable_swap_token(vo #define si_swapinfo(val) \ do { (val)->freeswap = (val)->totalswap = 0; } while (0) /* only sparc can not include linux/pagemap.h in this file - * so leave page_cache_release and release_pages undeclared... */ -#define free_page_and_swap_cache(page) \ - page_cache_release(page) + * so leave release_pages undeclared... */ #define free_pages_and_swap_cache(pages, nr) \ release_pages((pages), (nr), 0); --- 2.6.22/mm/fremap.c 2007-02-04 18:44:54.000000000 +0000 +++ linux/mm/fremap.c 2007-07-12 19:47:28.000000000 +0100 @@ -39,7 +39,7 @@ static int zap_pte(struct mm_struct *mm, } else { if (!pte_file(pte)) free_swap_and_cache(pte_to_swp_entry(pte)); - pte_clear_not_present_full(mm, addr, ptep, 0); + pte_clear(mm, addr, ptep); } return !!page; } --- 2.6.22/mm/memory.c 2007-07-09 00:32:17.000000000 +0100 +++ linux/mm/memory.c 2007-07-12 19:47:28.000000000 +0100 @@ -203,7 +203,7 @@ static inline void free_pud_range(struct * * Must be called with pagetable lock held. 
*/ -void free_pgd_range(struct mmu_gather **tlb, +void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling) { @@ -254,19 +254,19 @@ void free_pgd_range(struct mmu_gather ** return; start = addr; - pgd = pgd_offset((*tlb)->mm, addr); + pgd = pgd_offset(tlb->mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) continue; - free_pud_range(*tlb, pgd, addr, next, floor, ceiling); + free_pud_range(tlb, pgd, addr, next, floor, ceiling); } while (pgd++, addr = next, addr != end); - if (!(*tlb)->fullmm) - flush_tlb_pgtables((*tlb)->mm, start, end); + if (tlb->mode != TLB_EXIT) + flush_tlb_pgtables(tlb->mm, start, end); } -void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma, +void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long floor, unsigned long ceiling) { while (vma) { @@ -298,6 +298,9 @@ void free_pgtables(struct mmu_gather **t } vma = next; } + + /* keep the page table cache within bounds */ + check_pgt_cache(); } int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address) @@ -621,24 +624,36 @@ int copy_page_range(struct mm_struct *ds static unsigned long zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, - long *zap_work, struct zap_details *details) + struct zap_details *details) { + spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL; struct mm_struct *mm = tlb->mm; pte_t *pte; spinlock_t *ptl; int file_rss = 0; int anon_rss = 0; + int progress; +again: + progress = 0; pte = pte_offset_map_lock(mm, pmd, addr, &ptl); arch_enter_lazy_mmu_mode(); do { - pte_t ptent = *pte; + pte_t ptent; + + if (progress >= 64) { + progress = 0; + if (need_resched() || + need_lockbreak(ptl) || + (i_mmap_lock && need_lockbreak(i_mmap_lock))) + break; + } + ptent = *pte; if (pte_none(ptent)) { - (*zap_work)--; + progress++; continue; } - - (*zap_work) -= PAGE_SIZE; + progress += 8; if (pte_present(ptent)) { struct page *page; @@ -662,8 +677,10 @@ static unsigned long zap_pte_range(struc page->index > details->last_index)) continue; } + if (tlb_is_full(tlb)) + break; ptent = ptep_get_and_clear_full(mm, addr, pte, - tlb->fullmm); + tlb->mode == TLB_EXIT); tlb_remove_tlb_entry(tlb, pte, addr); if (unlikely(!page)) continue; @@ -693,20 +710,27 @@ static unsigned long zap_pte_range(struc continue; if (!pte_file(ptent)) free_swap_and_cache(pte_to_swp_entry(ptent)); - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); - } while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0)); + pte_clear(mm, addr, pte); + } while (pte++, addr += PAGE_SIZE, addr != end); add_mm_rss(mm, file_rss, anon_rss); arch_leave_lazy_mmu_mode(); pte_unmap_unlock(pte - 1, ptl); + if (!i_mmap_lock) { + cond_resched(); + if (tlb_is_full(tlb)) + tlb_flush_mmu(tlb); + if (addr != end) + goto again; + } return addr; } static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud, unsigned long addr, unsigned long end, - long *zap_work, struct zap_details *details) + struct zap_details *details) { pmd_t *pmd; unsigned long next; @@ -715,20 +739,18 @@ static inline unsigned long zap_pmd_rang do { next = pmd_addr_end(addr, end); if (pmd_none_or_clear_bad(pmd)) { - (*zap_work)--; + addr = next; continue; } - next = zap_pte_range(tlb, vma, pmd, addr, next, - zap_work, details); - } while (pmd++, addr = next, (addr != end && *zap_work > 0)); - + addr = zap_pte_range(tlb, 
vma, pmd, addr, next, details); + } while (pmd++, addr == next && addr != end); return addr; } static inline unsigned long zap_pud_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr, unsigned long end, - long *zap_work, struct zap_details *details) + struct zap_details *details) { pud_t *pud; unsigned long next; @@ -737,20 +759,18 @@ static inline unsigned long zap_pud_rang do { next = pud_addr_end(addr, end); if (pud_none_or_clear_bad(pud)) { - (*zap_work)--; + addr = next; continue; } - next = zap_pmd_range(tlb, vma, pud, addr, next, - zap_work, details); - } while (pud++, addr = next, (addr != end && *zap_work > 0)); - + addr = zap_pmd_range(tlb, vma, pud, addr, next, details); + } while (pud++, addr == next && addr != end); return addr; } static unsigned long unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long addr, unsigned long end, - long *zap_work, struct zap_details *details) + struct zap_details *details) { pgd_t *pgd; unsigned long next; @@ -764,137 +784,62 @@ static unsigned long unmap_page_range(st do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) { - (*zap_work)--; + addr = next; continue; } - next = zap_pud_range(tlb, vma, pgd, addr, next, - zap_work, details); - } while (pgd++, addr = next, (addr != end && *zap_work > 0)); + addr = zap_pud_range(tlb, vma, pgd, addr, next, details); + } while (pgd++, addr == next && addr != end); tlb_end_vma(tlb, vma); - return addr; } -#ifdef CONFIG_PREEMPT -# define ZAP_BLOCK_SIZE (8 * PAGE_SIZE) -#else -/* No preempt: go for improved straight-line efficiency */ -# define ZAP_BLOCK_SIZE (1024 * PAGE_SIZE) -#endif - /** * unmap_vmas - unmap a range of memory covered by a list of vma's - * @tlbp: address of the caller's struct mmu_gather + * @tlb: address of the caller's struct mmu_gather * @vma: the starting vma - * @start_addr: virtual address at which to start unmapping - * @end_addr: virtual address at which to end unmapping - * @nr_accounted: Place number of unmapped pages in vm-accountable vma's here - * @details: details of nonlinear truncation or shared cache invalidation - * - * Returns the end address of the unmapping (restart addr if interrupted). * * Unmap all pages in the vma list. - * - * We aim to not hold locks for too long (for scheduling latency reasons). - * So zap pages in ZAP_BLOCK_SIZE bytecounts. This means we need to - * return the ending mmu_gather to the caller. - * - * Only addresses between `start' and `end' will be unmapped. - * * The VMA list must be sorted in ascending virtual address order. - * - * unmap_vmas() assumes that the caller will flush the whole unmapped address - * range after unmap_vmas() returns. So the only responsibility here is to - * ensure that any thus-far unmapped pages are flushed before unmap_vmas() - * drops the lock and schedules. - */ -unsigned long unmap_vmas(struct mmu_gather **tlbp, - struct vm_area_struct *vma, unsigned long start_addr, - unsigned long end_addr, unsigned long *nr_accounted, - struct zap_details *details) + */ +void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *vma) { - long zap_work = ZAP_BLOCK_SIZE; - unsigned long tlb_start = 0; /* For tlb_finish_mmu */ - int tlb_start_valid = 0; - unsigned long start = start_addr; - spinlock_t *i_mmap_lock = details? 
details->i_mmap_lock: NULL; - int fullmm = (*tlbp)->fullmm; - - for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { - unsigned long end; - - start = max(vma->vm_start, start_addr); - if (start >= vma->vm_end) - continue; - end = min(vma->vm_end, end_addr); - if (end <= vma->vm_start) - continue; + unsigned long nr_accounted = 0; + while (vma) { if (vma->vm_flags & VM_ACCOUNT) - *nr_accounted += (end - start) >> PAGE_SHIFT; - - while (start != end) { - if (!tlb_start_valid) { - tlb_start = start; - tlb_start_valid = 1; - } - - if (unlikely(is_vm_hugetlb_page(vma))) { - unmap_hugepage_range(vma, start, end); - zap_work -= (end - start) / - (HPAGE_SIZE / PAGE_SIZE); - start = end; - } else - start = unmap_page_range(*tlbp, vma, - start, end, &zap_work, details); - - if (zap_work > 0) { - BUG_ON(start != end); - break; - } + nr_accounted += vma_pages(vma); - tlb_finish_mmu(*tlbp, tlb_start, start); - - if (need_resched() || - (i_mmap_lock && need_lockbreak(i_mmap_lock))) { - if (i_mmap_lock) { - *tlbp = NULL; - goto out; - } - cond_resched(); - } - - *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm); - tlb_start_valid = 0; - zap_work = ZAP_BLOCK_SIZE; - } + if (unlikely(is_vm_hugetlb_page(vma))) + unmap_hugepage_range(vma, vma->vm_start, vma->vm_end); + else + unmap_page_range(tlb, vma, vma->vm_start, vma->vm_end, NULL); + vma = vma->vm_next; } -out: - return start; /* which is now the end (or restart) address */ + + vm_unacct_memory(nr_accounted); } /** * zap_page_range - remove user pages in a given range * @vma: vm_area_struct holding the applicable pages * @address: starting address of pages to zap - * @size: number of bytes to zap + * @end: ending address of pages to zap * @details: details of nonlinear truncation or shared cache invalidation */ -unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, +void zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { struct mm_struct *mm = vma->vm_mm; - struct mmu_gather *tlb; + struct mmu_gather tlb; unsigned long end = address + size; - unsigned long nr_accounted = 0; - lru_add_drain(); - tlb = tlb_gather_mmu(mm, 0); + BUG_ON(is_vm_hugetlb_page(vma)); + BUG_ON(address < vma->vm_start || end > vma->vm_end); + + tlb_gather_mmu(&tlb, mm, TLB_UNMAP); update_hiwater_rss(mm); - end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); - if (tlb) - tlb_finish_mmu(tlb, address, end); - return end; + unmap_page_range(&tlb, vma, address, end, details); + tlb_finish_mmu(&tlb); } /* @@ -1822,6 +1767,8 @@ static int unmap_mapping_range_vma(struc unsigned long start_addr, unsigned long end_addr, struct zap_details *details) { + struct mm_struct *mm = vma->vm_mm; + struct mmu_gather tlb; unsigned long restart_addr; int need_break; @@ -1836,8 +1783,12 @@ again: } } - restart_addr = zap_page_range(vma, start_addr, - end_addr - start_addr, details); + tlb_gather_mmu(&tlb, mm, TLB_TRUNC); + update_hiwater_rss(mm); + restart_addr = unmap_page_range(&tlb, vma, + start_addr, end_addr, details); + tlb_finish_mmu(&tlb); + need_break = need_resched() || need_lockbreak(details->i_mmap_lock); --- 2.6.22/mm/mmap.c 2007-07-09 00:32:17.000000000 +0100 +++ linux/mm/mmap.c 2007-07-12 19:47:28.000000000 +0100 @@ -36,8 +36,7 @@ #endif static void unmap_region(struct mm_struct *mm, - struct vm_area_struct *vma, struct vm_area_struct *prev, - unsigned long start, unsigned long end); + struct vm_area_struct *vma, struct vm_area_struct *prev); /* * WARNING: the debugging will use 
recursive algorithms so never enable this @@ -1165,7 +1164,7 @@ unmap_and_free_vma: fput(file); /* Undo any partial mapping done by a device driver. */ - unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end); + unmap_region(mm, vma, prev); charged = 0; free_vma: kmem_cache_free(vm_area_cachep, vma); @@ -1677,21 +1676,17 @@ static void remove_vma_list(struct mm_st * Called with the mm semaphore held. */ static void unmap_region(struct mm_struct *mm, - struct vm_area_struct *vma, struct vm_area_struct *prev, - unsigned long start, unsigned long end) + struct vm_area_struct *vma, struct vm_area_struct *prev) { struct vm_area_struct *next = prev? prev->vm_next: mm->mmap; - struct mmu_gather *tlb; - unsigned long nr_accounted = 0; + struct mmu_gather tlb; - lru_add_drain(); - tlb = tlb_gather_mmu(mm, 0); + tlb_gather_mmu(&tlb, mm, TLB_UNMAP); update_hiwater_rss(mm); - unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL); - vm_unacct_memory(nr_accounted); + unmap_vmas(&tlb, vma); free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? next->vm_start: 0); - tlb_finish_mmu(tlb, start, end); + tlb_finish_mmu(&tlb); } /* @@ -1829,7 +1824,7 @@ int do_munmap(struct mm_struct *mm, unsi * Remove the vma's, and unmap the actual pages */ detach_vmas_to_be_unmapped(mm, vma, prev, end); - unmap_region(mm, vma, prev, start, end); + unmap_region(mm, vma, prev); /* Fix up all other VM information */ remove_vma_list(mm, vma); @@ -1968,23 +1963,18 @@ EXPORT_SYMBOL(do_brk); /* Release all mmaps. */ void exit_mmap(struct mm_struct *mm) { - struct mmu_gather *tlb; + struct mmu_gather tlb; struct vm_area_struct *vma = mm->mmap; - unsigned long nr_accounted = 0; - unsigned long end; /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); - lru_add_drain(); flush_cache_mm(mm); - tlb = tlb_gather_mmu(mm, 1); + tlb_gather_mmu(&tlb, mm, TLB_EXIT); /* Don't update_hiwater_rss(mm) here, do_exit already did */ - /* Use -1 here to ensure all VMAs in the mm are unmapped */ - end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL); - vm_unacct_memory(nr_accounted); + unmap_vmas(&tlb, vma); free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); - tlb_finish_mmu(tlb, 0, end); + tlb_finish_mmu(&tlb); /* * Walk the list again, actually closing and freeing it, --- 2.6.22/mm/swap_state.c 2006-09-20 04:42:06.000000000 +0100 +++ linux/mm/swap_state.c 2007-07-12 19:47:28.000000000 +0100 @@ -258,16 +258,6 @@ static inline void free_swap_cache(struc } } -/* - * Perform a free_page(), also freeing any swap cache associated with - * this page if it is the last user of the page. - */ -void free_page_and_swap_cache(struct page *page) -{ - free_swap_cache(page); - page_cache_release(page); -} - /* * Passed an array of pages, drop them all from swapcache and then release * them. They are removed from the LRU and freed if this is their last use. @@ -286,6 +276,8 @@ void free_pages_and_swap_cache(struct pa release_pages(pagep, todo, 0); pagep += todo; nr -= todo; + if (nr && !preempt_count()) + cond_resched(); } } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: mmu_gather changes & generalization
  From: Benjamin Herrenschmidt @ 2007-07-13 22:46 UTC
  To: Hugh Dickins; +Cc: linux-mm, Nick Piggin

> Here's the 2.6.22 version of what I worked on just after 2.6.16.
> As I said before, if you find it useful to build upon, do so;
> but if not, not. From something you said earlier, I've a
> feeling we'll be fighting over where to place the TLB flushes,
> inside or outside the page table lock.

ppc64 needs inside, but I don't want to change the behaviour for others,
so I'll probably do a pair of tlb_after_pte_lock and
tlb_before_pte_unlock that do nothing by default and that ppc64 can use
to do the flush before unlocking.

It seems like virtualization stuff needs that too, thus we could replace
a whole lot of the lazy_mmu stuff in there with those 2 hooks, making
things a little bit less confusing.

> A few notes:
>
> Keep in mind: hard to have low preemption latency with decent throughput
> in zap_pte_range - easier than it once was now the ptl is taken lower down,
> but big problem when truncation/invalidation holds i_mmap_lock to scan the
> vma prio_tree - drop that lock and it has to restart. Not satisfactorily
> solved yet (sometimes I think we should collapse the prio_tree into a list
> for the duration of the unmapping: no problem putting a marker in the list).

I don't intend to change the behaviour at this stage, only the
interfaces, though I expect the new interfaces to make it easier to toy
around with the behaviour.

> The mmu_gather of pages to be freed after TLB flush represents a significant
> quantity of deferred work, particularly when those pages are in swapcache:
> we do want preemption enabled while freeing them, but we don't want to lose
> our place in the prio_tree very often.

Same comment as above :-) I understand the problem but I don't see any
magical way of making things better here, so I'll concentrate on
cleaning up the interfaces while keeping the exact same behaviour, and
then I can have a second look to see if I come up with some idea on how
to make things better.

> Don't be misled by inclusion of patches to ia64 and powerpc hugetlbpage.c,
> that's just to replace **tlb by *tlb in one function: the real mmu_gather
> conversion is yet to be done there.

Ok.

> Only i386 and x86_64 have been converted, built and (inadequately) tested so
> far: but most arches shouldn't need more than removing their DEFINE_PER_CPU,
> with arm and arm26 probably just wanting to use more of the generic code.
>
> sparc64 uses a flush_tlb_pending technique which defers a lot of work until
> context switch, when it cannot be preempted: I've given little thought to it.
> powerpc appeared similar to sparc64, but you've changed it since 2.6.16.

powerpc64 used to do that, but I had that massive bug because it needs
to flush before the page table lock is released (or we might end up with
duplicates in the hash table, which is fatal).

> I've removed the start,end args to tlb_finish_mmu, and several levels above
> it: the tlb_start_valid business in unmap_vmas always seemed ugly to me,
> only ia64 has made use of them, and I cannot see why it shouldn't just
> record first and last addr when its tlb_remove_tlb_entry is called.
> But since ia64 isn't done yet, that end of it isn't seen in the patch.

Agreed. I'd rather have archs that care explicitly record start/end.

One thing I'm also thinking about doing is slightly changing the way the
"generic" gather interface is defined. Currently, you have some things
you can define in the arch (such as tlb_start/end_vma), some things that
are totally defined for you, such as the struct mmu_gather itself,
etc... Thus some archs have to replace the whole thing, some can hook
half way through, but in general, I find it confusing.

I think we could do better by having the mmu_gather contain an
mmu_gather_arch field (arch defined, for additional fields in there) and
use for -all- the mmu_gather functions something like

	#ifndef tlb_start_vma
	static inline void tlb_start_vma(...)
	{
		..../...
	}
	#endif

Thus archs that need their own version would just do:

	static inline void tlb_start_vma(...)
	{
		..../...
	}
	#define tlb_start_vma tlb_start_vma

Not sure about that yet, waiting for people to flame me with "that's
horrible" :-)

Ben.
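To sketch how those two pieces could fit together (this only illustrates
the proposed shape - struct mmu_gather_arch is the field Ben suggests
above, and the per-arch example body is hypothetical):

	/* include/asm-generic/tlb.h */
	struct mmu_gather {
		struct mm_struct	*mm;
		/* ... generic batching fields ... */
		struct mmu_gather_arch	arch;	/* arch defined, may be empty */
	};

	#ifndef tlb_start_vma
	static inline void tlb_start_vma(struct mmu_gather *tlb,
					 struct vm_area_struct *vma)
	{
	}
	#endif

	/* in an arch's asm/tlb.h, before it includes asm-generic/tlb.h: */
	static inline void tlb_start_vma(struct mmu_gather *tlb,
					 struct vm_area_struct *vma)
	{
		flush_cache_range(vma, vma->vm_start, vma->vm_end);
	}
	#define tlb_start_vma tlb_start_vma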
* Re: mmu_gather changes & generalization
  From: Hugh Dickins @ 2007-07-14 15:33 UTC
  To: Benjamin Herrenschmidt; +Cc: linux-mm, Nick Piggin

On Sat, 14 Jul 2007, Benjamin Herrenschmidt wrote:
>
> > Here's the 2.6.22 version of what I worked on just after 2.6.16.
> > As I said before, if you find it useful to build upon, do so;
> > but if not, not. From something you said earlier, I've a
> > feeling we'll be fighting over where to place the TLB flushes,
> > inside or outside the page table lock.
>
> ppc64 needs inside, but I don't want to change the behaviour for others,
> so I'll probably do a pair of tlb_after_pte_lock and
> tlb_before_pte_unlock that do nothing by default and that ppc64 can use
> to do the flush before unlocking.

Yeah, something like that, I suppose (better naming!). And I think your
ppc64 implementation will do best just to flush TLB in _before, leaving
the page freeing to the _after; whereas most will do them both in the
_after.

> It seems like virtualization stuff needs that too, thus we could replace
> a whole lot of the lazy_mmu stuff in there with those 2 hooks, making
> things a little bit less confusing.

That would be good, I didn't look into those lazy_mmu things at all:
we're in perfect agreement that the fewer such the better.

> > A few notes:
> >
> > Keep in mind: hard to have low preemption latency with decent throughput
> > in zap_pte_range - easier than it once was now the ptl is taken lower down,
> > but big problem when truncation/invalidation holds i_mmap_lock to scan the
> > vma prio_tree - drop that lock and it has to restart. Not satisfactorily
> > solved yet (sometimes I think we should collapse the prio_tree into a list
> > for the duration of the unmapping: no problem putting a marker in the list).
>
> I don't intend to change the behaviour at this stage, only the
> interfaces, though I expect the new interfaces to make it easier to toy
> around with the behaviour.

Right, that may lead you to set aside a lot of what I did for now.

..../... (if I may echo you ;)

> I think we could do better by having the mmu_gather contain an
> mmu_gather_arch field (arch defined, for additional fields in there) and
> use for -all- the mmu_gather functions something like
>
> #ifndef tlb_start_vma
> static inline void tlb_start_vma(...)
> {
> 	..../...
> }
> #endif
>
> Thus archs that need their own version would just do:
>
> static inline void tlb_start_vma(...)
> {
> 	..../...
> }
> #define tlb_start_vma tlb_start_vma
>
> Not sure about that yet, waiting for people to flame me with "that's
> horrible" :-)

No, sounds good to me, no flame from this direction: it's exactly what
Linus prefers to the __HAVE_ARCH... stuff.

Hugh
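To make that split concrete, the default hooks and a ppc64-style
override might look roughly like this (the hook names are the ones
proposed upthread; the placement shown in zap_pte_range() is only a
sketch, not a posted patch):

	/* asm-generic: by default, nothing special around the PTE lock */
	#ifndef tlb_before_pte_unlock
	static inline void tlb_before_pte_unlock(struct mmu_gather *tlb) { }
	#endif

	/* in zap_pte_range(), just before the lock is dropped: */
	add_mm_rss(mm, file_rss, anon_rss);
	arch_leave_lazy_mmu_mode();
	tlb_before_pte_unlock(tlb);	/* ppc64: flush hash/TLB entries here */
	pte_unmap_unlock(pte - 1, ptl);

Most architectures would keep flushing the TLB and freeing the gathered
pages after the lock is dropped, as today; ppc64 would flush in the
_before hook, so no stale hash entries can survive once the PTE lock is
released, and leave only the page freeing for afterwards.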