Message-ID: <49CBB989.2030608@goop.org>
Date: Thu, 26 Mar 2009 10:21:13 -0700
From: Jeremy Fitzhardinge
Subject: Re: tlb_gather_mmu() and semantics of "fullmm"
References: <1238043674.25062.823.camel@pasglop>
To: Hugh Dickins
Cc: Benjamin Herrenschmidt, linux-mm@kvack.org, Linus Torvalds,
    Andrew Morton, Nick Piggin, "David S. Miller", Zach Amsden,
    Alok Kataria

Hugh Dickins wrote:
> On Thu, 26 Mar 2009, Benjamin Herrenschmidt wrote:
>
>> I'd like to clarify something about the semantics of the "full_mm_flush"
>> argument of tlb_gather_mmu().
>>
>> The reason is that it can either mean:
>>
>>  - All the mappings for that mm are being flushed
>>
>> or
>>
>>  - The above +plus+ the mm is dead and has no remaining user. IE, we
>> can relax some of the rules because we know the mappings cannot be
>> accessed concurrently, and thus the PTEs cannot be reloaded into the
>> TLB.
>>
>
> No remaining user in the sense of no longer connected to any user task,
> but may still be active_mm on some cpus.
>

Right.

>> If it means the latter (which it does in practice today, since we only
>> call it from exit_mmap(), unless I missed an important detail), then I
>> could implement some optimisations in my own arch code, but more
>>
>
> Yes, I'm pretty sure you can assume the latter. The whole point
> of the "full mm" stuff (would have better been named "exit mm") is
> to allow optimizations, and I don't see what optimization there is to
> be made from knowing you're going the whole length of the mm; whereas
> optimizations can be made if you know nothing can happen in parallel.
>
> Cc'ed DaveM who introduced it for sparc64, and Zach and Jeremy
> who have delved there, in case they wish to disagree.
>

Yes. The specific optimisation is that we don't need to worry about
racing with anyone when fetching the A/D bits, so we can avoid using
expensive atomic instructions.

>> importantly, I believe we might also be able to optimize the generic
>> (and x86) code to avoid flushing the TLB when the batch of pages fills
>> up, before freeing the pages.
>>
>
> I'd be surprised if there are still such optimizations to be made:
> maybe a whole different strategy could be more efficient, but I'd be
> surprised if there's really a superfluous TLB flush to be tweaked away.
>

Perhaps, but I think in some cases we're over-eager with tlb flushes.
Often the thing we want to achieve is "we need a tlb flush before this
vaddr is remapped", not "we need a tlb flush now"; any other incidental
tlb flush would be enough to get the desired outcome. This may not be
an issue for process-related flushes, but I'm thinking about things
like vmap.

> Although it looks as if there's a TLB flush at the end of every batch,
> isn't that deceptive (on x86 anyway)? I'm thinking that the first
> flush_tlb_mm() will end up calling leave_mm(), and the subsequent
> ones do nothing because the cpu_vm_mask is then empty.
>

x86 tends to flush either single pages or everything, though the CPA
code has its own tlb flush machinery to allow batched cross-cpu range
flushing. Given that, there doesn't seem to be a lot for the tlb
gathering machinery to do (especially not on process destruction).

> Hmm, but the cpu which is actually doing the flush_tlb_mm() calls
> leave_mm() without considering cpu_vm_mask: won't we get repeated
> unnecessary load_cr3(swapper_pg_dir)s from that?
>

Yes, though it would mean clearing the current cpu from cpu_vm_mask,
even though the mm is currently active.
It would mean that we would be strictly defining the cpu_vm_mask to
mean "cpus which may have stale usermode tlb entries". But even then,
could we guarantee that the current cpu won't pick up stray entries due
to speculation, etc? Still, repeatedly stomping the current cpu's tlb
does seem like overkill...

For x86, at least, it would seem that the best strategy is to switch to
init_mm before doing anything (including other cpus which may be lazily
still pointing at the mm), then just tear the whole thing down without
any subsequent flushing at all. The cost of doing a one-off cross-cpu
mm switch is going to be about the same as a single cross-cpu tlb
flush, and certainly much better than repeated ones.

Also, why do we bother with zeroing out all the ptes if we're just
about to free the pages anyway? zap_pte_range seems to do too much work
for the "full_mm" case.

>> That would have the side effect of speeding up exit of large processes
>> by limiting the number of tlb flushes they do. Since the TLB would need
>> to be flushed only once at the end for archs that may carry more than
>> one context in their TLB, and possibly not at all on x86 since it
>> doesn't and the context isn't active any more.
>>
>
> It's tempting to think that even that one TLB flush is one too many,
> given that the next user task to run on any cpu will have to load %cr3
> for its own address space.
>
> But I think that leaves a danger from speculative TLB loads by kernel
> threads, after the pagetables of the original mm have got freed and
> reused for something else: I think they would at least need to remain
> good pagetables until the last cpu's TLB has been flushed.
>

Yes, I think the kernel goes to a fair amount of effort to make sure
that the tlb is flushed before freeing pages, though I can't remember
why (I seem to remember the Intel people doing the work, and it was
some kind of architectural issue).
I remember it was one of the problems with the old quicklist-based
pagetable allocation.

And as I discovered last week, the x86 get_user_pages_fast() makes use
of the tlb flush in a rather obscure way. When it is rampaging around
in some process's pagetable, it disables interrupts so that if some
other CPU starts freeing the pagetable it gets caught up waiting for
the IPI to be handled (which causes us some heartburn because our
cross-cpu tlb flushes don't send IPIs).

    J
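
A minimal user-space sketch of the A/D-bit point made earlier in this
message, for readers who want to see the shape of the optimisation. It
is a model only, not kernel code: model_pte_t and model_get_and_clear()
are hypothetical names, and the "fullmm" flag simply stands for "nothing
else can touch this entry concurrently". The bit positions are the x86
accessed/dirty bits. I believe the kernel's ptep_get_and_clear_full()
plays roughly this role, but treat that correspondence as an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define PTE_ACCESSED ((pteval_t)1 << 5) /* x86 accessed (A) bit */
#define PTE_DIRTY    ((pteval_t)1 << 6) /* x86 dirty (D) bit */

/* Hypothetical stand-in for a page table entry. */
typedef struct {
    _Atomic pteval_t val;
} model_pte_t;

/*
 * Fetch the old entry (including its A/D bits) and clear it.
 *
 * fullmm == false: another CPU (or the hardware walker) may set A/D
 * concurrently, so read and clear must be one atomic exchange or an
 * update could be lost between the two steps.
 *
 * fullmm == true: assumed to mean nobody else can modify the entry,
 * so a plain load followed by a plain store is sufficient.
 */
static pteval_t model_get_and_clear(model_pte_t *pte, bool fullmm)
{
    if (fullmm) {
        pteval_t old = atomic_load_explicit(&pte->val,
                                            memory_order_relaxed);
        atomic_store_explicit(&pte->val, 0, memory_order_relaxed);
        return old;
    }
    return atomic_exchange(&pte->val, 0);
}
```

Both paths return the same value when there is no concurrent writer;
the difference is only that the fullmm path avoids the locked
read-modify-write instruction.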