* Re: tlb_gather_mmu() and semantics of "fullmm"
From: Linus Torvalds @ 2009-03-26 16:38 UTC
To: Hugh Dickins
Cc: Benjamin Herrenschmidt, linux-mm, Andrew Morton, Nick Piggin,
David S. Miller, Zach Amsden, Jeremy Fitzhardinge
On Thu, 26 Mar 2009, Hugh Dickins wrote:
> On Thu, 26 Mar 2009, Benjamin Herrenschmidt wrote:
> >
> > I'd like to clarify something about the semantics of the "full_mm_flush"
> > argument of tlb_gather_mmu().
> >
> > The reason is that it can either mean:
> >
> > - All the mappings for that mm are being flushed
> >
> > or
> >
> > - The above +plus+ the mm is dead and has no remaining user. IE, we
> > can relax some of the rules because we know the mappings cannot be
> > accessed concurrently, and thus the PTEs cannot be reloaded into the
> > TLB.
>
> No remaining user in the sense of no longer connected to any user task,
> but may still be active_mm on some cpus.
Side note: this means that CPU's that do speculative TLB fills may still
touch the user entries. They won't _care_ about what they get, though. So
you should be able to do any optimizations you want, as long as they don't
cause machine checks or similar (ie another CPU doing a speculative access
and then being really unhappy about a totally invalid page table entry).
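For reference, the fullmm caller under discussion is exit_mmap(); a
simplified sketch of the 2.6.29-era call pattern (accounting and some
steps omitted, so treat the details as approximate):

	/*
	 * fullmm == 1 tells the batch code the whole mm is going away.
	 */
	void exit_mmap(struct mm_struct *mm)
	{
		struct mmu_gather *tlb;
		struct vm_area_struct *vma = mm->mmap;
		unsigned long nr_accounted = 0;
		unsigned long end;

		lru_add_drain();
		flush_cache_mm(mm);
		tlb = tlb_gather_mmu(mm, 1);		/* fullmm */
		end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
		free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
		tlb_finish_mmu(tlb, 0, end);		/* final flush + frees */
	}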
> Although it looks as if there's a TLB flush at the end of every batch,
> isn't that deceptive (on x86 anyway)?
You need to. Again. Even on that CPU the TLB may have gotten re-loaded
speculatively, even if nothing _meant_ to touch user pages.
So you can't just flush the TLB once, and then expect that since you
flushed it, and nothing else accessed those user addresses, you don't need
to flush it again.
And doing things the other way around - only flushing once at the end - is
incorrect because the whole point is that we can only free the page
directory once we've flushed all the translations that used it. So we need
to flush before the real release, and we need to flush after we've
unmapped everything. Thus the repeated flushes.
It shouldn't be that costly, since kernel mappings should be marked
global.
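Concretely, the per-batch logic in the generic code does the flush
before the free; a sketch along the lines of the 2.6.29-era
asm-generic/tlb.h (simplified):

	/*
	 * The TLB flush must happen before the gathered pages go back
	 * to the allocator; fast mode skips the batching entirely.
	 */
	static inline void
	tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
	{
		if (!tlb->need_flush)
			return;
		tlb->need_flush = 0;
		tlb_flush(tlb);				/* flush first... */
		if (!tlb_fast_mode(tlb)) {
			/* ...only then free the batched pages */
			free_pages_and_swap_cache(tlb->pages, tlb->nr);
			tlb->nr = 0;
		}
	}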
> I'm thinking that the first flush_tlb_mm() will end up calling
> leave_mm(), and the subsequent ones do nothing because the cpu_vm_mask
> is then empty.
The subsequent ones shouldn't need to do anything on _other_ CPU's,
because the other CPU's will have changed their active_mm to NULL, and no
longer use that VM at all. The unmapping process still uses the old VM in
the general case.
(The "do_exit()" case is special, and in that case we should not need to
do any of this at all, but on x86 doing different paths depending on the
"full" bit is unlikely to be worth it - it shouldn't be all that
noticeable. You could _try_, though).
> Hmm, but the cpu which is actually doing the flush_tlb_mm() calls
> leave_mm() without considering cpu_vm_mask: won't we get repeated
> unnecessary load_cr3(swapper_pg_dir)s from that?
Yes, but see above: it's necessary for the non-full case, and I doubt it
matters much for the full case.
But nobody has done timings as far as I know.
Linus
* Re: tlb_gather_mmu() and semantics of "fullmm"
From: Benjamin Herrenschmidt @ 2009-03-26 23:13 UTC
To: Linus Torvalds
Cc: Hugh Dickins, linux-mm, Andrew Morton, Nick Piggin,
David S. Miller, Zach Amsden, Jeremy Fitzhardinge
> Side note: this means that CPU's that do speculative TLB fills may still
> touch the user entries.
Ok. That's what I wasn't sure of. Fortunately it's not the case on
SW-loaded TLBs, so I may still do some optimisations on these guys.
> They won't _care_ about what they get, though. So
> you should be able to do any optimizations you want, as long as they don't
> cause machine checks or similar (ie another CPU doing a speculative access
> and then being really unhappy about a totally invalid page table entry).
Right.
> > Although it looks as if there's a TLB flush at the end of every batch,
> > isn't that deceptive (on x86 anyway)?
>
> You need to. Again. Even on that CPU the TLB may have gotten re-loaded
> speculatively, even if nothing _meant_ to touch user pages.
>
> So you can't just flush the TLB once, and then expect that since you
> flushed it, and nothing else accessed those user addresses, you don't need
> to flush it again.
>
> And doing things the other way around - only flushing once at the end - is
> incorrect because the whole point is that we can only free the page
> directory once we've flushed all the translations that used it. So we need
> to flush before the real release, and we need to flush after we've
> unmapped everything. Thus the repeated flushes.
>
> It shouldn't be that costly, since kernel mappings should be marked
> global.
I was talking about the freeing of the individual pages, not the page
tables per se, but yes, I see that the problem is there too.

I'll do some experiments on embedded stuff here and see if it's worth
doing things differently; typically I'm trying to avoid too many IPIs.
The problem with our TLBs is that they cache multiple contexts, and so
they may still hold translations for contexts not currently active,
-but- we really don't need to do heavy synchronisation to flush those.
Cheers,
Ben.
* Re: tlb_gather_mmu() and semantics of "fullmm"
From: Jeremy Fitzhardinge @ 2009-03-26 17:21 UTC
To: Hugh Dickins
Cc: Benjamin Herrenschmidt, linux-mm, Linus Torvalds, Andrew Morton,
Nick Piggin, David S. Miller, Zach Amsden, Alok Kataria
Hugh Dickins wrote:
> On Thu, 26 Mar 2009, Benjamin Herrenschmidt wrote:
>
>> I'd like to clarify something about the semantics of the "full_mm_flush"
>> argument of tlb_gather_mmu().
>>
>> The reason is that it can either mean:
>>
>> - All the mappings for that mm are being flushed
>>
>> or
>>
>> - The above +plus+ the mm is dead and has no remaining user. IE, we
>> can relax some of the rules because we know the mappings cannot be
>> accessed concurrently, and thus the PTEs cannot be reloaded into the
>> TLB.
>>
>
> No remaining user in the sense of no longer connected to any user task,
> but may still be active_mm on some cpus.
>
Right.
>> If it means the latter (which it does in practice today, since we only
>> call it from exit_mmap(), unless I missed an important detail), then I
>> could implement some optimisations in my own arch code, but more
>>
>
> Yes, I'm pretty sure you can assume the latter. The whole point
> of the "full mm" stuff (would have better been named "exit mm") is
> to allow optimizations, and I don't see what optimization there is to
> be made from knowing you're going the whole length of the mm; whereas
> optimizations can be made if you know nothing can happen in parallel.
>
> Cc'ed DaveM who introduced it for sparc64, and Zach and Jeremy
> who have delved there, in case they wish to disagree.
>
Yes. The specific optimisation is that we don't need to worry about
racing with anyone when fetching the A/D bits, so we can avoid using
expensive atomic instructions.
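The hook for that is the "full" variant in the generic pgtable code; a
sketch of the asm-generic default (an arch that knows full != 0 means
no concurrent users can override this with a cheaper non-atomic
read-and-clear):

	#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
	static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
						    unsigned long address,
						    pte_t *ptep, int full)
	{
		/* default: just fall back to the atomic version */
		return ptep_get_and_clear(mm, address, ptep);
	}
	#endif

zap_pte_range() passes tlb->fullmm as the "full" argument, so the A/D
bits can be fetched without atomics on the exit path.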
>> importantly, I believe we might also be able to optimize the generic
>> (and x86) code to avoid flushing the TLB when the batch of pages fills
>> up, before freeing the pages.
>>
>
> I'd be surprised if there are still such optimizations to be made:
> maybe a whole different strategy could be more efficient, but I'd be
> surprised if there's really a superfluous TLB flush to be tweaked away.
>
Perhaps, but I think in some cases we're over-eager with tlb flushes.
Often the thing we want to achieve is "we need a tlb flush before this
vaddr is remapped", not "we need a tlb flush now"; any other incidental
tlb flush would be enough to get the desired outcome. This may not be an
issue for process-related flushes, but I'm thinking about things like vmap.
> Although it looks as if there's a TLB flush at the end of every batch,
> isn't that deceptive (on x86 anyway)? I'm thinking that the first
> flush_tlb_mm() will end up calling leave_mm(), and the subsequent
> ones do nothing because the cpu_vm_mask is then empty.
>
x86 tends to flush either single pages or everything, though the CPA
code has its own tlb flush machinery to allow batched cross-cpu range
flushing. Given that, there doesn't seem to be a lot for the tlb
gathering machinery to do (especially not on process destruction).
> Hmm, but the cpu which is actually doing the flush_tlb_mm() calls
> leave_mm() without considering cpu_vm_mask: won't we get repeated
> unnecessary load_cr3(swapper_pg_dir)s from that?
>
Yes, though it would mean clearing the current cpu from cpu_vm_mask,
even though the mm is currently active. It would mean that we would be
strictly defining the cpu_vm_mask to mean "cpus which may have stale
usermode tlb entries". But even then, could we guarantee that the
current cpu won't pick up stray entries due to speculation, etc? Still,
repeatedly stomping the current cpu's tlb does seem like overkill...
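For reference, leave_mm() amounts to roughly this (sketched from memory
of the 2.6.29-era x86 code, so treat the exact accessors as approximate):

	/*
	 * Drop this cpu from the mm's cpu_vm_mask and point cr3 at the
	 * kernel page tables; the cr3 load flushes the non-global
	 * (user) TLB entries, and the cleared mask bit means no more
	 * flush IPIs are sent to this cpu for that mm.
	 */
	void leave_mm(int cpu)
	{
		BUG_ON(x86_read_percpu(cpu_tlbstate.state) == TLBSTATE_OK);
		cpu_clear(cpu, x86_read_percpu(cpu_tlbstate.active_mm)->cpu_vm_mask);
		load_cr3(swapper_pg_dir);
	}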
For x86, at least, it would seem that the best strategy is to switch to
init_mm before doing anything (including other cpus which may be lazily
still pointing at the mm), then just tear the whole thing down without
any subsequent flushing at all. The cost of doing a one-off cross-cpu
mm switch is going to be about the same as a single cross-cpu tlb
flush, and certainly much better than repeated ones.
Also, why do we bother with zeroing out all the ptes if we're just about
to free the pages anyway? zap_pte_range seems to do too much work for
the "full_mm" case.
>> That would have the side effect of speeding up exit of large processes
>> by limiting the number of tlb flushes they do. Since the TLB would need
>> to be flushed only once at the end for archs that may carry more than
>> one context in their TLB, and possibly not at all on x86 since it
>> doesn't and the context isn't active any more.
>>
>
> It's tempting to think that even that one TLB flush is one too many,
> given that the next user task to run on any cpu will have to load %cr3
> for its own address space.
>
> But I think that leaves a danger from speculative TLB loads by kernel
> threads, after the pagetables of the original mm have got freed and
> reused for something else: I think they would at least need to remain
> good pagetables until the last cpu's TLB has been flushed.
>
Yes, I think the kernel goes to a fair amount of effort to make sure
that the tlb is flushed before freeing pages, though I can't remember
why (I seem to remember the Intel people doing the work, and it was some
kind of architectural issue). I remember it was one of the problems with
the old quicklist-based pagetable allocation.
And as I discovered last week, the x86 get_user_pages_fast() makes use
of the tlb flush in a rather obscure way. When it is rampaging around in
some process's pagetable, it disables interrupts so that if some other
CPU starts freeing the pagetable it gets caught up waiting for the IPI
to be handled (which causes us some heartburn because our cross-cpu tlb
flushes don't send IPIs).
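The pattern in question, sketched (and simplified) from the 2.6.29-era
arch/x86/mm/gup.c rather than quoted:

	/*
	 * With local interrupts off, this cpu can't ack the TLB-flush
	 * IPI; a remote cpu must wait for that ack before freeing page
	 * tables, so the tables being walked here can't vanish.
	 * (gup_pud_range() is an internal helper in that file.)
	 */
	int get_user_pages_fast(unsigned long start, int nr_pages, int write,
				struct page **pages)
	{
		struct mm_struct *mm = current->mm;
		unsigned long addr = start & PAGE_MASK;
		unsigned long end = addr + ((unsigned long)nr_pages << PAGE_SHIFT);
		unsigned long next, flags;
		pgd_t *pgdp;
		int nr = 0;

		local_irq_save(flags);		/* blocks the flush IPI... */
		pgdp = pgd_offset(mm, addr);
		do {
			pgd_t pgd = *pgdp;

			next = pgd_addr_end(addr, end);
			if (pgd_none(pgd))
				break;
			if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
				break;
		} while (pgdp++, addr = next, addr != end);
		local_irq_restore(flags);	/* ...freeing may proceed */

		return nr;
	}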
J
* Re: tlb_gather_mmu() and semantics of "fullmm"
From: David Miller @ 2009-03-26 20:39 UTC
To: hugh; +Cc: benh, linux-mm, torvalds, akpm, npiggin, zach, jeremy
From: Hugh Dickins <hugh@veritas.com>
Date: Thu, 26 Mar 2009 14:08:17 +0000 (GMT)
> On Thu, 26 Mar 2009, Benjamin Herrenschmidt wrote:
> > If it means the latter (which it does in practice today, since we only
> > call it from exit_mmap(), unless I missed an important detail), then I
> > could implement some optimisations in my own arch code, but more
>
> Yes, I'm pretty sure you can assume the latter. The whole point
> of the "full mm" stuff (would have better been named "exit mm") is
> to allow optimizations, and I don't see what optimization there is to
> be made from knowing you're going the whole length of the mm; whereas
> optimizations can be made if you know nothing can happen in parallel.
>
> Cc'ed DaveM who introduced it for sparc64, and Zach and Jeremy
> who have delved there, in case they wish to disagree.
The TLBs on sparc64 have a "context flush" which removes every entry
matching the current MMU context. This is what flush_tlb_mm() does.
So we use tlb->fullmm so that the individual page and range TLB
flushes do nothing, and instead we do a flush_tlb_mm() before we walk
through the address space to tear it down.
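In other words (sketching the shape of it rather than quoting the
source, and with a made-up function name):

	/*
	 * Illustrative only: one context flush up front, after which
	 * the per-page/per-range flush hooks are no-ops because
	 * tlb->fullmm is set.
	 */
	static void teardown_fullmm(struct mm_struct *mm, struct vm_area_struct *vma)
	{
		struct mmu_gather *tlb;
		unsigned long nr_accounted = 0, end;

		tlb = tlb_gather_mmu(mm, 1);	/* sets tlb->fullmm */
		flush_tlb_mm(mm);		/* one context flush, whole mm */
		end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
		free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
		tlb_finish_mmu(tlb, 0, end);
	}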
* Re: tlb_gather_mmu() and semantics of "fullmm"
From: Benjamin Herrenschmidt @ 2009-03-26 22:33 UTC
To: Hugh Dickins
Cc: linux-mm, Linus Torvalds, Andrew Morton, Nick Piggin,
David S. Miller, Zach Amsden, Jeremy Fitzhardinge
> No remaining user in the sense of no longer connected to any user task,
> but may still be active_mm on some cpus.
Right, I see, and Linus's point about speculative TLB activity stands
here, though I suspect that is a non-issue on SW-loaded TLB processors,
for example...
I wonder how often we are in this situation and whether we could
optimize for the case when fullmm && mm_count == 1...
> I'd be surprised if there are still such optimizations to be made:
> maybe a whole different strategy could be more efficient, but I'd be
> surprised if there's really a superfluous TLB flush to be tweaked away.
>
> Although it looks as if there's a TLB flush at the end of every batch,
> isn't that deceptive (on x86 anyway)? I'm thinking that the first
> flush_tlb_mm() will end up calling leave_mm(), and the subsequent
> ones do nothing because the cpu_vm_mask is then empty.
Ok, well, that's a bit different on other archs like powerpc where we virtually
never remove bits from cpu_vm_mask... (though we probably could... to be looked
at).
> Hmm, but the cpu which is actually doing the flush_tlb_mm() calls
> leave_mm() without considering cpu_vm_mask: won't we get repeated
> unnecessary load_cr3(swapper_pg_dir)s from that?
That's x86 voodoo that I'll leave to you guys :-)
> It's tempting to think that even that one TLB flush is one too many,
> given that the next user task to run on any cpu will have to load %cr3
> for its own address space.
But we can't free the pages until we have flushed the TLB.
> But I think that leaves a danger from speculative TLB loads by kernel
> threads, after the pagetables of the original mm have got freed and
> reused for something else: I think they would at least need to remain
> good pagetables until the last cpu's TLB has been flushed.
Page tables being good is a separate problem. Pages themselves can't be
freed while a TLB potentially points to them, we agree on that.
> I suspect so, but please don't take my word for it: you've
> probably put more thought into asking than I have in answering.
Well, I'm thinking there may be ways to improve things a little bit but
that's no big deal right now.
Mostly the deal with SW-loaded TLBs is that once the TLB has been
flushed, there should be no speculative access to worry about anymore,
and we can switch the batch to 'fast mode' if fullmm is set, because
those CPUs (at least the ones I'm working with) can't take TLB miss
interrupts as a result of a speculative access.
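Roughly, the shape I have in mind (an untested sketch against the
generic batch code, where tlb->nr == ~0U means fast mode):

	static inline struct mmu_gather *
	tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
	{
		struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);

		tlb->mm = mm;
		tlb->fullmm = full_mm_flush;

		if (full_mm_flush) {
			flush_tlb_mm(mm);	/* one up-front flush... */
			tlb->nr = ~0U;		/* ...then fast mode: free
						 * pages immediately, no
						 * batching */
		} else {
			tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;
		}
		return tlb;
	}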
Cheers,
Ben.
* Re: tlb_gather_mmu() and semantics of "fullmm"
From: David Miller @ 2009-03-27 5:04 UTC
To: benh; +Cc: hugh, linux-mm, torvalds, akpm, npiggin, zach, jeremy
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Fri, 27 Mar 2009 09:33:44 +1100
> > I'd be surprised if there are still such optimizations to be made:
> > maybe a whole different strategy could be more efficient, but I'd be
> > surprised if there's really a superfluous TLB flush to be tweaked away.
> >
> > Although it looks as if there's a TLB flush at the end of every batch,
> > isn't that deceptive (on x86 anyway)? I'm thinking that the first
> > flush_tlb_mm() will end up calling leave_mm(), and the subsequent
> > ones do nothing because the cpu_vm_mask is then empty.
>
> Ok, well, that's a bit different on other archs like powerpc where we virtually
> never remove bits from cpu_vm_mask... (though we probably could... to be looked
> at).
We do this on sparc64 when mm->mm_users == 1 and 'mm' is
current->active_mm.
See arch/sparc/kernel/smp_64.c:smp_flush_tlb_pending(), where we go:

	if (mm == current->active_mm && atomic_read(&mm->mm_users) == 1)
		mm->cpu_vm_mask = cpumask_of_cpu(cpu);
	else
		smp_cross_call_masked(&xcall_flush_tlb_pending,
				      ctx, nr, (unsigned long) vaddrs,
				      &mm->cpu_vm_mask);

	__flush_tlb_pending(ctx, nr, vaddrs);
* Re: tlb_gather_mmu() and semantics of "fullmm"
From: Benjamin Herrenschmidt @ 2009-03-27 5:38 UTC
To: David Miller; +Cc: hugh, linux-mm, torvalds, akpm, npiggin, zach, jeremy
On Thu, 2009-03-26 at 22:04 -0700, David Miller wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Fri, 27 Mar 2009 09:33:44 +1100
>
> > > I'd be surprised if there are still such optimizations to be made:
> > > maybe a whole different strategy could be more efficient, but I'd be
> > > surprised if there's really a superfluous TLB flush to be tweaked away.
> > >
> > > Although it looks as if there's a TLB flush at the end of every batch,
> > > isn't that deceptive (on x86 anyway)? I'm thinking that the first
> > > flush_tlb_mm() will end up calling leave_mm(), and the subsequent
> > > ones do nothing because the cpu_vm_mask is then empty.
> >
> > Ok, well, that's a bit different on other archs like powerpc where we virtually
> > never remove bits from cpu_vm_mask... (though we probably could... to be looked
> > at).
>
> We do this on sparc64 when mm->mm_users == 1 and 'mm' is
> current->active_mm.
That doesn't sound right ... mm_users seems to represent how many tasks
have task->mm set to this mm, but not how many processors have it as
the "active_mm" due to lazy switching.

If you look at context_switch() in kernel/sched.c, it increments
mm_count when using the previous guy's mm as the "active_mm" of a kernel
thread, not mm_users.

So effectively, mm_users can be any value; it doesn't represent how
many processors can have the mm currently active on them.

You could have mm_users be 1 due to the mm being active and in userspace
on another CPU, and have it locally be the active_mm because your local
CPU is in keventd or similar, flushing the other guy's mm as a result
of some unmap_mapping_range() call due to a network filesystem doing
coherency stuff, for example.
Cheers,
Ben.
> See arch/sparc/kernel/smp_64.c:smp_flush_tlb_pending(), where we go:
>
> 	if (mm == current->active_mm && atomic_read(&mm->mm_users) == 1)
> 		mm->cpu_vm_mask = cpumask_of_cpu(cpu);
> 	else
> 		smp_cross_call_masked(&xcall_flush_tlb_pending,
> 				      ctx, nr, (unsigned long) vaddrs,
> 				      &mm->cpu_vm_mask);
>
> 	__flush_tlb_pending(ctx, nr, vaddrs);
* Re: tlb_gather_mmu() and semantics of "fullmm"
From: David Miller @ 2009-03-27 5:44 UTC
To: benh; +Cc: hugh, linux-mm, torvalds, akpm, npiggin, zach, jeremy
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Fri, 27 Mar 2009 16:38:07 +1100
> If you look at context_switch() in kernel/sched.c, it increments
> mm_count when using the previous guy's mm as the "active_mm" of a kernel
> thread, not mm_users.
Yawn...
arch/sparc/include/asm/mmu_context_64.h:

static inline void switch_mm(struct mm_struct *old_mm, struct mm_struct *mm, struct task_struct *tsk)
{
...
	spin_lock_irqsave(&mm->context.lock, flags);
	ctx_valid = CTX_VALID(mm->context);
	if (!ctx_valid)
		get_new_mmu_context(mm);
...
	cpu = smp_processor_id();
	if (!ctx_valid || !cpu_isset(cpu, mm->cpu_vm_mask)) {
		cpu_set(cpu, mm->cpu_vm_mask);
		__flush_tlb_mm(CTX_HWBITS(mm->context),
			       SECONDARY_CONTEXT);
	}
	spin_unlock_irqrestore(&mm->context.lock, flags);
...
We unconditionally check if the CPU is set in the mask, even
when the mm isn't changing.
* Re: tlb_gather_mmu() and semantics of "fullmm"
From: Benjamin Herrenschmidt @ 2009-03-27 5:54 UTC
To: David Miller; +Cc: hugh, linux-mm, torvalds, akpm, npiggin, zach, jeremy
On Thu, 2009-03-26 at 22:44 -0700, David Miller wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Fri, 27 Mar 2009 16:38:07 +1100
>
> > If you look at context_switch() in kernel/sched.c, it increments
> > mm_count when using the previous guy's mm as the "active_mm" of a kernel
> > thread, not mm_users.
>
> Yawn...
Yeah it's late over there :-)
> We unconditionally check if the CPU is set in the mask, even
> when the mm isn't changing.
Ok, so you do lazy flushing at context switch time, which is nice,
but I'm still wondering if the code you showed is right. Feel free
to reply tomorrow after a good night of sleep though :-)
The scenario I have in mind is as follows:
CPU 0 is running the context, task->mm == task->active_mm == your
context. The CPU is in userspace happily churning things.
CPU 1 used to run it, not anymore, it's now running fancyfsd which
is a kernel thread, but current->active_mm still points to that
same context.
Because there's only one "real" user, mm_users is 1 (but mm_count is
elevated; it's just that the presence on CPU 1 as active_mm has no
effect on mm_users).

At this point, fancyfsd decides to invalidate a mapping currently mapped
by that context, for example because a networked file has changed
remotely or something like that, using unmap_mapping_range().
So CPU 1 goes into the zapping code, which eventually ends up calling
flush_tlb_pending(). Your test will succeed, as current->active_mm is
indeed the target mm for the flush, and mm_users is indeed 1. So you
will -not- send an IPI to the other CPU, and CPU 0 will continue happily
accessing the pages that should have been unmapped.
Or did I miss something?
Cheers,
Ben.
* Re: tlb_gather_mmu() and semantics of "fullmm"
From: David Miller @ 2009-03-27 5:57 UTC
To: benh; +Cc: hugh, linux-mm, torvalds, akpm, npiggin, zach, jeremy
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Fri, 27 Mar 2009 16:54:27 +1100
> CPU 0 is running the context, task->mm == task->active_mm == your
> context. The CPU is in userspace happily churning things.
>
> CPU 1 used to run it, not anymore, it's now running fancyfsd which
> is a kernel thread, but current->active_mm still points to that
> same context.
>
> Because there's only one "real" user, mm_users is 1 (but mm_count is
> elevated; it's just that the presence on CPU 1 as active_mm has no
> effect on mm_users).
>
> At this point, fancyfsd decides to invalidate a mapping currently mapped
> by that context, for example because a networked file has changed
> remotely or something like that, using unmap_mapping_range().
>
> So CPU 1 goes into the zapping code, which eventually ends up calling
> flush_tlb_pending(). Your test will succeed, as current->active_mm is
> indeed the target mm for the flush, and mm_users is indeed 1. So you
> will -not- send an IPI to the other CPU, and CPU 0 will continue happily
> accessing the pages that should have been unmapped.
>
> Or did I miss something?
Good point.
Maybe it would work out correctly if I used current->mm?
Because if I tested it that way, only something really executing
in userland could cause the cpumask bits to be cleared.
Any kernel thread would flush the TLB if and when it switched
back into a real task using that mm.
Sound good?
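Against the snippet quoted earlier, the change would look something
like this (sketch):

	/*
	 * Test current->mm rather than current->active_mm, so a kernel
	 * thread that merely borrowed the mm lazily never skips the
	 * cross call.
	 */
	if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
		mm->cpu_vm_mask = cpumask_of_cpu(cpu);
	else
		smp_cross_call_masked(&xcall_flush_tlb_pending,
				      ctx, nr, (unsigned long) vaddrs,
				      &mm->cpu_vm_mask);

	__flush_tlb_pending(ctx, nr, vaddrs);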
* Re: tlb_gather_mmu() and semantics of "fullmm"
From: Benjamin Herrenschmidt @ 2009-03-27 6:10 UTC
To: David Miller; +Cc: hugh, linux-mm, torvalds, akpm, npiggin, zach, jeremy
On Thu, 2009-03-26 at 22:57 -0700, David Miller wrote:
> Good point.
>
> Maybe it would work out correctly if I used current->mm?
>
> Because if I tested it that way, only something really executing
> in userland could cause the cpumask bits to be cleared.
>
> Any kernel thread would flush the TLB if and when it switched
> back into a real task using that mm.
>
> Sound good?
/me thinks (not as late here but I'm getting tired regardless ;-)
So if you test current->mm, you effectively account for mm_users == 1,
so the only way the mm can be active on another processor is as a lazy
mm for a kernel thread. So your test should work properly as long as
you don't have HW that will do speculative reloads into the TLB on that
other CPU (and even if you do, your flush-on-switch-in should get rid
of any crap here).
Ben.
* Re: tlb_gather_mmu() and semantics of "fullmm"
From: David Miller @ 2009-03-27 8:05 UTC
To: benh; +Cc: hugh, linux-mm, torvalds, akpm, npiggin, jeremy
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Fri, 27 Mar 2009 17:10:35 +1100
[ zach@vmware.com removed from CC:, it bounces... ]
> So if you test current->mm, you effectively account for mm_users == 1,
> so the only way the mm can be active on another processor is as a lazy
> mm for a kernel thread. So your test should work properly as long as
> you don't have HW that will do speculative reloads into the TLB on that
> other CPU (and even if you do, your flush-on-switch-in should get rid
> of any crap here).
It seems that way. I'll make this fix, thanks Ben!