PTE access rules & abstraction

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* PTE access rules & abstraction
@ 2008-09-19 17:42 Benjamin Herrenschmidt
  2008-09-22  6:22 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-09-19 17:42 UTC (permalink / raw)
  To: Linux Memory Management List; +Cc: Linux Kernel list, Nick Piggin, Hugh Dickins

Hi !

Just yesterday, I was browsing through the users of set_pte_at() to
check something, and stumbled on a (new ?) bug that will introduce
subtle problems on at least powerpc and s390.

No big deal, I'll send a fix, but I'm becoming concerned with how
fragile our page table & PTE access has become.

(The bug btw is that we ptep_get_and_clear followed by a set_pte_at, at
least on those architectures, you -must- flush before you put something
new after you have cleared a PTE, I'll have to fixup our implementation
of the new pte_modify_start/commit).

With the need of the various virtual machines on x86, we've seen new
page table accessors being created like there is no tomorrow, changes in
the PTEs are accessed that may or may not be things we can rely on being
stable in arch code, etc...

Unfortunately, the arch code often has a very intimate relationship to
how page tables are handled. The rules for locking, what can and cannot
be done within a single PTE lock section, what can or cannot be done on
a PTE, for example after it's been cleared, etc... vary in subtle ways
and the way the things are today, the situation is very messy and
fragile.

Maybe it's time to have one head in "charge" of the page table access to
try to keep some sanity, maybe it's time to write down some rules (for
example, can we rely now and forever that set_pte_at() will -never- be
called to write on top of an already valid PTE ?, etc...).

But maybe it's time to try to move the abstraction up a bit, maybe along
the lines of what Nick proposed a while ago, some kind of transactional
model. That would give a lot more freedom to architectures to have their
own PTE access rules and optimisations. 

Comments ? Ideas ?

Cheers,
Ben.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-19 17:42 PTE access rules & abstraction Benjamin Herrenschmidt
@ 2008-09-22  6:22 ` Jeremy Fitzhardinge
  2008-09-22 21:05   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 25+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-22  6:22 UTC (permalink / raw)
  To: benh
  Cc: Linux Memory Management List, Linux Kernel list, Nick Piggin,
	Hugh Dickins

Benjamin Herrenschmidt wrote:
> Just yesterday, I was browsing through the users of set_pte_at() to
> check something, and stumbled on a (new ?) bug that will introduce
> subtle problems on at least powerpc and s390.
>
> No big deal, I'll send a fix, but I'm becoming concerned with how
> fragile our page table & PTE access has become.
>
> (The bug btw is that we ptep_get_and_clear followed by a set_pte_at, at
> least on those architectures, you -must- flush before you put something
> new after you have cleared a PTE, I'll have to fixup our implementation
> of the new pte_modify_start/commit).
>   

When I added the ptep_modify_* interface, it occurred to me that
assuming that ptep_get_and_clear would always prevent async pte updates
was a bit optimistic, or at least presumptuous.  And certainly not
flushing the tlb seems like something that just happens to work on x86
(in fact I'm not quite sure how it does work on x86).

I didn't change the function of that code when I made the change, so the
bug was pre-existing; I think it has been like that for quite a while
(though I haven't done any git archaeology to back that up).  So I don't
think there's a new bug here.

What's the consequence of not flushing for you?

> With the need of the various virtual machines on x86, we've seen new
> page table accessors being created like there is no tomorrow, changes in
> the PTEs are accessed that may or may not be things we can rely on being
> stable in arch code, etc...
>   

There have been a few optional-extras calls, which are all fine to leave
as no-ops.  I don't think we've added or changed any must-have interfaces.

> Unfortunately, the arch code often has a very intimate relationship to
> how page tables are handled. The rules for locking, what can and cannot
> be done within a single PTE lock section, what can or cannot be done on
> a PTE, for example after it's been cleared, etc... vary in subtle ways
> and the way the things are today, the situation is very messy and
> fragile.
>   

It seems to me that the rules are roughly "anything goes while you're
holding the pte lock, but it all must be sane by the time you release
it".  But a bit more precision would be useful.

> Maybe it's time to have one head in "charge" of the page table access to
> try to keep some sanity, maybe it's time to write down some rules (for
> example, can we rely now and forever that set_pte_at() will -never- be
> called to write on top of an already valid PTE ?, etc...).
>   

I don't know if there's any such rule regarding set_pte_at().  Certainly
if you were overwriting an existing valid entry, you'd have to arrange
for a tlb flush.

> But maybe it's time to try to move the abstraction up a bit, maybe along
> the lines of what Nick proposed a while ago, some kind of transactional
> model. That would give a lot more freedom to architectures to have their
> own PTE access rules and optimisations. 

Do you have a reference to Nick's proposals?

A higher level interface might give us virtualization people more scope
to play with things.  But given that the current interface mostly works
for everyone, a high level interface would tend to result in a lot of
duplicated code unless you pretty strictly stipluate something like
"implement the highest level interface you need *and no higher*; use
common library code for everything else".

I think documenting the real rules for the current interface would be a
more immediately fruitful path to start with.

    J

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-22  6:22 ` Jeremy Fitzhardinge
@ 2008-09-22 21:05   ` Benjamin Herrenschmidt
  2008-09-23  3:10     ` Nick Piggin
  2008-09-24 18:45     ` Hugh Dickins
  0 siblings, 2 replies; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-09-22 21:05 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Linux Memory Management List, Linux Kernel list, Nick Piggin,
	Hugh Dickins

> I didn't change the function of that code when I made the change, so the
> bug was pre-existing; I think it has been like that for quite a while
> (though I haven't done any git archaeology to back that up).  So I don't
> think there's a new bug here.
> 
> What's the consequence of not flushing for you?

The bug may have been there, as I said, lots of unwritten rules...
sometimes broken. I'm not necessarily blaming you, but there have been
lots of changes to the PTE accessors over the last 2 years and not
always under any control :-)

In our case, the consequence is that the entry can be re-hashed because
the fact that it was already hashed and where it was hashed, which is
encoded in the PTE, gets lost by the clear. That means a potential
duplicate entry in the hash. A hard to hit race, but possible. Such a
condition is architecturally illegal and can cause things ranging from
incorrect translation to machine checks or checkstops (generally, on
LPAR machines, what will happen is your partition will get killed).

I know s390 has different issues & constraints. Martin told me during
Plumbers that mprotect was probably also broken for him.

> There have been a few optional-extras calls, which are all fine to leave
> as no-ops.  I don't think we've added or changed any must-have interfaces.

Again, I'm not necessarily talking about the very latest round of
changes that you did here... More like a general approach as trying to
find a better overall interface to PTE access which currently is a total
mess as far as I'm concerned.

> It seems to me that the rules are roughly "anything goes while you're
> holding the pte lock, but it all must be sane by the time you release
> it".  But a bit more precision would be useful.

Well, unfortunately it's a lot more complex than that. For example, on
powerpc, we have to keep the hash table in sync and hash misses don't
take the PTE lock. They are aking to a non-atomic HW walk of the page
tables if you prefer.

So we must be careful about things like never replacing a PTE without
first flushing the hash table entry for the previous one (if it was
hashed). We also need to ensure we flush before we unlock, but we do
that by re-using the new lazy MMU hooks, as to not insert a new PTE
(which can potentially be hashed).

s390 has more complex sets of rules.

> I don't know if there's any such rule regarding set_pte_at().  Certainly
> if you were overwriting an existing valid entry, you'd have to arrange
> for a tlb flush.

Well, for example, the generic copy-on-write case used to do just that,
ie set_pte_at() over the previous one, then flush. That is not good for
us and thus our set_pte_at() has special code to check whether there's
already a present PTE there and does a synchronous flush if there is.

However, that also changed, and nowadays, afaik, set_pte_at() is never
called anymore to override a already present PTE. 

We are doing some work on 32 bits code that will need similar
constraints and It would be nice if we could rely on the above and make
it part of the rules.

> Do you have a reference to Nick's proposals?

Not off hand, Nick ? How far did you go down that path ?

> A higher level interface might give us virtualization people more scope
> to play with things.  But given that the current interface mostly works
> for everyone, a high level interface would tend to result in a lot of
> duplicated code unless you pretty strictly stipluate something like
> "implement the highest level interface you need *and no higher*; use
> common library code for everything else".
> 
> I think documenting the real rules for the current interface would be a
> more immediately fruitful path to start with.

Well, part of the problem is that with the current interface, some
architectures are really mostly hacking around, and in some case,
possibly losing potential performance benefits by not having a slightly
more abstracted interface.

I don't think we should ditch the page tables in favor of some kind of
abstract memory objects, don't get me wrong :-) But having a more
generic higher level set of transactional interfaces to modifying and
flushing PTEs would probably help. Instead we are doing as-hoc hacks
left and right mostly based on how things work on x86.

Ben.

> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-22 21:05   ` Benjamin Herrenschmidt
@ 2008-09-23  3:10     ` Nick Piggin
  2008-09-23  3:16       ` David Miller, Nick Piggin
  2008-09-23  5:31       ` Benjamin Herrenschmidt
  2008-09-24 18:45     ` Hugh Dickins
  1 sibling, 2 replies; 25+ messages in thread
From: Nick Piggin @ 2008-09-23  3:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jeremy Fitzhardinge, Linux Memory Management List,
	Linux Kernel list, Hugh Dickins

On Tue, Sep 23, 2008 at 07:05:51AM +1000, Benjamin Herrenschmidt wrote:
> 
> > I didn't change the function of that code when I made the change, so the
> > bug was pre-existing; I think it has been like that for quite a while
> > (though I haven't done any git archaeology to back that up).  So I don't
> > think there's a new bug here.
> > 
> > What's the consequence of not flushing for you?
> 
> The bug may have been there, as I said, lots of unwritten rules...
> sometimes broken. I'm not necessarily blaming you, but there have been
> lots of changes to the PTE accessors over the last 2 years and not
> always under any control :-)
> 
> In our case, the consequence is that the entry can be re-hashed because
> the fact that it was already hashed and where it was hashed, which is
> encoded in the PTE, gets lost by the clear. That means a potential
> duplicate entry in the hash. A hard to hit race, but possible. Such a
> condition is architecturally illegal and can cause things ranging from
> incorrect translation to machine checks or checkstops (generally, on
> LPAR machines, what will happen is your partition will get killed).
> 
> I know s390 has different issues & constraints. Martin told me during
> Plumbers that mprotect was probably also broken for him.
> 
> > There have been a few optional-extras calls, which are all fine to leave
> > as no-ops.  I don't think we've added or changed any must-have interfaces.
> 
> Again, I'm not necessarily talking about the very latest round of
> changes that you did here... More like a general approach as trying to
> find a better overall interface to PTE access which currently is a total
> mess as far as I'm concerned.
> 
> > It seems to me that the rules are roughly "anything goes while you're
> > holding the pte lock, but it all must be sane by the time you release
> > it".  But a bit more precision would be useful.
> 
> Well, unfortunately it's a lot more complex than that. For example, on
> powerpc, we have to keep the hash table in sync and hash misses don't
> take the PTE lock. They are aking to a non-atomic HW walk of the page
> tables if you prefer.
> 
> So we must be careful about things like never replacing a PTE without
> first flushing the hash table entry for the previous one (if it was
> hashed). We also need to ensure we flush before we unlock, but we do
> that by re-using the new lazy MMU hooks, as to not insert a new PTE
> (which can potentially be hashed).
> 
> s390 has more complex sets of rules.
> 
> > I don't know if there's any such rule regarding set_pte_at().  Certainly
> > if you were overwriting an existing valid entry, you'd have to arrange
> > for a tlb flush.
> 
> Well, for example, the generic copy-on-write case used to do just that,
> ie set_pte_at() over the previous one, then flush. That is not good for
> us and thus our set_pte_at() has special code to check whether there's
> already a present PTE there and does a synchronous flush if there is.
> 
> However, that also changed, and nowadays, afaik, set_pte_at() is never
> called anymore to override a already present PTE. 
> 
> We are doing some work on 32 bits code that will need similar
> constraints and It would be nice if we could rely on the above and make
> it part of the rules.
> 
> > Do you have a reference to Nick's proposals?
> 
> Not off hand, Nick ? How far did you go down that path ?

I got something far enough to compile and boot (iirc on x86 and maybe
ia64). Basically it was designed with fine grained pte locking in
mind which is why I stopped working on it after the pmd locking from
Hugh, but I imagine it could be used in some cases to do more
intelligent things with pte manipulation.

IIRC, basically you would have a pte_modify_start, which gives back a
cookie and the pte value, and pte_modify_end which takes a new pte
value and the cookie.

This would allow the arch to see exactly how the pte was changed, but
I imagine in some cases you would still prefer to do that with
optimised direct functions rather than test and branches. So I don't
know if it would be a magic cure ;)

Part of the problem with the pte API (as well as the cache flush and
tlb flush APIs) is that it often involves the core mm code telling
the arch how it thinks ptes,tlbs,caches should be managed, rather than
I think the better approach would be telling the arch what it wants to
do.

We are getting better slowly I think (eg. you note that set_pte_at is
no longer used as a generic "do anything"), but I won't dispute that
this whole area could use an overhaul; a document for all the rules,
a single person or point of responsibility for those rules...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-23  3:10     ` Nick Piggin
@ 2008-09-23  3:16       ` David Miller, Nick Piggin
  2008-09-23  5:35         ` Benjamin Herrenschmidt
  2008-09-23  5:31       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 25+ messages in thread
From: David Miller, Nick Piggin @ 2008-09-23  3:16 UTC (permalink / raw)
  To: npiggin; +Cc: benh, jeremy, linux-mm, linux-kernel, hugh

> Part of the problem with the pte API (as well as the cache flush and
> tlb flush APIs) is that it often involves the core mm code telling
> the arch how it thinks ptes,tlbs,caches should be managed, rather than
> I think the better approach would be telling the arch what it wants to
> do.
> 
> We are getting better slowly I think (eg. you note that set_pte_at is
> no longer used as a generic "do anything"), but I won't dispute that
> this whole area could use an overhaul; a document for all the rules,
> a single person or point of responsibility for those rules...

I agree.

To a certain extent this is what BSD does in it's pmap layer, except
that they don't have the page table datastructure abstraction like
Linus does in the generic code, and which I think was a smart design
decision on our side.

All of the pmap modules in BSD are pretty big and duplicate a lot of
code that arch's don't have to be mindful about under Linux.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-23  3:16       ` David Miller, Nick Piggin
@ 2008-09-23  5:35         ` Benjamin Herrenschmidt
  2008-09-23  6:18           ` Nick Piggin
  0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-09-23  5:35 UTC (permalink / raw)
  To: David Miller; +Cc: npiggin, jeremy, linux-mm, linux-kernel, hugh

On Mon, 2008-09-22 at 20:16 -0700, David Miller wrote:
> 
> To a certain extent this is what BSD does in it's pmap layer, except
> that they don't have the page table datastructure abstraction like
> Linus does in the generic code, and which I think was a smart design
> decision on our side.
> 
> All of the pmap modules in BSD are pretty big and duplicate a lot of
> code that arch's don't have to be mindful about under Linux.

I definitely agree, I don't think we want to go away from the page table
as being the abstraction :-) But I'm wondering if we can do a little bit
better with the accessors to those page tables.

BTW. am I the only one to have got one copy of David's reply (that I'm
quoting) coming with a From: Nick Piggin in the headers ? (apparently
coming from kvack).

Cheers,
Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-23  5:35         ` Benjamin Herrenschmidt
@ 2008-09-23  6:18           ` Nick Piggin
  0 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2008-09-23  6:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: David Miller, jeremy, linux-mm, linux-kernel, hugh

On Tue, Sep 23, 2008 at 03:35:06PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2008-09-22 at 20:16 -0700, David Miller wrote:
> > 
> > To a certain extent this is what BSD does in it's pmap layer, except
> > that they don't have the page table datastructure abstraction like
> > Linus does in the generic code, and which I think was a smart design
> > decision on our side.
> > 
> > All of the pmap modules in BSD are pretty big and duplicate a lot of
> > code that arch's don't have to be mindful about under Linux.
> 
> I definitely agree, I don't think we want to go away from the page table
> as being the abstraction :-) But I'm wondering if we can do a little bit
> better with the accessors to those page tables.
> 
> BTW. am I the only one to have got one copy of David's reply (that I'm
> quoting) coming with a From: Nick Piggin in the headers ? (apparently
> coming from kvack).

No. I see that, and so does marc.

http://marc.info/?t=122184627700007&r=1&w=2

And I've only ever seen it from Dave on kvack.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-23  3:10     ` Nick Piggin
  2008-09-23  3:16       ` David Miller, Nick Piggin
@ 2008-09-23  5:31       ` Benjamin Herrenschmidt
  2008-09-23  6:13         ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-09-23  5:31 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jeremy Fitzhardinge, Linux Memory Management List,
	Linux Kernel list, Hugh Dickins

On Tue, 2008-09-23 at 05:10 +0200, Nick Piggin wrote:
> We are getting better slowly I think (eg. you note that set_pte_at is
> no longer used as a generic "do anything"), but I won't dispute that
> this whole area could use an overhaul; a document for all the rules,
> a single person or point of responsibility for those rules...

Can we nowadays -rely- on set_pte_at() never being called to overwrite
an already valid PTE ? I mean, it looks like the generic code doesn't do
it anymore but I wonder if it's reasonable to forbid that from coming
back ? That would allow me to remove some hacks in ppc64 and simplify
some upcoming ppc32 code.

Cheers,
Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-23  5:31       ` Benjamin Herrenschmidt
@ 2008-09-23  6:13         ` Jeremy Fitzhardinge
  2008-09-23  6:49           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 25+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-23  6:13 UTC (permalink / raw)
  To: benh
  Cc: Nick Piggin, Linux Memory Management List, Linux Kernel list,
	Hugh Dickins

Benjamin Herrenschmidt wrote:
> On Tue, 2008-09-23 at 05:10 +0200, Nick Piggin wrote:
>   
>> We are getting better slowly I think (eg. you note that set_pte_at is
>> no longer used as a generic "do anything"), but I won't dispute that
>> this whole area could use an overhaul; a document for all the rules,
>> a single person or point of responsibility for those rules...
>>     
>
> Can we nowadays -rely- on set_pte_at() never being called to overwrite
> an already valid PTE ? I mean, it looks like the generic code doesn't do
> it anymore but I wonder if it's reasonable to forbid that from coming
> back ? That would allow me to remove some hacks in ppc64 and simplify
> some upcoming ppc32 code.
>   

A good first step might be to define some conventions.  For example,
define that set_pte*() *always* means setting a non-valid pte to either
a new non-valid state (like a swap reference) or to a valid state. 
modify_pte() would modify the flags of a valid
pte, giving a new valid pte.  etc...

It may be that a given architecture collapses some or all of these down
to the same underlying functionality, but it would allow the core intent
to be clearly expressed.

What is the complete set of primitives we need?  I also noticed that a
number of the existing pagetable operations are used only once or twice
in the core code; I wonder if we really need such special cases, or
whether we can make each arch pte operation carry a bit more weight?

Also, rather than leaving all the rule enforcing to documentation and a
maintainer, we should also consider having a debug mode which adds
enough paranoid checks to each operation so that any rule breakage will
fail obviously on all architectures.

    J

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-23  6:13         ` Jeremy Fitzhardinge
@ 2008-09-23  6:49           ` Benjamin Herrenschmidt
  2008-09-23  9:50             ` Nick Piggin
  0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-09-23  6:49 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Linux Memory Management List, Linux Kernel list,
	Hugh Dickins

> A good first step might be to define some conventions.  For example,
> define that set_pte*() *always* means setting a non-valid pte to either
> a new non-valid state (like a swap reference) or to a valid state. 
> modify_pte() would modify the flags of a valid
> pte, giving a new valid pte.  etc...

Yup. Or make it clear that ptep_set_access_flags() should only be used
to -relax- access (ie, set dirty, writeable, accessed, ... but not
remove any of them).

> It may be that a given architecture collapses some or all of these down
> to the same underlying functionality, but it would allow the core intent
> to be clearly expressed.
> 
> What is the complete set of primitives we need?  I also noticed that a
> number of the existing pagetable operations are used only once or twice
> in the core code; I wonder if we really need such special cases, or
> whether we can make each arch pte operation carry a bit more weight?

Yes, that was some of my concern. It's getting close to having one API
per call site :-)

> Also, rather than leaving all the rule enforcing to documentation and a
> maintainer, we should also consider having a debug mode which adds
> enough paranoid checks to each operation so that any rule breakage will
> fail obviously on all architectures.

We could do both.

Now, regarding operations, let's first find the major call sites, see
what I miss. I'm omitting free_* in memory.c as those are for freeing
pte pages, not accessing PTEs themselves. I'm also ignoring read-only
call sites and hugetlb for now.

* None-iterative accessors

 - handle_pte_fault in memory.c, on "fixup" faults (pte is present and
it's not a COW), for fixing up DIRTY and ACCESSED (btw, could we make
that also fixup EXEC ? I would like this for some stuff I'm working on
at the moment, ie set it if the vma has VM_EXEC and it was lost from the
PTE as I might want to mask it out of PTEs under some circumstances).
Textbook usage of ptep_set_access_flags(), so that's fine.

 - do_wp_page() in memory.c for COW or fixup of shared writeable mapping
writeable-ness. Doesn't overwrite existing PTE for COW anymore, it uses
clear_flush nowadays and fixup of shared writeable mapping uses
ptep_set_access_flags() as it should, so that's all good.

 - insert_pfn() and insert_page() still in memory.c for fancy page
faults. Just a trivial set_pte_at() of a !present one, no big deal here

  - RMAP ones ? Some ad-hoc stuff due to _notify thingies.

* Iterative accessors (some don't batch, maybe they could/should).

 - zapping a mapping (zap_p*) in memory.c
 - fork (copy_p*) in memory.c could batch better maybe ?
 - setting linear user mappings (remap_p*) in memory.c, trivial
set_pte_at() on a range, pte's should be !present I think.
 - mprotect (change_p*) in memory.c, which has the problem I mentioned
 - moving page tables (move_p*), pretty trivial clear_flush + set_pte_at
 - clear_regs_pte_range via walk_page_range in fs/proc/task_mmu.c, does
a test_and_clear_young, flushes mm afterward, could use some lazy stuff
so we can batch properly on ppc64.
 - vmalloc, that's a bit special and kernel only, doesn't have nasty
races between creating/tearing down mappings vs. using them
 - highmem I leave alone for now, it's mostly trivial set_pte_at &
flushing for normal kmap but kmap_atomic can be nasty, though it's arch
specific.
 - some stuff in fremap I'm not too familiar with and I need to run...

What did I miss ?

Cheers,
Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-23  6:49           ` Benjamin Herrenschmidt
@ 2008-09-23  9:50             ` Nick Piggin
  2008-09-23 11:54               ` peter
  0 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2008-09-23  9:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jeremy Fitzhardinge, Linux Memory Management List,
	Linux Kernel list, Hugh Dickins

On Tue, Sep 23, 2008 at 04:49:32PM +1000, Benjamin Herrenschmidt wrote:
> 
> > A good first step might be to define some conventions.  For example,
> > define that set_pte*() *always* means setting a non-valid pte to either
> > a new non-valid state (like a swap reference) or to a valid state. 
> > modify_pte() would modify the flags of a valid
> > pte, giving a new valid pte.  etc...
> 
> Yup. Or make it clear that ptep_set_access_flags() should only be used
> to -relax- access (ie, set dirty, writeable, accessed, ... but not
> remove any of them).
> 
> > It may be that a given architecture collapses some or all of these down
> > to the same underlying functionality, but it would allow the core intent
> > to be clearly expressed.
> > 
> > What is the complete set of primitives we need?  I also noticed that a
> > number of the existing pagetable operations are used only once or twice
> > in the core code; I wonder if we really need such special cases, or
> > whether we can make each arch pte operation carry a bit more weight?
> 
> Yes, that was some of my concern. It's getting close to having one API
> per call site :-)

I don't think that is a huge problem as such... if there was lots of
repeated uses of the API I'd also be concerned about mm/ code not being
well factored :)

My concern is that things aren't well documented, maybe not consistent
enough, and are usually named according to what their implementation looks
like on the favourite arch of the person who introduced them, rather than
what the VM needs to get done :P

Which leads some architectures (I'm looking at ia64 ;)) to reinvent things
or just add new primitives rather than finding existing common code. Which
makes the problem worse.

And it makes generic VM developers not be able to follow what's going on
with all the different architectures...


 
> > Also, rather than leaving all the rule enforcing to documentation and a
> > maintainer, we should also consider having a debug mode which adds
> > enough paranoid checks to each operation so that any rule breakage will
> > fail obviously on all architectures.
> 
> We could do both.
> 
> Now, regarding operations, let's first find the major call sites, see
> what I miss. I'm omitting free_* in memory.c as those are for freeing
> pte pages, not accessing PTEs themselves. I'm also ignoring read-only
> call sites and hugetlb for now.
> 
> * None-iterative accessors
> 
>  - handle_pte_fault in memory.c, on "fixup" faults (pte is present and
> it's not a COW), for fixing up DIRTY and ACCESSED (btw, could we make
> that also fixup EXEC ? I would like this for some stuff I'm working on
> at the moment, ie set it if the vma has VM_EXEC and it was lost from the
> PTE as I might want to mask it out of PTEs under some circumstances).
> Textbook usage of ptep_set_access_flags(), so that's fine.
> 
>  - do_wp_page() in memory.c for COW or fixup of shared writeable mapping
> writeable-ness. Doesn't overwrite existing PTE for COW anymore, it uses
> clear_flush nowadays and fixup of shared writeable mapping uses
> ptep_set_access_flags() as it should, so that's all good.

This is one example of being too low level. It wouldn't be hard to have
an arch where the SMC race does not apply, or even it is probably possible
to avoid the flush in the case of single-threaded mm. Can't do that however,
because the call asks for a flush so we must flush.

 
>  - insert_pfn() and insert_page() still in memory.c for fancy page
> faults. Just a trivial set_pte_at() of a !present one, no big deal here
> 
>   - RMAP ones ? Some ad-hoc stuff due to _notify thingies.

rmap ones should only do set_pte_at after a clear_flush I think.
 

> * Iterative accessors (some don't batch, maybe they could/should).
> 
>  - zapping a mapping (zap_p*) in memory.c
>  - fork (copy_p*) in memory.c could batch better maybe ?
>  - setting linear user mappings (remap_p*) in memory.c, trivial
> set_pte_at() on a range, pte's should be !present I think.

Yes.


>  - mprotect (change_p*) in memory.c, which has the problem I mentioned
>  - moving page tables (move_p*), pretty trivial clear_flush + set_pte_at
>  - clear_regs_pte_range via walk_page_range in fs/proc/task_mmu.c, does
> a test_and_clear_young, flushes mm afterward, could use some lazy stuff
> so we can batch properly on ppc64.
>  - vmalloc, that's a bit special and kernel only, doesn't have nasty
> races between creating/tearing down mappings vs. using them
>  - highmem I leave alone for now, it's mostly trivial set_pte_at &
> flushing for normal kmap but kmap_atomic can be nasty, though it's arch
> specific.

It's also for kva rather than uva, so it's a bit different too...


>  - some stuff in fremap I'm not too familiar with and I need to run...

fremap is only ever doing clear_flush on present ptes, or set_pte_at on
pte_none_ptes (and in this case it is going from !present to !present,
which is probably also unusual in some cases for arch code to deal with,
although it is not restricted to fremap I guess)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-23  9:50             ` Nick Piggin
@ 2008-09-23 11:54               ` peter
  0 siblings, 0 replies; 25+ messages in thread
From: peter @ 2008-09-23 11:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Benjamin Herrenschmidt, Jeremy Fitzhardinge,
	Linux Memory Management List, Linux Kernel list, Hugh Dickins

>>>>> "Nick" == Nick Piggin <npiggin@suse.de> writes:

Nick> On Tue, Sep 23, 2008 at 04:49:32PM +1000, Benjamin Herrenschmidt
Nick> wrote:

>> > What is the complete set of primitives we need?  I also noticed
>> that a > number of the existing pagetable operations are used only
>> once or twice > in the core code; I wonder if we really need such
>> special cases, or > whether we can make each arch pte operation
>> carry a bit more weight?
>> 
>> Yes, that was some of my concern. It's getting close to having one
>> API per call site :-)

Nick> I don't think that is a huge problem as such... if there was
Nick> lots of repeated uses of the API I'd also be concerned about mm/
Nick> code not being well factored :)

Is it worth taking another look at the page-table abstraction layer
that Darren Williams posted here last year?

PeterC

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-22 21:05   ` Benjamin Herrenschmidt
  2008-09-23  3:10     ` Nick Piggin
@ 2008-09-24 18:45     ` Hugh Dickins
  2008-09-24 21:20       ` Benjamin Herrenschmidt
  2008-09-24 22:17       ` Martin Schwidefsky
  1 sibling, 2 replies; 25+ messages in thread
From: Hugh Dickins @ 2008-09-24 18:45 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jeremy Fitzhardinge, Linux Memory Management List,
	Linux Kernel list, Nick Piggin, Martin Schwidefsky

On Tue, 23 Sep 2008, Benjamin Herrenschmidt wrote:
> 
> The bug may have been there, as I said, lots of unwritten rules...
> sometimes broken. I'm not necessarily blaming you, but there have been
> lots of changes to the PTE accessors over the last 2 years and not
> always under any control :-)
> 
> In our case, the consequence is that the entry can be re-hashed because
> the fact that it was already hashed and where it was hashed, which is
> encoded in the PTE, gets lost by the clear. That means a potential
> duplicate entry in the hash. A hard to hit race, but possible. Such a
> condition is architecturally illegal and can cause things ranging from
> incorrect translation to machine checks or checkstops (generally, on
> LPAR machines, what will happen is your partition will get killed).

The powerpc bug whereof you write appears to have been there since ...
linux-2.4.0 or earlier:
			entry = ptep_get_and_clear(pte);
			set_pte(pte, pte_modify(entry, newprot));

But perhaps powerpc was slightly different back in those days.
It sounds to me like a bug in your current ptep_get_and_clear(),
not checking if already hashed?

> I know s390 has different issues & constraints. Martin told me during
> Plumbers that mprotect was probably also broken for him.

Then I hope he will probably send Linus the fix.

Though what we already have falls somewhat short of perfection,
I've much more enthusiasm for fixing its bugs, than for any fancy
redesign introducing its own bugs.  Others have more stamina!

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-24 18:45     ` Hugh Dickins
@ 2008-09-24 21:20       ` Benjamin Herrenschmidt
  2008-09-24 21:57         ` Jeremy Fitzhardinge
  2008-09-24 23:55         ` Hugh Dickins
  2008-09-24 22:17       ` Martin Schwidefsky
  1 sibling, 2 replies; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-09-24 21:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jeremy Fitzhardinge, Linux Memory Management List,
	Linux Kernel list, Nick Piggin, Martin Schwidefsky

On Wed, 2008-09-24 at 19:45 +0100, Hugh Dickins wrote:

> The powerpc bug whereof you write appears to have been there since ...
> linux-2.4.0 or earlier:
> 			entry = ptep_get_and_clear(pte);
> 			set_pte(pte, pte_modify(entry, newprot));
> 
> But perhaps powerpc was slightly different back in those days.
> It sounds to me like a bug in your current ptep_get_and_clear(),
> not checking if already hashed?

Yes, I figured out the bug was already there. And no, it's not the
right approach to have ptep_get_and_clear() flush because it would
mean that call cannot batch flushes, and thus we would lose ability to
batch in zap_pte_range().

> Though what we already have falls somewhat short of perfection,
> I've much more enthusiasm for fixing its bugs, than for any fancy
> redesign introducing its own bugs.  Others have more stamina!

Well, the current set accessor, as far as I'm concerned is a big pile of
steaming shit that evolved from x86-specific gunk raped in different
horrible ways to make it looks like it fits on other architectures and
additionally mashed with goo to make it somewhat palatable by
virtualization stuff. Yes, bugs can be fixed but it's still an horrible
mess.

Now, regarding the above bug, I'm afraid the only approaches I see that
would work would be to have either a ptep_get_and_clear_flush(), which I
suppose x86 virt. people will hate, or maybe to actually have a powerpc
specific variant of the new start/commit hooks that does the flush.

Ben.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-24 21:20       ` Benjamin Herrenschmidt
@ 2008-09-24 21:57         ` Jeremy Fitzhardinge
  2008-09-24 22:07           ` Benjamin Herrenschmidt
  2008-09-24 23:55         ` Hugh Dickins
  1 sibling, 1 reply; 25+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-24 21:57 UTC (permalink / raw)
  To: benh
  Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list,
	Nick Piggin, Martin Schwidefsky, Peter Chubb

Benjamin Herrenschmidt wrote:
> Well, the current set accessor, as far as I'm concerned is a big pile of
> steaming shit that evolved from x86-specific gunk raped in different
> horrible ways to make it looks like it fits on other architectures and
> additionally mashed with goo to make it somewhat palatable by
> virtualization stuff. Yes, bugs can be fixed but it's still an horrible
> mess.
>   

What do you propose then?  Ideally one would like to get something that
works for powerpc, s390, all the wacky ia64 modes as well as x86.  The
ia64 folks proposed something, but I've not looked at it closely.  From
an x86 virtualization perspective, something that's basically x86 with
as much scope for batching and deferring as possible would be fine.

As a start, what's the state machine for a pte?  What states can it be
in, and how does it move from state to state?  It sounds like powerpc
has at least one extra state above x86 (hashed, with the hash key stored
in the pte itself?).

> Now, regarding the above bug, I'm afraid the only approaches I see that
> would work would be to have either a ptep_get_and_clear_flush(), which I
> suppose x86 virt. people will hate, or maybe to actually have a powerpc
> specific variant of the new start/commit hooks that does the flush.
>   

ptep_get_and_clear() is not batchable anyway, because the x86
implementation requires an atomic xchg on the pte, which will likely
result in some sort of trap (and if it doesn't then it doesn't need
batching).  The start/commit API was specifically so that we can do the
mprotect (and fork COW updates) in a batchable way (in Xen its
implemented with a pte update hypercall which updates the pte without
affecting the A/D bits).

    J

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-24 21:57         ` Jeremy Fitzhardinge
@ 2008-09-24 22:07           ` Benjamin Herrenschmidt
  2008-09-24 22:43             ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-09-24 22:07 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list,
	Nick Piggin, Martin Schwidefsky, Peter Chubb

> What do you propose then?  Ideally one would like to get something that
> works for powerpc, s390, all the wacky ia64 modes as well as x86.  The
> ia64 folks proposed something, but I've not looked at it closely.  From
> an x86 virtualization perspective, something that's basically x86 with
> as much scope for batching and deferring as possible would be fine.

That's where things get interesting. I liked Nick ideas of doing
something transactional that could encompass the lock, bach and flushing
but that may be too much at this stage...

> As a start, what's the state machine for a pte?  What states can it be
> in, and how does it move from state to state?  It sounds like powerpc
> has at least one extra state above x86 (hashed, with the hash key stored
> in the pte itself?).

We store in the PTE whether it was hashed, and the location within a
hash bucket. (For each hash value, there's 8 buckets, or rather 16 if
you count our secondary hashing).

We must never write a new valid PTE after we cleared a hashed one
without having a flush in between.

On 32 bits we have less state (only the 'hashed' bit) but the problem is
similar, though we handle it differently: we never clear the hash bit
until we flush the hash, ie, pte_clear doesn't clear the hash bit. On
64-bit we do things differently, we do clear PTEs and pile up in a
per-cpu batch what needs to be flushed, the flush then happens when
leaving lazy mode.

> ptep_get_and_clear() is not batchable anyway, because the x86
> implementation requires an atomic xchg on the pte, which will likely
> result in some sort of trap (and if it doesn't then it doesn't need
> batching).

Well, ptep_get_and_clear() used to be used by zap_pte_range() which I
_HOPE_ was batchable on x86 :-)

Nowadays, there's this new ptep_get_and_clear_full() (yet another
totally meaningless name for an ad-hoc API added for some random special
purpose) that zap_pte_range() uses. Maybe that one is now subtly
different such as it can be used to batch on x86 ?

In any case, powerpc batches -everything- (unless it's called *_flush in
which case the flush is immediate) in a private per-cpu batch and
flushes the hash when leaving lazy mode.

>   The start/commit API was specifically so that we can do the
> mprotect (and fork COW updates) in a batchable way (in Xen its
> implemented with a pte update hypercall which updates the pte without
> affecting the A/D bits).

I think we have different ideas of what batch means but yeah, we do
batch everything including these on powerpc without the new start/commit
interface.

Ben.

>     J
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-24 22:07           ` Benjamin Herrenschmidt
@ 2008-09-24 22:43             ` Jeremy Fitzhardinge
  2008-09-24 22:53               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 25+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-24 22:43 UTC (permalink / raw)
  To: benh
  Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list,
	Nick Piggin, Martin Schwidefsky, Peter Chubb

Benjamin Herrenschmidt wrote:
>> What do you propose then?  Ideally one would like to get something that
>> works for powerpc, s390, all the wacky ia64 modes as well as x86.  The
>> ia64 folks proposed something, but I've not looked at it closely.  From
>> an x86 virtualization perspective, something that's basically x86 with
>> as much scope for batching and deferring as possible would be fine.
>>     
>
> That's where things get interesting. I liked Nick ideas of doing
> something transactional that could encompass the lock, bach and flushing
> but that may be too much at this stage...
>   

Yes, that sounds fine in principle, but the practise gets tricky.  The
trouble with that kind of interface is that it can be fairly heavyweight
unless you amortise the cost of the transaction setup/commit with
multiple operations.

>> ptep_get_and_clear() is not batchable anyway, because the x86
>> implementation requires an atomic xchg on the pte, which will likely
>> result in some sort of trap (and if it doesn't then it doesn't need
>> batching).
>>     
>
> Well, ptep_get_and_clear() used to be used by zap_pte_range() which I
> _HOPE_ was batchable on x86 :-)
>
> Nowadays, there's this new ptep_get_and_clear_full() (yet another
> totally meaningless name for an ad-hoc API added for some random special
> purpose) that zap_pte_range() uses. Maybe that one is now subtly
> different such as it can be used to batch on x86 ?
>   

Yeah.  zap_pte_range isn't great when doing a munmap, but the _full gunk
lets it special-case process teardown and it ends up not being a
performance problem (in the Xen case, we've already switch to another
pagetable at that point, so it isn't really a pagetable any more and
needs no hypercalls).

> In any case, powerpc batches -everything- (unless it's called *_flush in
> which case the flush is immediate) in a private per-cpu batch and
> flushes the hash when leaving lazy mode.
>   

Are you using the lazy_mmu hooks we put in, or something else?

>>   The start/commit API was specifically so that we can do the
>> mprotect (and fork COW updates) in a batchable way (in Xen its
>> implemented with a pte update hypercall which updates the pte without
>> affecting the A/D bits).
>>     
>
> I think we have different ideas of what batch means but yeah, we do
> batch everything including these on powerpc without the new start/commit
> interface.

Likely; it's one of those overused generic words.  The specific meaning
I'm using is "we can roll a bunch of updates into a single hypercall". 
Operations which do atomic RMW  or fetch-set are typically not batchable
because the caller wants the result *now* and can't wait for a deferred
result.

    J

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-24 22:43             ` Jeremy Fitzhardinge
@ 2008-09-24 22:53               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-09-24 22:53 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list,
	Nick Piggin, Martin Schwidefsky, Peter Chubb

> Yes, that sounds fine in principle, but the practise gets tricky.  The
> trouble with that kind of interface is that it can be fairly heavyweight
> unless you amortise the cost of the transaction setup/commit with
> multiple operations.

I agree. Anyway, I'm sure if we get everybody around the same table, we
can find something, maybe not that high level, but a better set of
accessors that fits all needs with a more precise semantic. Anyway, I'll
see if I can come up with ideas later that we can discuss.

It's also all inline so it does give us flexibility in having things
compile down to nothing on archs where they aren't needed.

> Are you using the lazy_mmu hooks we put in, or something else?

Yes. We used to do it differently but it was actually racy in a subtle
way, and I switched it to use your lazy_mmu hooks a while ago.

> Likely; it's one of those overused generic words.  The specific meaning
> I'm using is "we can roll a bunch of updates into a single hypercall". 
> Operations which do atomic RMW  or fetch-set are typically not batchable
> because the caller wants the result *now* and can't wait for a deferred
> result.

Right. For us it's mostly about flushing the hash entries, though there
is no fundamental difference I believe here, it's about updating a
cache, whether it's the TLB, our hash table, hypervisor shadows, etc...
and we -should- be able to find a more sensible common ground.

Cheers,
Ben.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-24 21:20       ` Benjamin Herrenschmidt
  2008-09-24 21:57         ` Jeremy Fitzhardinge
@ 2008-09-24 23:55         ` Hugh Dickins
  2008-09-25  1:04           ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2008-09-24 23:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jeremy Fitzhardinge, Linux Memory Management List,
	Linux Kernel list, Nick Piggin, Martin Schwidefsky

On Thu, 25 Sep 2008, Benjamin Herrenschmidt wrote:
> 
> Now, regarding the above bug, I'm afraid the only approaches I see that
> would work would be to have either a ptep_get_and_clear_flush(), which I
> suppose x86 virt. people will hate, or maybe to actually have a powerpc
> specific variant of the new start/commit hooks that does the flush.

Whyever not the latter?  Jeremy seems to have gifted that to you,
for precisely such a purpose.

Hugh

p.s. I surely agree with you over the name ptep_get_and_clear_full():
horrid, even more confusing than the tlb->fullmm from which it derives
its name.  I expect I'd agree with you over a lot more too, but please,
bugfixes first.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-24 23:55         ` Hugh Dickins
@ 2008-09-25  1:04           ` Benjamin Herrenschmidt
  2008-09-25 18:15             ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-09-25  1:04 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jeremy Fitzhardinge, Linux Memory Management List,
	Linux Kernel list, Nick Piggin, Martin Schwidefsky

On Thu, 2008-09-25 at 00:55 +0100, Hugh Dickins wrote:
> 
> Whyever not the latter?  Jeremy seems to have gifted that to you,
> for precisely such a purpose.

Yeah. Not that I don't quite understand what the point of the
start/modify/commit thing the way it's currently used in mprotect since
we are doing the whole transaction for a single PTE change, ie how does
that help with hypervisors vs. a single ptep_modify_protection() for
example is beyond me :-)

When I think about transactions, I think about starting a transaction,
changing a -bunch- of PTEs, then commiting... Essentially I see the PTE
lock thing as being a transaction.

Cheers,
Ben.

> Hugh
> 
> p.s. I surely agree with you over the name ptep_get_and_clear_full():
> horrid, even more confusing than the tlb->fullmm from which it derives
> its name.  I expect I'd agree with you over a lot more too, but
> please, bugfixes first.

Sure.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-25  1:04           ` Benjamin Herrenschmidt
@ 2008-09-25 18:15             ` Jeremy Fitzhardinge
  2008-09-25 21:44               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 25+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-25 18:15 UTC (permalink / raw)
  To: benh
  Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list,
	Nick Piggin, Martin Schwidefsky

Benjamin Herrenschmidt wrote:
> On Thu, 2008-09-25 at 00:55 +0100, Hugh Dickins wrote:
>   
>> Whyever not the latter?  Jeremy seems to have gifted that to you,
>> for precisely such a purpose.
>>     
>
> Yeah. Not that I don't quite understand what the point of the
> start/modify/commit thing the way it's currently used in mprotect since
> we are doing the whole transaction for a single PTE change, ie how does
> that help with hypervisors vs. a single ptep_modify_protection() for
> example is beyond me :-)
>
> When I think about transactions, I think about starting a transaction,
> changing a -bunch- of PTEs, then commiting... Essentially I see the PTE
> lock thing as being a transaction.

I think we need to be a bit clearer about the terminology:

A batch is a bunch of things that can be optionally deferred so they can
be done in chunks.

A transaction is something that must either complete successfully, or
have no effect at all.  Doing multiple things in a transaction means
that they must all complete or none.  (In general we assume there's
nothing about these low-level pagetable operations which can fail, so we
can ignore the failure part of transactions.)

In this case, a batch is not a transaction.  Doing things between
arch_enter_lazy_mmu_mode/arch_leave_lazy_mmu_mode makes no guarantees
about when operations are performed, other than guaranteeing that
they'll all be done by the time arch_leave_lazy_mmu_mode returns.  
Everything about how things are chunked into batches are up to the
underlying architecture, and the calling code can't make any assumptions
about it (the specific problem we've had to fix in a couple of places is
things expecting to be able to read back their recent pagetable
modifications immediately after issuing the call).

The ptep_modify_prot_start/commit pair specifies a single pte update in
such a way to allow more implementation flexibility - ie, there's no
naked requirement for an atomic fetch-and-clear operation.  I chose the
transaction-like terminology to emphasize that the start/commit
functions must be strictly paired; there's no way to fail or abort the
"transaction".  A whole group of those start/commit pairs can be batched
together without affecting their semantics.

    J

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-25 18:15             ` Jeremy Fitzhardinge
@ 2008-09-25 21:44               ` Benjamin Herrenschmidt
  2008-09-25 22:27                 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-09-25 21:44 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list,
	Nick Piggin, Martin Schwidefsky

On Thu, 2008-09-25 at 11:15 -0700, Jeremy Fitzhardinge wrote:
> The ptep_modify_prot_start/commit pair specifies a single pte update in
> such a way to allow more implementation flexibility - ie, there's no
> naked requirement for an atomic fetch-and-clear operation.  I chose the
> transaction-like terminology to emphasize that the start/commit
> functions must be strictly paired; there's no way to fail or abort the
> "transaction".  A whole group of those start/commit pairs can be batched
> together without affecting their semantics.

I still can't see the point of having now 3 functions instead of just
one such as ptep_modify_protection(). I don't see what it buys you other
than adding gratuituous new interfaces.

Ben;.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-25 21:44               ` Benjamin Herrenschmidt
@ 2008-09-25 22:27                 ` Jeremy Fitzhardinge
  2008-09-25 23:02                   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 25+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-25 22:27 UTC (permalink / raw)
  To: benh
  Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list,
	Nick Piggin, Martin Schwidefsky

Benjamin Herrenschmidt wrote:
> On Thu, 2008-09-25 at 11:15 -0700, Jeremy Fitzhardinge wrote:
>   
>> The ptep_modify_prot_start/commit pair specifies a single pte update in
>> such a way to allow more implementation flexibility - ie, there's no
>> naked requirement for an atomic fetch-and-clear operation.  I chose the
>> transaction-like terminology to emphasize that the start/commit
>> functions must be strictly paired; there's no way to fail or abort the
>> "transaction".  A whole group of those start/commit pairs can be batched
>> together without affecting their semantics.
>>     
>
> I still can't see the point of having now 3 functions instead of just
> one such as ptep_modify_protection(). I don't see what it buys you other
> than adding gratuituous new interfaces.
>   

Yeah, that would work too; that's pretty much how Xen implements it
anyway.  The main advantage of the start/commit pair is that the
resulting code was completely unchanged from the old code.  The mprotect
sequence using ptep_modify_protection would end up reading the pte twice
before writing it.

    J

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-25 22:27                 ` Jeremy Fitzhardinge
@ 2008-09-25 23:02                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-09-25 23:02 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list,
	Nick Piggin, Martin Schwidefsky

On Thu, 2008-09-25 at 15:27 -0700, Jeremy Fitzhardinge wrote:
> Yeah, that would work too; that's pretty much how Xen implements it
> anyway.  The main advantage of the start/commit pair is that the
> resulting code was completely unchanged from the old code.  The mprotect
> sequence using ptep_modify_protection would end up reading the pte twice
> before writing it.

Not necessarily .. depends how you factor out the interface to it.

Anyway, not a big deal now. I'll do a patch to fix the hole on powerpc,
and if my brain clicks, over the next few weeks, I'll see if I can come
up with an overall nicer API covering all usages. In many case might
just be a matter of giving a saner name to existing calls and
documenting them properly tho :-)

Cheers,
Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: PTE access rules & abstraction
  2008-09-24 18:45     ` Hugh Dickins
  2008-09-24 21:20       ` Benjamin Herrenschmidt
@ 2008-09-24 22:17       ` Martin Schwidefsky
  1 sibling, 0 replies; 25+ messages in thread
From: Martin Schwidefsky @ 2008-09-24 22:17 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Benjamin Herrenschmidt, Jeremy Fitzhardinge,
	Linux Memory Management List, Linux Kernel list, Nick Piggin

On Wed, 2008-09-24 at 19:45 +0100, Hugh Dickins wrote:
> > I know s390 has different issues & constraints. Martin told me during
> > Plumbers that mprotect was probably also broken for him.
> 
> Then I hope he will probably send Linus the fix.
> 
> Though what we already have falls somewhat short of perfection,
> I've much more enthusiasm for fixing its bugs, than for any fancy
> redesign introducing its own bugs.  Others have more stamina!

As far as I can tell the current code should work. It is not pretty
though, in particular the nasty pairing of flush_tlb_mm() with
ptep_set_wrprotect() and flush_tlb_range() with change_protection() is
fragile. For me the question is if we can find a sensible set of basic
primitives that work for all architectures in a performant way. This is
really hard..

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2008-09-25 23:02 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-09-19 17:42 PTE access rules & abstraction Benjamin Herrenschmidt
2008-09-22  6:22 ` Jeremy Fitzhardinge
2008-09-22 21:05   ` Benjamin Herrenschmidt
2008-09-23  3:10     ` Nick Piggin
2008-09-23  3:16       ` David Miller, Nick Piggin
2008-09-23  5:35         ` Benjamin Herrenschmidt
2008-09-23  6:18           ` Nick Piggin
2008-09-23  5:31       ` Benjamin Herrenschmidt
2008-09-23  6:13         ` Jeremy Fitzhardinge
2008-09-23  6:49           ` Benjamin Herrenschmidt
2008-09-23  9:50             ` Nick Piggin
2008-09-23 11:54               ` peter
2008-09-24 18:45     ` Hugh Dickins
2008-09-24 21:20       ` Benjamin Herrenschmidt
2008-09-24 21:57         ` Jeremy Fitzhardinge
2008-09-24 22:07           ` Benjamin Herrenschmidt
2008-09-24 22:43             ` Jeremy Fitzhardinge
2008-09-24 22:53               ` Benjamin Herrenschmidt
2008-09-24 23:55         ` Hugh Dickins
2008-09-25  1:04           ` Benjamin Herrenschmidt
2008-09-25 18:15             ` Jeremy Fitzhardinge
2008-09-25 21:44               ` Benjamin Herrenschmidt
2008-09-25 22:27                 ` Jeremy Fitzhardinge
2008-09-25 23:02                   ` Benjamin Herrenschmidt
2008-09-24 22:17       ` Martin Schwidefsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox