* PTE access rules & abstraction @ 2008-09-19 17:42 Benjamin Herrenschmidt 2008-09-22 6:22 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 25+ messages in thread From: Benjamin Herrenschmidt @ 2008-09-19 17:42 UTC (permalink / raw) To: Linux Memory Management List; +Cc: Linux Kernel list, Nick Piggin, Hugh Dickins Hi ! Just yesterday, I was browsing through the users of set_pte_at() to check something, and stumbled on a (new ?) bug that will introduce subtle problems on at least powerpc and s390. No big deal, I'll send a fix, but I'm becoming concerned with how fragile our page table & PTE access has become. (The bug btw is that we ptep_get_and_clear followed by a set_pte_at, at least on those architectures, you -must- flush before you put something new after you have cleared a PTE, I'll have to fixup our implementation of the new pte_modify_start/commit). With the need of the various virtual machines on x86, we've seen new page table accessors being created like there is no tomorrow, changes in the PTEs are accessed that may or may not be things we can rely on being stable in arch code, etc... Unfortunately, the arch code often has a very intimate relationship to how page tables are handled. The rules for locking, what can and cannot be done within a single PTE lock section, what can or cannot be done on a PTE, for example after it's been cleared, etc... vary in subtle ways and the way the things are today, the situation is very messy and fragile. Maybe it's time to have one head in "charge" of the page table access to try to keep some sanity, maybe it's time to write down some rules (for example, can we rely now and forever that set_pte_at() will -never- be called to write on top of an already valid PTE ?, etc...). But maybe it's time to try to move the abstraction up a bit, maybe along the lines of what Nick proposed a while ago, some kind of transactional model. That would give a lot more freedom to architectures to have their own PTE access rules and optimisations. Comments ? Ideas ? Cheers, Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-19 17:42 PTE access rules & abstraction Benjamin Herrenschmidt @ 2008-09-22 6:22 ` Jeremy Fitzhardinge 2008-09-22 21:05 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 25+ messages in thread From: Jeremy Fitzhardinge @ 2008-09-22 6:22 UTC (permalink / raw) To: benh Cc: Linux Memory Management List, Linux Kernel list, Nick Piggin, Hugh Dickins Benjamin Herrenschmidt wrote: > Just yesterday, I was browsing through the users of set_pte_at() to > check something, and stumbled on a (new ?) bug that will introduce > subtle problems on at least powerpc and s390. > > No big deal, I'll send a fix, but I'm becoming concerned with how > fragile our page table & PTE access has become. > > (The bug btw is that we ptep_get_and_clear followed by a set_pte_at, at > least on those architectures, you -must- flush before you put something > new after you have cleared a PTE, I'll have to fixup our implementation > of the new pte_modify_start/commit). > When I added the ptep_modify_* interface, it occurred to me that assuming that ptep_get_and_clear would always prevent async pte updates was a bit optimistic, or at least presumptuous. And certainly not flushing the tlb seems like something that just happens to work on x86 (in fact I'm not quite sure how it does work on x86). I didn't change the function of that code when I made the change, so the bug was pre-existing; I think it has been like that for quite a while (though I haven't done any git archaeology to back that up). So I don't think there's a new bug here. What's the consequence of not flushing for you? > With the need of the various virtual machines on x86, we've seen new > page table accessors being created like there is no tomorrow, changes in > the PTEs are accessed that may or may not be things we can rely on being > stable in arch code, etc... > There have been a few optional-extras calls, which are all fine to leave as no-ops. I don't think we've added or changed any must-have interfaces. > Unfortunately, the arch code often has a very intimate relationship to > how page tables are handled. The rules for locking, what can and cannot > be done within a single PTE lock section, what can or cannot be done on > a PTE, for example after it's been cleared, etc... vary in subtle ways > and the way the things are today, the situation is very messy and > fragile. > It seems to me that the rules are roughly "anything goes while you're holding the pte lock, but it all must be sane by the time you release it". But a bit more precision would be useful. > Maybe it's time to have one head in "charge" of the page table access to > try to keep some sanity, maybe it's time to write down some rules (for > example, can we rely now and forever that set_pte_at() will -never- be > called to write on top of an already valid PTE ?, etc...). > I don't know if there's any such rule regarding set_pte_at(). Certainly if you were overwriting an existing valid entry, you'd have to arrange for a tlb flush. > But maybe it's time to try to move the abstraction up a bit, maybe along > the lines of what Nick proposed a while ago, some kind of transactional > model. That would give a lot more freedom to architectures to have their > own PTE access rules and optimisations. Do you have a reference to Nick's proposals? A higher level interface might give us virtualization people more scope to play with things. But given that the current interface mostly works for everyone, a high level interface would tend to result in a lot of duplicated code unless you pretty strictly stipluate something like "implement the highest level interface you need *and no higher*; use common library code for everything else". I think documenting the real rules for the current interface would be a more immediately fruitful path to start with. J -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-22 6:22 ` Jeremy Fitzhardinge @ 2008-09-22 21:05 ` Benjamin Herrenschmidt 2008-09-23 3:10 ` Nick Piggin 2008-09-24 18:45 ` Hugh Dickins 0 siblings, 2 replies; 25+ messages in thread From: Benjamin Herrenschmidt @ 2008-09-22 21:05 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Linux Memory Management List, Linux Kernel list, Nick Piggin, Hugh Dickins > I didn't change the function of that code when I made the change, so the > bug was pre-existing; I think it has been like that for quite a while > (though I haven't done any git archaeology to back that up). So I don't > think there's a new bug here. > > What's the consequence of not flushing for you? The bug may have been there, as I said, lots of unwritten rules... sometimes broken. I'm not necessarily blaming you, but there have been lots of changes to the PTE accessors over the last 2 years and not always under any control :-) In our case, the consequence is that the entry can be re-hashed because the fact that it was already hashed and where it was hashed, which is encoded in the PTE, gets lost by the clear. That means a potential duplicate entry in the hash. A hard to hit race, but possible. Such a condition is architecturally illegal and can cause things ranging from incorrect translation to machine checks or checkstops (generally, on LPAR machines, what will happen is your partition will get killed). I know s390 has different issues & constraints. Martin told me during Plumbers that mprotect was probably also broken for him. > There have been a few optional-extras calls, which are all fine to leave > as no-ops. I don't think we've added or changed any must-have interfaces. Again, I'm not necessarily talking about the very latest round of changes that you did here... More like a general approach as trying to find a better overall interface to PTE access which currently is a total mess as far as I'm concerned. > It seems to me that the rules are roughly "anything goes while you're > holding the pte lock, but it all must be sane by the time you release > it". But a bit more precision would be useful. Well, unfortunately it's a lot more complex than that. For example, on powerpc, we have to keep the hash table in sync and hash misses don't take the PTE lock. They are aking to a non-atomic HW walk of the page tables if you prefer. So we must be careful about things like never replacing a PTE without first flushing the hash table entry for the previous one (if it was hashed). We also need to ensure we flush before we unlock, but we do that by re-using the new lazy MMU hooks, as to not insert a new PTE (which can potentially be hashed). s390 has more complex sets of rules. > I don't know if there's any such rule regarding set_pte_at(). Certainly > if you were overwriting an existing valid entry, you'd have to arrange > for a tlb flush. Well, for example, the generic copy-on-write case used to do just that, ie set_pte_at() over the previous one, then flush. That is not good for us and thus our set_pte_at() has special code to check whether there's already a present PTE there and does a synchronous flush if there is. However, that also changed, and nowadays, afaik, set_pte_at() is never called anymore to override a already present PTE. We are doing some work on 32 bits code that will need similar constraints and It would be nice if we could rely on the above and make it part of the rules. > Do you have a reference to Nick's proposals? Not off hand, Nick ? How far did you go down that path ? > A higher level interface might give us virtualization people more scope > to play with things. But given that the current interface mostly works > for everyone, a high level interface would tend to result in a lot of > duplicated code unless you pretty strictly stipluate something like > "implement the highest level interface you need *and no higher*; use > common library code for everything else". > > I think documenting the real rules for the current interface would be a > more immediately fruitful path to start with. Well, part of the problem is that with the current interface, some architectures are really mostly hacking around, and in some case, possibly losing potential performance benefits by not having a slightly more abstracted interface. I don't think we should ditch the page tables in favor of some kind of abstract memory objects, don't get me wrong :-) But having a more generic higher level set of transactional interfaces to modifying and flushing PTEs would probably help. Instead we are doing as-hoc hacks left and right mostly based on how things work on x86. Ben. > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-22 21:05 ` Benjamin Herrenschmidt @ 2008-09-23 3:10 ` Nick Piggin 2008-09-23 3:16 ` David Miller, Nick Piggin 2008-09-23 5:31 ` Benjamin Herrenschmidt 2008-09-24 18:45 ` Hugh Dickins 1 sibling, 2 replies; 25+ messages in thread From: Nick Piggin @ 2008-09-23 3:10 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Jeremy Fitzhardinge, Linux Memory Management List, Linux Kernel list, Hugh Dickins On Tue, Sep 23, 2008 at 07:05:51AM +1000, Benjamin Herrenschmidt wrote: > > > I didn't change the function of that code when I made the change, so the > > bug was pre-existing; I think it has been like that for quite a while > > (though I haven't done any git archaeology to back that up). So I don't > > think there's a new bug here. > > > > What's the consequence of not flushing for you? > > The bug may have been there, as I said, lots of unwritten rules... > sometimes broken. I'm not necessarily blaming you, but there have been > lots of changes to the PTE accessors over the last 2 years and not > always under any control :-) > > In our case, the consequence is that the entry can be re-hashed because > the fact that it was already hashed and where it was hashed, which is > encoded in the PTE, gets lost by the clear. That means a potential > duplicate entry in the hash. A hard to hit race, but possible. Such a > condition is architecturally illegal and can cause things ranging from > incorrect translation to machine checks or checkstops (generally, on > LPAR machines, what will happen is your partition will get killed). > > I know s390 has different issues & constraints. Martin told me during > Plumbers that mprotect was probably also broken for him. > > > There have been a few optional-extras calls, which are all fine to leave > > as no-ops. I don't think we've added or changed any must-have interfaces. > > Again, I'm not necessarily talking about the very latest round of > changes that you did here... More like a general approach as trying to > find a better overall interface to PTE access which currently is a total > mess as far as I'm concerned. > > > It seems to me that the rules are roughly "anything goes while you're > > holding the pte lock, but it all must be sane by the time you release > > it". But a bit more precision would be useful. > > Well, unfortunately it's a lot more complex than that. For example, on > powerpc, we have to keep the hash table in sync and hash misses don't > take the PTE lock. They are aking to a non-atomic HW walk of the page > tables if you prefer. > > So we must be careful about things like never replacing a PTE without > first flushing the hash table entry for the previous one (if it was > hashed). We also need to ensure we flush before we unlock, but we do > that by re-using the new lazy MMU hooks, as to not insert a new PTE > (which can potentially be hashed). > > s390 has more complex sets of rules. > > > I don't know if there's any such rule regarding set_pte_at(). Certainly > > if you were overwriting an existing valid entry, you'd have to arrange > > for a tlb flush. > > Well, for example, the generic copy-on-write case used to do just that, > ie set_pte_at() over the previous one, then flush. That is not good for > us and thus our set_pte_at() has special code to check whether there's > already a present PTE there and does a synchronous flush if there is. > > However, that also changed, and nowadays, afaik, set_pte_at() is never > called anymore to override a already present PTE. > > We are doing some work on 32 bits code that will need similar > constraints and It would be nice if we could rely on the above and make > it part of the rules. > > > Do you have a reference to Nick's proposals? > > Not off hand, Nick ? How far did you go down that path ? I got something far enough to compile and boot (iirc on x86 and maybe ia64). Basically it was designed with fine grained pte locking in mind which is why I stopped working on it after the pmd locking from Hugh, but I imagine it could be used in some cases to do more intelligent things with pte manipulation. IIRC, basically you would have a pte_modify_start, which gives back a cookie and the pte value, and pte_modify_end which takes a new pte value and the cookie. This would allow the arch to see exactly how the pte was changed, but I imagine in some cases you would still prefer to do that with optimised direct functions rather than test and branches. So I don't know if it would be a magic cure ;) Part of the problem with the pte API (as well as the cache flush and tlb flush APIs) is that it often involves the core mm code telling the arch how it thinks ptes,tlbs,caches should be managed, rather than I think the better approach would be telling the arch what it wants to do. We are getting better slowly I think (eg. you note that set_pte_at is no longer used as a generic "do anything"), but I won't dispute that this whole area could use an overhaul; a document for all the rules, a single person or point of responsibility for those rules... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-23 3:10 ` Nick Piggin @ 2008-09-23 3:16 ` David Miller, Nick Piggin 2008-09-23 5:35 ` Benjamin Herrenschmidt 2008-09-23 5:31 ` Benjamin Herrenschmidt 1 sibling, 1 reply; 25+ messages in thread From: David Miller, Nick Piggin @ 2008-09-23 3:16 UTC (permalink / raw) To: npiggin; +Cc: benh, jeremy, linux-mm, linux-kernel, hugh > Part of the problem with the pte API (as well as the cache flush and > tlb flush APIs) is that it often involves the core mm code telling > the arch how it thinks ptes,tlbs,caches should be managed, rather than > I think the better approach would be telling the arch what it wants to > do. > > We are getting better slowly I think (eg. you note that set_pte_at is > no longer used as a generic "do anything"), but I won't dispute that > this whole area could use an overhaul; a document for all the rules, > a single person or point of responsibility for those rules... I agree. To a certain extent this is what BSD does in it's pmap layer, except that they don't have the page table datastructure abstraction like Linus does in the generic code, and which I think was a smart design decision on our side. All of the pmap modules in BSD are pretty big and duplicate a lot of code that arch's don't have to be mindful about under Linux. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-23 3:16 ` David Miller, Nick Piggin @ 2008-09-23 5:35 ` Benjamin Herrenschmidt 2008-09-23 6:18 ` Nick Piggin 0 siblings, 1 reply; 25+ messages in thread From: Benjamin Herrenschmidt @ 2008-09-23 5:35 UTC (permalink / raw) To: David Miller; +Cc: npiggin, jeremy, linux-mm, linux-kernel, hugh On Mon, 2008-09-22 at 20:16 -0700, David Miller wrote: > > To a certain extent this is what BSD does in it's pmap layer, except > that they don't have the page table datastructure abstraction like > Linus does in the generic code, and which I think was a smart design > decision on our side. > > All of the pmap modules in BSD are pretty big and duplicate a lot of > code that arch's don't have to be mindful about under Linux. I definitely agree, I don't think we want to go away from the page table as being the abstraction :-) But I'm wondering if we can do a little bit better with the accessors to those page tables. BTW. am I the only one to have got one copy of David's reply (that I'm quoting) coming with a From: Nick Piggin in the headers ? (apparently coming from kvack). Cheers, Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-23 5:35 ` Benjamin Herrenschmidt @ 2008-09-23 6:18 ` Nick Piggin 0 siblings, 0 replies; 25+ messages in thread From: Nick Piggin @ 2008-09-23 6:18 UTC (permalink / raw) To: Benjamin Herrenschmidt; +Cc: David Miller, jeremy, linux-mm, linux-kernel, hugh On Tue, Sep 23, 2008 at 03:35:06PM +1000, Benjamin Herrenschmidt wrote: > On Mon, 2008-09-22 at 20:16 -0700, David Miller wrote: > > > > To a certain extent this is what BSD does in it's pmap layer, except > > that they don't have the page table datastructure abstraction like > > Linus does in the generic code, and which I think was a smart design > > decision on our side. > > > > All of the pmap modules in BSD are pretty big and duplicate a lot of > > code that arch's don't have to be mindful about under Linux. > > I definitely agree, I don't think we want to go away from the page table > as being the abstraction :-) But I'm wondering if we can do a little bit > better with the accessors to those page tables. > > BTW. am I the only one to have got one copy of David's reply (that I'm > quoting) coming with a From: Nick Piggin in the headers ? (apparently > coming from kvack). No. I see that, and so does marc. http://marc.info/?t=122184627700007&r=1&w=2 And I've only ever seen it from Dave on kvack. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-23 3:10 ` Nick Piggin 2008-09-23 3:16 ` David Miller, Nick Piggin @ 2008-09-23 5:31 ` Benjamin Herrenschmidt 2008-09-23 6:13 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 25+ messages in thread From: Benjamin Herrenschmidt @ 2008-09-23 5:31 UTC (permalink / raw) To: Nick Piggin Cc: Jeremy Fitzhardinge, Linux Memory Management List, Linux Kernel list, Hugh Dickins On Tue, 2008-09-23 at 05:10 +0200, Nick Piggin wrote: > We are getting better slowly I think (eg. you note that set_pte_at is > no longer used as a generic "do anything"), but I won't dispute that > this whole area could use an overhaul; a document for all the rules, > a single person or point of responsibility for those rules... Can we nowadays -rely- on set_pte_at() never being called to overwrite an already valid PTE ? I mean, it looks like the generic code doesn't do it anymore but I wonder if it's reasonable to forbid that from coming back ? That would allow me to remove some hacks in ppc64 and simplify some upcoming ppc32 code. Cheers, Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-23 5:31 ` Benjamin Herrenschmidt @ 2008-09-23 6:13 ` Jeremy Fitzhardinge 2008-09-23 6:49 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 25+ messages in thread From: Jeremy Fitzhardinge @ 2008-09-23 6:13 UTC (permalink / raw) To: benh Cc: Nick Piggin, Linux Memory Management List, Linux Kernel list, Hugh Dickins Benjamin Herrenschmidt wrote: > On Tue, 2008-09-23 at 05:10 +0200, Nick Piggin wrote: > >> We are getting better slowly I think (eg. you note that set_pte_at is >> no longer used as a generic "do anything"), but I won't dispute that >> this whole area could use an overhaul; a document for all the rules, >> a single person or point of responsibility for those rules... >> > > Can we nowadays -rely- on set_pte_at() never being called to overwrite > an already valid PTE ? I mean, it looks like the generic code doesn't do > it anymore but I wonder if it's reasonable to forbid that from coming > back ? That would allow me to remove some hacks in ppc64 and simplify > some upcoming ppc32 code. > A good first step might be to define some conventions. For example, define that set_pte*() *always* means setting a non-valid pte to either a new non-valid state (like a swap reference) or to a valid state. modify_pte() would modify the flags of a valid pte, giving a new valid pte. etc... It may be that a given architecture collapses some or all of these down to the same underlying functionality, but it would allow the core intent to be clearly expressed. What is the complete set of primitives we need? I also noticed that a number of the existing pagetable operations are used only once or twice in the core code; I wonder if we really need such special cases, or whether we can make each arch pte operation carry a bit more weight? Also, rather than leaving all the rule enforcing to documentation and a maintainer, we should also consider having a debug mode which adds enough paranoid checks to each operation so that any rule breakage will fail obviously on all architectures. J -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-23 6:13 ` Jeremy Fitzhardinge @ 2008-09-23 6:49 ` Benjamin Herrenschmidt 2008-09-23 9:50 ` Nick Piggin 0 siblings, 1 reply; 25+ messages in thread From: Benjamin Herrenschmidt @ 2008-09-23 6:49 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Nick Piggin, Linux Memory Management List, Linux Kernel list, Hugh Dickins > A good first step might be to define some conventions. For example, > define that set_pte*() *always* means setting a non-valid pte to either > a new non-valid state (like a swap reference) or to a valid state. > modify_pte() would modify the flags of a valid > pte, giving a new valid pte. etc... Yup. Or make it clear that ptep_set_access_flags() should only be used to -relax- access (ie, set dirty, writeable, accessed, ... but not remove any of them). > It may be that a given architecture collapses some or all of these down > to the same underlying functionality, but it would allow the core intent > to be clearly expressed. > > What is the complete set of primitives we need? I also noticed that a > number of the existing pagetable operations are used only once or twice > in the core code; I wonder if we really need such special cases, or > whether we can make each arch pte operation carry a bit more weight? Yes, that was some of my concern. It's getting close to having one API per call site :-) > Also, rather than leaving all the rule enforcing to documentation and a > maintainer, we should also consider having a debug mode which adds > enough paranoid checks to each operation so that any rule breakage will > fail obviously on all architectures. We could do both. Now, regarding operations, let's first find the major call sites, see what I miss. I'm omitting free_* in memory.c as those are for freeing pte pages, not accessing PTEs themselves. I'm also ignoring read-only call sites and hugetlb for now. * None-iterative accessors - handle_pte_fault in memory.c, on "fixup" faults (pte is present and it's not a COW), for fixing up DIRTY and ACCESSED (btw, could we make that also fixup EXEC ? I would like this for some stuff I'm working on at the moment, ie set it if the vma has VM_EXEC and it was lost from the PTE as I might want to mask it out of PTEs under some circumstances). Textbook usage of ptep_set_access_flags(), so that's fine. - do_wp_page() in memory.c for COW or fixup of shared writeable mapping writeable-ness. Doesn't overwrite existing PTE for COW anymore, it uses clear_flush nowadays and fixup of shared writeable mapping uses ptep_set_access_flags() as it should, so that's all good. - insert_pfn() and insert_page() still in memory.c for fancy page faults. Just a trivial set_pte_at() of a !present one, no big deal here - RMAP ones ? Some ad-hoc stuff due to _notify thingies. * Iterative accessors (some don't batch, maybe they could/should). - zapping a mapping (zap_p*) in memory.c - fork (copy_p*) in memory.c could batch better maybe ? - setting linear user mappings (remap_p*) in memory.c, trivial set_pte_at() on a range, pte's should be !present I think. - mprotect (change_p*) in memory.c, which has the problem I mentioned - moving page tables (move_p*), pretty trivial clear_flush + set_pte_at - clear_regs_pte_range via walk_page_range in fs/proc/task_mmu.c, does a test_and_clear_young, flushes mm afterward, could use some lazy stuff so we can batch properly on ppc64. - vmalloc, that's a bit special and kernel only, doesn't have nasty races between creating/tearing down mappings vs. using them - highmem I leave alone for now, it's mostly trivial set_pte_at & flushing for normal kmap but kmap_atomic can be nasty, though it's arch specific. - some stuff in fremap I'm not too familiar with and I need to run... What did I miss ? Cheers, Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-23 6:49 ` Benjamin Herrenschmidt @ 2008-09-23 9:50 ` Nick Piggin 2008-09-23 11:54 ` peter 0 siblings, 1 reply; 25+ messages in thread From: Nick Piggin @ 2008-09-23 9:50 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Jeremy Fitzhardinge, Linux Memory Management List, Linux Kernel list, Hugh Dickins On Tue, Sep 23, 2008 at 04:49:32PM +1000, Benjamin Herrenschmidt wrote: > > > A good first step might be to define some conventions. For example, > > define that set_pte*() *always* means setting a non-valid pte to either > > a new non-valid state (like a swap reference) or to a valid state. > > modify_pte() would modify the flags of a valid > > pte, giving a new valid pte. etc... > > Yup. Or make it clear that ptep_set_access_flags() should only be used > to -relax- access (ie, set dirty, writeable, accessed, ... but not > remove any of them). > > > It may be that a given architecture collapses some or all of these down > > to the same underlying functionality, but it would allow the core intent > > to be clearly expressed. > > > > What is the complete set of primitives we need? I also noticed that a > > number of the existing pagetable operations are used only once or twice > > in the core code; I wonder if we really need such special cases, or > > whether we can make each arch pte operation carry a bit more weight? > > Yes, that was some of my concern. It's getting close to having one API > per call site :-) I don't think that is a huge problem as such... if there was lots of repeated uses of the API I'd also be concerned about mm/ code not being well factored :) My concern is that things aren't well documented, maybe not consistent enough, and are usually named according to what their implementation looks like on the favourite arch of the person who introduced them, rather than what the VM needs to get done :P Which leads some architectures (I'm looking at ia64 ;)) to reinvent things or just add new primitives rather than finding existing common code. Which makes the problem worse. And it makes generic VM developers not be able to follow what's going on with all the different architectures... > > Also, rather than leaving all the rule enforcing to documentation and a > > maintainer, we should also consider having a debug mode which adds > > enough paranoid checks to each operation so that any rule breakage will > > fail obviously on all architectures. > > We could do both. > > Now, regarding operations, let's first find the major call sites, see > what I miss. I'm omitting free_* in memory.c as those are for freeing > pte pages, not accessing PTEs themselves. I'm also ignoring read-only > call sites and hugetlb for now. > > * None-iterative accessors > > - handle_pte_fault in memory.c, on "fixup" faults (pte is present and > it's not a COW), for fixing up DIRTY and ACCESSED (btw, could we make > that also fixup EXEC ? I would like this for some stuff I'm working on > at the moment, ie set it if the vma has VM_EXEC and it was lost from the > PTE as I might want to mask it out of PTEs under some circumstances). > Textbook usage of ptep_set_access_flags(), so that's fine. > > - do_wp_page() in memory.c for COW or fixup of shared writeable mapping > writeable-ness. Doesn't overwrite existing PTE for COW anymore, it uses > clear_flush nowadays and fixup of shared writeable mapping uses > ptep_set_access_flags() as it should, so that's all good. This is one example of being too low level. It wouldn't be hard to have an arch where the SMC race does not apply, or even it is probably possible to avoid the flush in the case of single-threaded mm. Can't do that however, because the call asks for a flush so we must flush. > - insert_pfn() and insert_page() still in memory.c for fancy page > faults. Just a trivial set_pte_at() of a !present one, no big deal here > > - RMAP ones ? Some ad-hoc stuff due to _notify thingies. rmap ones should only do set_pte_at after a clear_flush I think. > * Iterative accessors (some don't batch, maybe they could/should). > > - zapping a mapping (zap_p*) in memory.c > - fork (copy_p*) in memory.c could batch better maybe ? > - setting linear user mappings (remap_p*) in memory.c, trivial > set_pte_at() on a range, pte's should be !present I think. Yes. > - mprotect (change_p*) in memory.c, which has the problem I mentioned > - moving page tables (move_p*), pretty trivial clear_flush + set_pte_at > - clear_regs_pte_range via walk_page_range in fs/proc/task_mmu.c, does > a test_and_clear_young, flushes mm afterward, could use some lazy stuff > so we can batch properly on ppc64. > - vmalloc, that's a bit special and kernel only, doesn't have nasty > races between creating/tearing down mappings vs. using them > - highmem I leave alone for now, it's mostly trivial set_pte_at & > flushing for normal kmap but kmap_atomic can be nasty, though it's arch > specific. It's also for kva rather than uva, so it's a bit different too... > - some stuff in fremap I'm not too familiar with and I need to run... fremap is only ever doing clear_flush on present ptes, or set_pte_at on pte_none_ptes (and in this case it is going from !present to !present, which is probably also unusual in some cases for arch code to deal with, although it is not restricted to fremap I guess) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-23 9:50 ` Nick Piggin @ 2008-09-23 11:54 ` peter 0 siblings, 0 replies; 25+ messages in thread From: peter @ 2008-09-23 11:54 UTC (permalink / raw) To: Nick Piggin Cc: Benjamin Herrenschmidt, Jeremy Fitzhardinge, Linux Memory Management List, Linux Kernel list, Hugh Dickins >>>>> "Nick" == Nick Piggin <npiggin@suse.de> writes: Nick> On Tue, Sep 23, 2008 at 04:49:32PM +1000, Benjamin Herrenschmidt Nick> wrote: >> > What is the complete set of primitives we need? I also noticed >> that a > number of the existing pagetable operations are used only >> once or twice > in the core code; I wonder if we really need such >> special cases, or > whether we can make each arch pte operation >> carry a bit more weight? >> >> Yes, that was some of my concern. It's getting close to having one >> API per call site :-) Nick> I don't think that is a huge problem as such... if there was Nick> lots of repeated uses of the API I'd also be concerned about mm/ Nick> code not being well factored :) Is it worth taking another look at the page-table abstraction layer that Darren Williams posted here last year? PeterC -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-22 21:05 ` Benjamin Herrenschmidt 2008-09-23 3:10 ` Nick Piggin @ 2008-09-24 18:45 ` Hugh Dickins 2008-09-24 21:20 ` Benjamin Herrenschmidt 2008-09-24 22:17 ` Martin Schwidefsky 1 sibling, 2 replies; 25+ messages in thread From: Hugh Dickins @ 2008-09-24 18:45 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Jeremy Fitzhardinge, Linux Memory Management List, Linux Kernel list, Nick Piggin, Martin Schwidefsky On Tue, 23 Sep 2008, Benjamin Herrenschmidt wrote: > > The bug may have been there, as I said, lots of unwritten rules... > sometimes broken. I'm not necessarily blaming you, but there have been > lots of changes to the PTE accessors over the last 2 years and not > always under any control :-) > > In our case, the consequence is that the entry can be re-hashed because > the fact that it was already hashed and where it was hashed, which is > encoded in the PTE, gets lost by the clear. That means a potential > duplicate entry in the hash. A hard to hit race, but possible. Such a > condition is architecturally illegal and can cause things ranging from > incorrect translation to machine checks or checkstops (generally, on > LPAR machines, what will happen is your partition will get killed). The powerpc bug whereof you write appears to have been there since ... linux-2.4.0 or earlier: entry = ptep_get_and_clear(pte); set_pte(pte, pte_modify(entry, newprot)); But perhaps powerpc was slightly different back in those days. It sounds to me like a bug in your current ptep_get_and_clear(), not checking if already hashed? > I know s390 has different issues & constraints. Martin told me during > Plumbers that mprotect was probably also broken for him. Then I hope he will probably send Linus the fix. Though what we already have falls somewhat short of perfection, I've much more enthusiasm for fixing its bugs, than for any fancy redesign introducing its own bugs. Others have more stamina! Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-24 18:45 ` Hugh Dickins @ 2008-09-24 21:20 ` Benjamin Herrenschmidt 2008-09-24 21:57 ` Jeremy Fitzhardinge 2008-09-24 23:55 ` Hugh Dickins 2008-09-24 22:17 ` Martin Schwidefsky 1 sibling, 2 replies; 25+ messages in thread From: Benjamin Herrenschmidt @ 2008-09-24 21:20 UTC (permalink / raw) To: Hugh Dickins Cc: Jeremy Fitzhardinge, Linux Memory Management List, Linux Kernel list, Nick Piggin, Martin Schwidefsky On Wed, 2008-09-24 at 19:45 +0100, Hugh Dickins wrote: > The powerpc bug whereof you write appears to have been there since ... > linux-2.4.0 or earlier: > entry = ptep_get_and_clear(pte); > set_pte(pte, pte_modify(entry, newprot)); > > But perhaps powerpc was slightly different back in those days. > It sounds to me like a bug in your current ptep_get_and_clear(), > not checking if already hashed? Yes, I figured out the bug was already there. And no, it's not the right approach to have ptep_get_and_clear() flush because it would mean that call cannot batch flushes, and thus we would lose ability to batch in zap_pte_range(). > Though what we already have falls somewhat short of perfection, > I've much more enthusiasm for fixing its bugs, than for any fancy > redesign introducing its own bugs. Others have more stamina! Well, the current set accessor, as far as I'm concerned is a big pile of steaming shit that evolved from x86-specific gunk raped in different horrible ways to make it looks like it fits on other architectures and additionally mashed with goo to make it somewhat palatable by virtualization stuff. Yes, bugs can be fixed but it's still an horrible mess. Now, regarding the above bug, I'm afraid the only approaches I see that would work would be to have either a ptep_get_and_clear_flush(), which I suppose x86 virt. people will hate, or maybe to actually have a powerpc specific variant of the new start/commit hooks that does the flush. Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-24 21:20 ` Benjamin Herrenschmidt @ 2008-09-24 21:57 ` Jeremy Fitzhardinge 2008-09-24 22:07 ` Benjamin Herrenschmidt 2008-09-24 23:55 ` Hugh Dickins 1 sibling, 1 reply; 25+ messages in thread From: Jeremy Fitzhardinge @ 2008-09-24 21:57 UTC (permalink / raw) To: benh Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list, Nick Piggin, Martin Schwidefsky, Peter Chubb Benjamin Herrenschmidt wrote: > Well, the current set accessor, as far as I'm concerned is a big pile of > steaming shit that evolved from x86-specific gunk raped in different > horrible ways to make it looks like it fits on other architectures and > additionally mashed with goo to make it somewhat palatable by > virtualization stuff. Yes, bugs can be fixed but it's still an horrible > mess. > What do you propose then? Ideally one would like to get something that works for powerpc, s390, all the wacky ia64 modes as well as x86. The ia64 folks proposed something, but I've not looked at it closely. From an x86 virtualization perspective, something that's basically x86 with as much scope for batching and deferring as possible would be fine. As a start, what's the state machine for a pte? What states can it be in, and how does it move from state to state? It sounds like powerpc has at least one extra state above x86 (hashed, with the hash key stored in the pte itself?). > Now, regarding the above bug, I'm afraid the only approaches I see that > would work would be to have either a ptep_get_and_clear_flush(), which I > suppose x86 virt. people will hate, or maybe to actually have a powerpc > specific variant of the new start/commit hooks that does the flush. > ptep_get_and_clear() is not batchable anyway, because the x86 implementation requires an atomic xchg on the pte, which will likely result in some sort of trap (and if it doesn't then it doesn't need batching). The start/commit API was specifically so that we can do the mprotect (and fork COW updates) in a batchable way (in Xen its implemented with a pte update hypercall which updates the pte without affecting the A/D bits). J -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-24 21:57 ` Jeremy Fitzhardinge @ 2008-09-24 22:07 ` Benjamin Herrenschmidt 2008-09-24 22:43 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 25+ messages in thread From: Benjamin Herrenschmidt @ 2008-09-24 22:07 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list, Nick Piggin, Martin Schwidefsky, Peter Chubb > What do you propose then? Ideally one would like to get something that > works for powerpc, s390, all the wacky ia64 modes as well as x86. The > ia64 folks proposed something, but I've not looked at it closely. From > an x86 virtualization perspective, something that's basically x86 with > as much scope for batching and deferring as possible would be fine. That's where things get interesting. I liked Nick ideas of doing something transactional that could encompass the lock, bach and flushing but that may be too much at this stage... > As a start, what's the state machine for a pte? What states can it be > in, and how does it move from state to state? It sounds like powerpc > has at least one extra state above x86 (hashed, with the hash key stored > in the pte itself?). We store in the PTE whether it was hashed, and the location within a hash bucket. (For each hash value, there's 8 buckets, or rather 16 if you count our secondary hashing). We must never write a new valid PTE after we cleared a hashed one without having a flush in between. On 32 bits we have less state (only the 'hashed' bit) but the problem is similar, though we handle it differently: we never clear the hash bit until we flush the hash, ie, pte_clear doesn't clear the hash bit. On 64-bit we do things differently, we do clear PTEs and pile up in a per-cpu batch what needs to be flushed, the flush then happens when leaving lazy mode. > ptep_get_and_clear() is not batchable anyway, because the x86 > implementation requires an atomic xchg on the pte, which will likely > result in some sort of trap (and if it doesn't then it doesn't need > batching). Well, ptep_get_and_clear() used to be used by zap_pte_range() which I _HOPE_ was batchable on x86 :-) Nowadays, there's this new ptep_get_and_clear_full() (yet another totally meaningless name for an ad-hoc API added for some random special purpose) that zap_pte_range() uses. Maybe that one is now subtly different such as it can be used to batch on x86 ? In any case, powerpc batches -everything- (unless it's called *_flush in which case the flush is immediate) in a private per-cpu batch and flushes the hash when leaving lazy mode. > The start/commit API was specifically so that we can do the > mprotect (and fork COW updates) in a batchable way (in Xen its > implemented with a pte update hypercall which updates the pte without > affecting the A/D bits). I think we have different ideas of what batch means but yeah, we do batch everything including these on powerpc without the new start/commit interface. Ben. > J > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-24 22:07 ` Benjamin Herrenschmidt @ 2008-09-24 22:43 ` Jeremy Fitzhardinge 2008-09-24 22:53 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 25+ messages in thread From: Jeremy Fitzhardinge @ 2008-09-24 22:43 UTC (permalink / raw) To: benh Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list, Nick Piggin, Martin Schwidefsky, Peter Chubb Benjamin Herrenschmidt wrote: >> What do you propose then? Ideally one would like to get something that >> works for powerpc, s390, all the wacky ia64 modes as well as x86. The >> ia64 folks proposed something, but I've not looked at it closely. From >> an x86 virtualization perspective, something that's basically x86 with >> as much scope for batching and deferring as possible would be fine. >> > > That's where things get interesting. I liked Nick ideas of doing > something transactional that could encompass the lock, bach and flushing > but that may be too much at this stage... > Yes, that sounds fine in principle, but the practise gets tricky. The trouble with that kind of interface is that it can be fairly heavyweight unless you amortise the cost of the transaction setup/commit with multiple operations. >> ptep_get_and_clear() is not batchable anyway, because the x86 >> implementation requires an atomic xchg on the pte, which will likely >> result in some sort of trap (and if it doesn't then it doesn't need >> batching). >> > > Well, ptep_get_and_clear() used to be used by zap_pte_range() which I > _HOPE_ was batchable on x86 :-) > > Nowadays, there's this new ptep_get_and_clear_full() (yet another > totally meaningless name for an ad-hoc API added for some random special > purpose) that zap_pte_range() uses. Maybe that one is now subtly > different such as it can be used to batch on x86 ? > Yeah. zap_pte_range isn't great when doing a munmap, but the _full gunk lets it special-case process teardown and it ends up not being a performance problem (in the Xen case, we've already switch to another pagetable at that point, so it isn't really a pagetable any more and needs no hypercalls). > In any case, powerpc batches -everything- (unless it's called *_flush in > which case the flush is immediate) in a private per-cpu batch and > flushes the hash when leaving lazy mode. > Are you using the lazy_mmu hooks we put in, or something else? >> The start/commit API was specifically so that we can do the >> mprotect (and fork COW updates) in a batchable way (in Xen its >> implemented with a pte update hypercall which updates the pte without >> affecting the A/D bits). >> > > I think we have different ideas of what batch means but yeah, we do > batch everything including these on powerpc without the new start/commit > interface. Likely; it's one of those overused generic words. The specific meaning I'm using is "we can roll a bunch of updates into a single hypercall". Operations which do atomic RMW or fetch-set are typically not batchable because the caller wants the result *now* and can't wait for a deferred result. J -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-24 22:43 ` Jeremy Fitzhardinge @ 2008-09-24 22:53 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 25+ messages in thread From: Benjamin Herrenschmidt @ 2008-09-24 22:53 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list, Nick Piggin, Martin Schwidefsky, Peter Chubb > Yes, that sounds fine in principle, but the practise gets tricky. The > trouble with that kind of interface is that it can be fairly heavyweight > unless you amortise the cost of the transaction setup/commit with > multiple operations. I agree. Anyway, I'm sure if we get everybody around the same table, we can find something, maybe not that high level, but a better set of accessors that fits all needs with a more precise semantic. Anyway, I'll see if I can come up with ideas later that we can discuss. It's also all inline so it does give us flexibility in having things compile down to nothing on archs where they aren't needed. > Are you using the lazy_mmu hooks we put in, or something else? Yes. We used to do it differently but it was actually racy in a subtle way, and I switched it to use your lazy_mmu hooks a while ago. > Likely; it's one of those overused generic words. The specific meaning > I'm using is "we can roll a bunch of updates into a single hypercall". > Operations which do atomic RMW or fetch-set are typically not batchable > because the caller wants the result *now* and can't wait for a deferred > result. Right. For us it's mostly about flushing the hash entries, though there is no fundamental difference I believe here, it's about updating a cache, whether it's the TLB, our hash table, hypervisor shadows, etc... and we -should- be able to find a more sensible common ground. Cheers, Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-24 21:20 ` Benjamin Herrenschmidt 2008-09-24 21:57 ` Jeremy Fitzhardinge @ 2008-09-24 23:55 ` Hugh Dickins 2008-09-25 1:04 ` Benjamin Herrenschmidt 1 sibling, 1 reply; 25+ messages in thread From: Hugh Dickins @ 2008-09-24 23:55 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Jeremy Fitzhardinge, Linux Memory Management List, Linux Kernel list, Nick Piggin, Martin Schwidefsky On Thu, 25 Sep 2008, Benjamin Herrenschmidt wrote: > > Now, regarding the above bug, I'm afraid the only approaches I see that > would work would be to have either a ptep_get_and_clear_flush(), which I > suppose x86 virt. people will hate, or maybe to actually have a powerpc > specific variant of the new start/commit hooks that does the flush. Whyever not the latter? Jeremy seems to have gifted that to you, for precisely such a purpose. Hugh p.s. I surely agree with you over the name ptep_get_and_clear_full(): horrid, even more confusing than the tlb->fullmm from which it derives its name. I expect I'd agree with you over a lot more too, but please, bugfixes first. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-24 23:55 ` Hugh Dickins @ 2008-09-25 1:04 ` Benjamin Herrenschmidt 2008-09-25 18:15 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 25+ messages in thread From: Benjamin Herrenschmidt @ 2008-09-25 1:04 UTC (permalink / raw) To: Hugh Dickins Cc: Jeremy Fitzhardinge, Linux Memory Management List, Linux Kernel list, Nick Piggin, Martin Schwidefsky On Thu, 2008-09-25 at 00:55 +0100, Hugh Dickins wrote: > > Whyever not the latter? Jeremy seems to have gifted that to you, > for precisely such a purpose. Yeah. Not that I don't quite understand what the point of the start/modify/commit thing the way it's currently used in mprotect since we are doing the whole transaction for a single PTE change, ie how does that help with hypervisors vs. a single ptep_modify_protection() for example is beyond me :-) When I think about transactions, I think about starting a transaction, changing a -bunch- of PTEs, then commiting... Essentially I see the PTE lock thing as being a transaction. Cheers, Ben. > Hugh > > p.s. I surely agree with you over the name ptep_get_and_clear_full(): > horrid, even more confusing than the tlb->fullmm from which it derives > its name. I expect I'd agree with you over a lot more too, but > please, bugfixes first. Sure. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-25 1:04 ` Benjamin Herrenschmidt @ 2008-09-25 18:15 ` Jeremy Fitzhardinge 2008-09-25 21:44 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 25+ messages in thread From: Jeremy Fitzhardinge @ 2008-09-25 18:15 UTC (permalink / raw) To: benh Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list, Nick Piggin, Martin Schwidefsky Benjamin Herrenschmidt wrote: > On Thu, 2008-09-25 at 00:55 +0100, Hugh Dickins wrote: > >> Whyever not the latter? Jeremy seems to have gifted that to you, >> for precisely such a purpose. >> > > Yeah. Not that I don't quite understand what the point of the > start/modify/commit thing the way it's currently used in mprotect since > we are doing the whole transaction for a single PTE change, ie how does > that help with hypervisors vs. a single ptep_modify_protection() for > example is beyond me :-) > > When I think about transactions, I think about starting a transaction, > changing a -bunch- of PTEs, then commiting... Essentially I see the PTE > lock thing as being a transaction. I think we need to be a bit clearer about the terminology: A batch is a bunch of things that can be optionally deferred so they can be done in chunks. A transaction is something that must either complete successfully, or have no effect at all. Doing multiple things in a transaction means that they must all complete or none. (In general we assume there's nothing about these low-level pagetable operations which can fail, so we can ignore the failure part of transactions.) In this case, a batch is not a transaction. Doing things between arch_enter_lazy_mmu_mode/arch_leave_lazy_mmu_mode makes no guarantees about when operations are performed, other than guaranteeing that they'll all be done by the time arch_leave_lazy_mmu_mode returns. Everything about how things are chunked into batches are up to the underlying architecture, and the calling code can't make any assumptions about it (the specific problem we've had to fix in a couple of places is things expecting to be able to read back their recent pagetable modifications immediately after issuing the call). The ptep_modify_prot_start/commit pair specifies a single pte update in such a way to allow more implementation flexibility - ie, there's no naked requirement for an atomic fetch-and-clear operation. I chose the transaction-like terminology to emphasize that the start/commit functions must be strictly paired; there's no way to fail or abort the "transaction". A whole group of those start/commit pairs can be batched together without affecting their semantics. J -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-25 18:15 ` Jeremy Fitzhardinge @ 2008-09-25 21:44 ` Benjamin Herrenschmidt 2008-09-25 22:27 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 25+ messages in thread From: Benjamin Herrenschmidt @ 2008-09-25 21:44 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list, Nick Piggin, Martin Schwidefsky On Thu, 2008-09-25 at 11:15 -0700, Jeremy Fitzhardinge wrote: > The ptep_modify_prot_start/commit pair specifies a single pte update in > such a way to allow more implementation flexibility - ie, there's no > naked requirement for an atomic fetch-and-clear operation. I chose the > transaction-like terminology to emphasize that the start/commit > functions must be strictly paired; there's no way to fail or abort the > "transaction". A whole group of those start/commit pairs can be batched > together without affecting their semantics. I still can't see the point of having now 3 functions instead of just one such as ptep_modify_protection(). I don't see what it buys you other than adding gratuituous new interfaces. Ben;. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-25 21:44 ` Benjamin Herrenschmidt @ 2008-09-25 22:27 ` Jeremy Fitzhardinge 2008-09-25 23:02 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 25+ messages in thread From: Jeremy Fitzhardinge @ 2008-09-25 22:27 UTC (permalink / raw) To: benh Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list, Nick Piggin, Martin Schwidefsky Benjamin Herrenschmidt wrote: > On Thu, 2008-09-25 at 11:15 -0700, Jeremy Fitzhardinge wrote: > >> The ptep_modify_prot_start/commit pair specifies a single pte update in >> such a way to allow more implementation flexibility - ie, there's no >> naked requirement for an atomic fetch-and-clear operation. I chose the >> transaction-like terminology to emphasize that the start/commit >> functions must be strictly paired; there's no way to fail or abort the >> "transaction". A whole group of those start/commit pairs can be batched >> together without affecting their semantics. >> > > I still can't see the point of having now 3 functions instead of just > one such as ptep_modify_protection(). I don't see what it buys you other > than adding gratuituous new interfaces. > Yeah, that would work too; that's pretty much how Xen implements it anyway. The main advantage of the start/commit pair is that the resulting code was completely unchanged from the old code. The mprotect sequence using ptep_modify_protection would end up reading the pte twice before writing it. J -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-25 22:27 ` Jeremy Fitzhardinge @ 2008-09-25 23:02 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 25+ messages in thread From: Benjamin Herrenschmidt @ 2008-09-25 23:02 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Hugh Dickins, Linux Memory Management List, Linux Kernel list, Nick Piggin, Martin Schwidefsky On Thu, 2008-09-25 at 15:27 -0700, Jeremy Fitzhardinge wrote: > Yeah, that would work too; that's pretty much how Xen implements it > anyway. The main advantage of the start/commit pair is that the > resulting code was completely unchanged from the old code. The mprotect > sequence using ptep_modify_protection would end up reading the pte twice > before writing it. Not necessarily .. depends how you factor out the interface to it. Anyway, not a big deal now. I'll do a patch to fix the hole on powerpc, and if my brain clicks, over the next few weeks, I'll see if I can come up with an overall nicer API covering all usages. In many case might just be a matter of giving a saner name to existing calls and documenting them properly tho :-) Cheers, Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: PTE access rules & abstraction 2008-09-24 18:45 ` Hugh Dickins 2008-09-24 21:20 ` Benjamin Herrenschmidt @ 2008-09-24 22:17 ` Martin Schwidefsky 1 sibling, 0 replies; 25+ messages in thread From: Martin Schwidefsky @ 2008-09-24 22:17 UTC (permalink / raw) To: Hugh Dickins Cc: Benjamin Herrenschmidt, Jeremy Fitzhardinge, Linux Memory Management List, Linux Kernel list, Nick Piggin On Wed, 2008-09-24 at 19:45 +0100, Hugh Dickins wrote: > > I know s390 has different issues & constraints. Martin told me during > > Plumbers that mprotect was probably also broken for him. > > Then I hope he will probably send Linus the fix. > > Though what we already have falls somewhat short of perfection, > I've much more enthusiasm for fixing its bugs, than for any fancy > redesign introducing its own bugs. Others have more stamina! As far as I can tell the current code should work. It is not pretty though, in particular the nasty pairing of flush_tlb_mm() with ptep_set_wrprotect() and flush_tlb_range() with change_protection() is fragile. For me the question is if we can find a sensible set of basic primitives that work for all architectures in a performant way. This is really hard.. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2008-09-25 23:02 UTC | newest] Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2008-09-19 17:42 PTE access rules & abstraction Benjamin Herrenschmidt 2008-09-22 6:22 ` Jeremy Fitzhardinge 2008-09-22 21:05 ` Benjamin Herrenschmidt 2008-09-23 3:10 ` Nick Piggin 2008-09-23 3:16 ` David Miller, Nick Piggin 2008-09-23 5:35 ` Benjamin Herrenschmidt 2008-09-23 6:18 ` Nick Piggin 2008-09-23 5:31 ` Benjamin Herrenschmidt 2008-09-23 6:13 ` Jeremy Fitzhardinge 2008-09-23 6:49 ` Benjamin Herrenschmidt 2008-09-23 9:50 ` Nick Piggin 2008-09-23 11:54 ` peter 2008-09-24 18:45 ` Hugh Dickins 2008-09-24 21:20 ` Benjamin Herrenschmidt 2008-09-24 21:57 ` Jeremy Fitzhardinge 2008-09-24 22:07 ` Benjamin Herrenschmidt 2008-09-24 22:43 ` Jeremy Fitzhardinge 2008-09-24 22:53 ` Benjamin Herrenschmidt 2008-09-24 23:55 ` Hugh Dickins 2008-09-25 1:04 ` Benjamin Herrenschmidt 2008-09-25 18:15 ` Jeremy Fitzhardinge 2008-09-25 21:44 ` Benjamin Herrenschmidt 2008-09-25 22:27 ` Jeremy Fitzhardinge 2008-09-25 23:02 ` Benjamin Herrenschmidt 2008-09-24 22:17 ` Martin Schwidefsky
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox