On Mon, Nov 10, 2014 at 3:53 PM, Linus Torvalds wrote: > > So I am fine with that, it's the details that confuse me. The thing > doesn't seem to be generic enough to be used for arbitrary page > tables, with (for example) the shifts fixed by the VM page size and > the size of the pte entry type. Also, the levels seem to be very > infexible, with the page table entries being the simple case, but then > you have that "pdep" thing that seems to be just _one_ level of page > directory. Ok, so let me just put my money where my mouth is, and show some example code of a tree walker that I think is actually more generic. Sorry for the delay, I got distracted by other things, and I wanted to write something to show what I think might be a better approach. NOTE NOTE NOTE! I'm not saying you have to do it this way. But before I even show the patch, let me show you the "tree descriptor" from my stupid test-program that uses it, and hopefully that will show what I'm really aiming for: struct tree_walker_definition x86_64_def = { .total_bits = 48, .start = 0, .end = 0x7fffffffffff, .levels = { { .level_bits = 9, .lookup = pgd_lookup }, { .level_bits = 9, .lookup = pud_lookup }, { .level_bits = 9, .lookup = pmd_lookup }, { .level_bits = 9, .walker = pte_walker } } }; so basically, the *concept* is that you can describe a real page table by actually *describing* it. What the above does is tell you: - the amount of bits the tables can cover (48 is four levels of 9 bits each, leaving 12 bits - 4096 bytes - for the actual pages) - limit the range that can be walked (this isn't really all that important, but it does, for example, mean that the walker will fundamentally refuse to give access to the kernel mapping) - show how the different levels work, and what their sizes are and how you look them up or walk them, starting from the top-most. Anyway, I think a descriptor like the above looks *understandable*. It kind of stands on its own, even without showing the actual code. Now, the code to actually *walk* the above tree looks like this: struct tree_walker walk = { .first = 4096, .last = 4096*512*3, .walk = show_walk, .hole = show_hole, .pre_walk = show_pre_walk, .post_walk = show_post_walk, }; walk_tree((struct tree_entry *)pgd, &x86_64_def, &walk); ie you use the "walk_tree()" function to walk a particular tree (in this case it's a fake page table directory in "pgd", see the details in the stupid test-application), giving it the tree definition and the "walk" parameters that show what should happen for particular details (quite often hole/pre-walk/post-walk may be NULL, my test app just shows them being called). Now,. in addition to that, each tree description obviously needs the functions to show how to look up the different levels ("lookup" for moving from one level to another, and "walker" for actually walking the last level page table knowing how "present" bits etc work. Now, your code had a "uint64_t mask" for the present bits, which probably works reasonably well in practice, but I really prefer to just have that "walker" callback instead. That way the page tables can look like *anything*, and you can walk them, without having magic rules that there has to be a particular bit pattern that says it's "present". Also, my walker actually does super-pages - ie one of the mid-level page tables could map one big area. I didn't much test it, but the code is actually fairly straightforward, the way it's all been set up. So it might be buggy, but it's *close*. Now, one place we differ is on locking. I actually think that the person who asks to walk the tree should just do the locking themselves. You can't really walk the tree without knowing what kind of tree it is, and so I think the caller should just do the locking. Obviously, the tree walker itself may have some locking in the "pre_walk/post_walk" thing and in its lookup routines, so the description of the tree can contain some locking of its own, but I did *not* want to make the infrastructure itself force any particular locking strategy. So this does something quite different from what your patch actually did, and does that different thing very differently. It may not really match what you are aiming for, but I'd *really* like the first implementation of HMM that gets merged to not over-design the locking (which I think yours did), and I want it to make *sense* (which I don't think your patch did). Also, please note that this *is* just an example. It has an example user (that is just a stupid user-level toy app to show how it all is put together), but it's not necessarily all that featureful, and it's definitely not very tested. But the code is actually fairly simple. But judge for yourself. Linus