Date: Mon, 22 May 2006 20:31:22 +0100 (BST)
From: Hugh Dickins
Subject: tracking dirty pages patches
To: Peter Zijlstra
Cc: Andrew Morton, Linus Torvalds, David Howells, linux-mm@kvack.org

Belated observations on your "tracking dirty pages" patches.

page_wrprotect is a nice use of rmap, but I see a couple of problems.

One is in the lock ordering (there's info on mm lock ordering at the top of filemap.c, but I find the list at the top of rmap.c easier). set_page_dirty has always (awkwardly) been liable to be called from very low in the hierarchy, whereas you're assuming clear_page_dirty is called from much higher up. In most cases there's no problem (please cross-check to confirm that); but try_to_free_buffers in fs/buffer.c calls it while holding mapping->private_lock, so page_wrprotect called from test_clear_page_dirty then violates the order. If we're lucky and that is indeed the only violation, maybe Andrew can recommend a change to try_to_free_buffers to avoid it: I have no appreciation of the issues at that end myself.

The other worries are in page_wrprotect_one's block:

	entry = pte_mkclean(pte_wrprotect(*pte));
	ptep_establish(vma, address, pte, entry);
	update_mmu_cache(vma, address, entry);
	lazy_mmu_prot_update(entry);

ptep_establish, update_mmu_cache and lazy_mmu_prot_update are tricky arch-dependent functions which have hitherto only been used on the current task's mm, whereas you're now using them on (perhaps) another. Well, no, I'm wrong: ptrace's get_user_pages has been using them on another process; but that's not so common a case as to reassure me there won't be issues on some architectures. Quite likely ptep_establish and update_mmu_cache are okay for use in that way (it needs careful checking of the arches); at least they take a vma argument from which the mm can be found.
Whereas lazy_mmu_prot_update looks likely to be wrong; but it only does something on ia64, so you need to consult the ia64 mm gurus to check what's needed there. Maybe it'll just be a suboptimality (but it's more important now than in the ptrace case to make it optimal).

Is there a problem with page_wrprotect on VM_LOCKED vmas? I'm not sure: usually VM_LOCKED guarantees no faulting, and you abandon that.

Like others, I don't care for "VM_SharedWritable": you followed the VM_ReadHint macros, but this isn't a read hint, and those are weird anyway. Personally, I much prefer the explicit

	((vma->vm_flags & (VM_SHARED|VM_WRITE)) == (VM_SHARED|VM_WRITE))

which is the usual style for vm_flags tests throughout mm (except for the hugetlb test, designed to melt away without HUGETLB). But I may be in a minority on that; Linus did put an is_cow_mapping() in memory.c recently, so maybe you'd follow that and say is_shared_writable().

There's a clash and overlap between your "tracking dirty pages" patches and David Howells's "add notification of page becoming writable" patch. The merge of the two in 2.6.17-rc4-mm1 was wrong: your handle_pte_fault change meant it never reached David's page_mkwrite call in do_wp_page. Andrew has resolved that in -mm2 and -mm3 by dropping David's patches for the umpteenth time; but whatever happens to the rest of FS-Cache, I suspect the page_mkwrite patch will return (Anton wants it for NTFS, and probably others will want it, though none are relying upon it yet). Please take a look at that patch (David reposted it on linux-kernel last Friday, as 08/14 of FS-Cache try #10): I went over it with him many months ago, and it fills in at least one gap you're missing...

mprotect: we all forget mprotect, but mprotect(,,PROT_READ) followed by mprotect(,,PROT_WRITE) will give write permission back to all those ptes you've carefully taken write permission from. In the page_mkwrite patch, we found that was most easily fixed by using the !VM_SHARED vm_page_prot in place of the VM_SHARED one.
I expect you can simplify your patch a little by doing the same.

msync: I rather think that with your changes, if they're to stay, all the page table walking code can be removed from msync - since it already skipped vmas which were not VM_SHARED, and there's nothing to gain from syncing the !mapping_cap_account_dirty ones. I think MS_ASYNC becomes a no-op, and sys_msync so small it won't deserve its own msync.c (madvise.c wouldn't be a bad place for it). Or am I missing something?

I'm not convinced that optimize-follow_pages is a worthwhile optimization (in some cases you're adding an atomic inc and dec), and it's irrelevant to your tracking of dirty pages; but I don't feel strongly about it. Except, if it stays then it needs fixing: the flags 0 case does a put_page without having done a get_page. Though currently it seems only some powerpc #ifdef __DEBUG code uses follow_pages in that way: since that's not the common case, I think you'd best just remove the "if (flags & (FOLL_GET | FOLL_TOUCH))" condition from before the get_page.

(Why does follow_pages set_page_dirty at all? I _think_ it's in case the get_user_pages caller forgets to set_page_dirty when releasing. But hiding mistakes is not how we usually write kernel code, and your mods may change the balance there. Andrew will remember better whether that set_page_dirty has stronger justification.)

Hugh