Hugh Dickins wrote:
> On Wed, 2 May 2007, Nick Piggin wrote:

[snip]

> More on-topic, since you suggest doing more within vmtruncate_range
> than the filesystem: no, I'm afraid that's misdesigned, and I want
> to move almost all of it into the filesystem ->truncate_range.
> Because, if what vmtruncate_range is doing before it gets to the
> filesystem isn't to be just a waste of time, the filesystem needs
> to know what's going on in advance - just as notify_change warns
> the filesystem about a coming truncation. But easier than inventing
> some new notification is to move it all into the filesystem, with
> unmap_mapping_range+truncate_inode_pages_range its library helpers.

Well, I would prefer it to follow the same pattern as regular
truncate. I don't think it is misdesigned to call the filesystem
_first_, but if you do that, then the filesystem should call the VM
to prepare/finish the truncate, rather than open-coding the unmap
calls itself.

>>> But I'm pretty sure (to use your words!) regular truncate was not racy
>>> before: I believe Andrea's sequence count was handling that case fine,
>>> without a second unmap_mapping_range.
>>
>> OK, I think you're right. I _think_ it should also be OK with the
>> lock_page version as well: we should not be able to have any pages
>> after the first unmap_mapping_range call, because of the i_size
>> write. So if we have no pages, there is nothing to 'cow' from.
>
> I'd be delighted if you can remove those later unmap_mapping_ranges.
> As I recall, the important thing for the copy pages is to be holding
> the page lock (or whatever other serialization) on the copied page
> still while the copy page is inserted into pagetable: that looks
> to be so in your __do_fault.

Yeah, I think my thought process went wrong on those... I'll revisit.

>>> But it is a shame, and leaves me wondering what you gained with the
>>> page lock there.
>>>
>>> One thing gained is ease of understanding, and if your later patches
>>> build an edifice upon the knowledge of holding that page lock while
>>> faulting, I've no wish to undermine that foundation.
>>
>> It also fixes a bug, doesn't it? ;)
>
> Well, I'd come to think that perhaps the bugs would be solved by
> that second unmap_mapping_range alone, so the pagelock changes
> just a misleading diversion.
>
> I'm not sure how I feel about that: calling unmap_mapping_range a
> second time feels such a cheat, but if (big if) it does solve the
> races, and the pagelock method is as expensive as your numbers
> now suggest...

Well, aside from being terribly ugly, it means we can still drop the
dirty bit where we'd otherwise rather not, so I don't think we can do
that.

There may be some way we can do this without taking the page lock,
and I was going to look at it, but it is quite neat to just lock the
page...

I don't think performance is _that_ bad. On the P4 it costs a couple
of percent on the microbenchmarks. The G5 is worse, but even then I
don't think it is prohibitive. I'll try to improve that and get back
to you.

The problem is that lock/unlock_page is expensive on powerpc, and if
we improve that, we improve more than just the fault handler...

The attached patch gets performance up a bit by avoiding some
barriers and some cachelines:

G5          pagefault    fork           exec
2.6.21      1.49-1.51    164.6-170.8    741.8-760.3
+patch      1.71-1.73    175.2-180.8    780.5-794.2
+patch2     1.61-1.63    169.8-175.0    748.6-757.0

So that brings the fork/exec hits down to much less than 5%, and
would likely speed up other things that lock the page, like write or
page reclaim.
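For reference, the cost being discussed comes from roughly this - a
simplified sketch of the current generic paths (the real code lives
in include/linux/pagemap.h and mm/filemap.c), not a verbatim copy:

/*
 * Simplified sketch of the current generic page locking.  On powerpc,
 * test_and_set_bit() and the smp_mb pair around the clear all expand
 * to heavyweight sync barriers, and wake_up_page() pulls in the hashed
 * waitqueue cachelines even when nobody is waiting on the page.
 */
static inline void lock_page(struct page *page)
{
	might_sleep();
	if (TestSetPageLocked(page))	/* test_and_set_bit(PG_locked, ...) */
		__lock_page(page);	/* slow path: sleep until the bit clears */
}

void unlock_page(struct page *page)
{
	smp_mb__before_clear_bit();
	if (!TestClearPageLocked(page))	/* test_and_clear_bit(PG_locked, ...) */
		BUG();
	smp_mb__after_clear_bit();
	wake_up_page(page, PG_locked);
}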
I think we could get a further performance improvement by
implementing arch-specific bitops for the lock/unlock operations, so
we don't pay for things like smp_mb__before_clear_bit() where they
aren't needed, or for the full barrier in test_and_set_bit(). A rough
sketch of the idea is appended below.

--
SUSE Labs, Novell Inc.
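Untested sketch, for illustration only: test_and_set_bit_lock() and
clear_bit_unlock() are made-up names for arch primitives with
acquire/release (rather than full) ordering; they don't exist today.

/*
 * Hypothetical arch interface: a test-and-set with acquire ordering
 * and a clear with release ordering.  On powerpc the acquire side can
 * be larx/stcx. + isync and the release side lwsync + store, instead
 * of the two full sync barriers paid today.  Archs with nothing
 * lighter fall back to the fully ordered ops, so behaviour there is
 * unchanged.
 */
#ifndef test_and_set_bit_lock
#define test_and_set_bit_lock(nr, addr)	test_and_set_bit(nr, addr)
#endif
#ifndef clear_bit_unlock
#define clear_bit_unlock(nr, addr)				\
	do {							\
		smp_mb__before_clear_bit();			\
		clear_bit(nr, addr);				\
	} while (0)
#endif

static inline void lock_page(struct page *page)
{
	might_sleep();
	if (test_and_set_bit_lock(PG_locked, &page->flags))
		__lock_page(page);	/* unchanged slow path */
}

void unlock_page(struct page *page)
{
	if (!PageLocked(page))
		BUG();
	clear_bit_unlock(PG_locked, &page->flags);
	wake_up_page(page, PG_locked);	/* still needs its own ordering vs waiters */
}

Since the saving would live entirely in the arch bitops, every
lock-bit user would pick it up for free, not just the page lock.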