Hugh Dickins wrote:
> On Wed, 2 May 2007, Nick Piggin wrote:

[snip]

> More on-topic, since you suggest doing more within vmtruncate_range
> than the filesystem: no, I'm afraid that's misdesigned, and I want
> to move almost all of it into the filesystem ->truncate_range.
> Because, if what vmtruncate_range is doing before it gets to the
> filesystem isn't to be just a waste of time, the filesystem needs
> to know what's going on in advance - just as notify_change warns
> the filesystem about a coming truncation. But easier than inventing
> some new notification is to move it all into the filesystem, with
> unmap_mapping_range+truncate_inode_pages_range its library helpers.

Well, I would prefer it to follow the same pattern as regular
truncate. I don't think it is misdesigned to call the filesystem
_first_, but if you do that, then the filesystem should call the VM
to prepare/finish the truncate, rather than open-coding the unmap
calls itself.

>>> But I'm pretty sure (to use your words!) regular truncate was not racy
>>> before: I believe Andrea's sequence count was handling that case fine,
>>> without a second unmap_mapping_range.
>>
>> OK, I think you're right. I _think_ it should also be OK with the
>> lock_page version as well: we should not be able to have any pages
>> after the first unmap_mapping_range call, because of the i_size
>> write. So if we have no pages, there is nothing to 'cow' from.
>
> I'd be delighted if you can remove those later unmap_mapping_ranges.
> As I recall, the important thing for the copy pages is to be holding
> the page lock (or whatever other serialization) on the copied page
> still while the copy page is inserted into pagetable: that looks
> to be so in your __do_fault.

Yeah, I think my thought process went wrong on those... I'll revisit.

>>> But it is a shame, and leaves me wondering what you gained with the
>>> page lock there.
>>>
>>> One thing gained is ease of understanding, and if your later patches
>>> build an edifice upon the knowledge of holding that page lock while
>>> faulting, I've no wish to undermine that foundation.
>>
>> It also fixes a bug, doesn't it? ;)
>
> Well, I'd come to think that perhaps the bugs would be solved by
> that second unmap_mapping_range alone, so the pagelock changes
> just a misleading diversion.
>
> I'm not sure how I feel about that: calling unmap_mapping_range a
> second time feels such a cheat, but if (big if) it does solve the
> races, and the pagelock method is as expensive as your numbers
> now suggest...

Well, aside from being terribly ugly, it means we can still drop the
dirty bit where we'd otherwise rather not, so I don't think we can do
that.

There may be some way we can do this without taking the page lock,
and I was going to look at it, but it is quite neat to just lock the
page...

I don't think performance is _that_ bad. On the P4 it costs a couple
of percent on the microbenchmarks. The G5 is worse, but even then I
don't think it is prohibitive. I'll try to improve that and get back
to you.

The problem is that lock/unlock_page is expensive on powerpc, and if
we improve that, we improve more than just the fault handler...

The attached patch gets performance up a bit by avoiding some
barriers and some cachelines:

G5          pagefault    fork           exec
2.6.21      1.49-1.51    164.6-170.8    741.8-760.3
+patch      1.71-1.73    175.2-180.8    780.5-794.2
+patch2     1.61-1.63    169.8-175.0    748.6-757.0

So that brings the fork/exec hits down to much less than 5%, and
would likely speed up other things that lock the page, like write or
page reclaim.
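For reference, the cost being discussed comes from roughly this - a
simplified sketch of the current generic paths (the real code lives
in include/linux/pagemap.h and mm/filemap.c), not a verbatim copy:

/*
 * Simplified sketch of the current generic page locking.  On powerpc,
 * test_and_set_bit() and the smp_mb pair around the clear all expand
 * to heavyweight sync barriers, and wake_up_page() pulls in the hashed
 * waitqueue cachelines even when nobody is waiting on the page.
 */
static inline void lock_page(struct page *page)
{
	might_sleep();
	if (TestSetPageLocked(page))	/* test_and_set_bit(PG_locked, ...) */
		__lock_page(page);	/* slow path: sleep until the bit clears */
}

void unlock_page(struct page *page)
{
	smp_mb__before_clear_bit();
	if (!TestClearPageLocked(page))	/* test_and_clear_bit(PG_locked, ...) */
		BUG();
	smp_mb__after_clear_bit();
	wake_up_page(page, PG_locked);
}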
I think we could get a further performance improvement by
implementing arch-specific bitops for the lock/unlock operations, so
we don't pay for things like smp_mb__before_clear_bit() where they
aren't needed, or for the full barrier in test_and_set_bit(). A rough
sketch of the idea is appended below.

--
SUSE Labs, Novell Inc.
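Untested sketch, for illustration only: test_and_set_bit_lock() and
clear_bit_unlock() are made-up names for arch primitives with
acquire/release (rather than full) ordering; they don't exist today.

/*
 * Hypothetical arch interface: a test-and-set with acquire ordering
 * and a clear with release ordering.  On powerpc the acquire side can
 * be larx/stcx. + isync and the release side lwsync + store, instead
 * of the two full sync barriers paid today.  Archs with nothing
 * lighter fall back to the fully ordered ops, so behaviour there is
 * unchanged.
 */
#ifndef test_and_set_bit_lock
#define test_and_set_bit_lock(nr, addr)	test_and_set_bit(nr, addr)
#endif
#ifndef clear_bit_unlock
#define clear_bit_unlock(nr, addr)				\
	do {							\
		smp_mb__before_clear_bit();			\
		clear_bit(nr, addr);				\
	} while (0)
#endif

static inline void lock_page(struct page *page)
{
	might_sleep();
	if (test_and_set_bit_lock(PG_locked, &page->flags))
		__lock_page(page);	/* unchanged slow path */
}

void unlock_page(struct page *page)
{
	if (!PageLocked(page))
		BUG();
	clear_bit_unlock(PG_locked, &page->flags);
	wake_up_page(page, PG_locked);	/* still needs its own ordering vs waiters */
}

Since the saving would live entirely in the arch bitops, every
lock-bit user would pick it up for free, not just the page lock.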