linux-mm.kvack.org archive mirror
* Populating multiple ptes at fault time
@ 2008-09-17 17:47 Jeremy Fitzhardinge
  2008-09-17 18:28 ` Rik van Riel
                   ` (3 more replies)
  0 siblings, 4 replies; 40+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-17 17:47 UTC (permalink / raw)
  To: Nick Piggin, Hugh Dickens
  Cc: Linux Memory Management List, Linux Kernel Mailing List,
	Avi Kivity, Andrew Morton, Rik van Riel

Avi and I were discussing whether we should populate multiple ptes at
pagefault time, rather than one at a time as we do now.

When Linux is operating as a virtual guest, pte population will
generally involve some kind of trap to the hypervisor, either to
validate the pte contents (in Xen's case) or to update the shadow
pagetable (kvm).  This is relatively expensive, and it would be good to
amortise the cost by populating multiple ptes at once.

Xen and kvm already batch pte updates where multiple ptes are explicitly
updated at once (mprotect and unmap, mostly), but in practice that's
relatively rare.  Most pages are demand faulted into a process one at a
time.
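
For reference, the existing Xen batching has roughly this shape; a
hedged sketch (struct mmu_update and HYPERVISOR_mmu_update() are the
real Xen interfaces, the helper itself is invented for illustration):

/* Sketch only: queue several pte writes and submit them in one trap. */
static void set_ptes_batched(pte_t *ptep, const pte_t *vals, int count)
{
        struct mmu_update req[16];
        int i;

        BUG_ON(count > 16);
        for (i = 0; i < count; i++) {
                req[i].ptr = virt_to_machine(ptep + i).maddr; /* which pte */
                req[i].val = pte_val_ma(vals[i]);             /* new contents */
        }
        /* one hypercall validates and installs all 'count' entries */
        if (HYPERVISOR_mmu_update(req, count, NULL, DOMID_SELF))
                BUG();
}

The point would be to get ordinary demand faults onto a path like this,
rather than paying one trap per pte.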

It seems to me there are two cases: major faults, and minor faults:

Major faults: the page in question is physically missing, and so the
fault invokes IO.  If we blindly pull in a lot of extra pages that are
never used, then we'll end up wasting a lot of memory.  However, page at
a time IO is pretty bad performance-wise too, so I guess we do clustered
fault-time IO?  If we can distinguish between random and linear fault
patterns, then we can use that as a basis for deciding how much
speculative mapping to do.  Certainly, we should create mappings for any
nearby page which does become physically present.

Minor faults are easier; if the page already exists in memory, we should
just create mappings to it.  If neighbouring pages are also already
present, then we can cheaply create mappings for them too.
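
To make that concrete, a rough sketch of the minor-fault case for a
file-backed vma (the helper is invented; find_get_page(), mk_pte() and
set_pte_at() are the real primitives; locking, rmap accounting,
refcounting and pte-page boundary checks are all omitted):

static void map_resident_neighbours(struct vm_area_struct *vma,
                                    struct mm_struct *mm,
                                    unsigned long addr, pgoff_t pgoff,
                                    pte_t *ptep, int nr)
{
        int i;

        for (i = 1; i < nr; i++) {
                unsigned long va = addr + i * PAGE_SIZE;
                struct page *page;

                if (va >= vma->vm_end)
                        break;
                page = find_get_page(vma->vm_file->f_mapping, pgoff + i);
                if (!page)
                        break;  /* not resident: stop, never start IO here */
                set_pte_at(mm, va, ptep + i,
                           mk_pte(page, vma->vm_page_prot));
        }
}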


This seems like an obvious idea, so I'm wondering if someone has
prototyped it already to see what effects there are.  In the native
case, pte updates are much cheaper, so perhaps it doesn't help much
there, though it would potentially reduce the number of faults needed. 
But I think there's scope for measurable benefits in the virtual case.

Thanks,
    J


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-17 17:47 Populating multiple ptes at fault time Jeremy Fitzhardinge
@ 2008-09-17 18:28 ` Rik van Riel
  2008-09-17 21:47   ` Jeremy Fitzhardinge
  2008-09-17 20:02 ` Chris Snook
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 40+ messages in thread
From: Rik van Riel @ 2008-09-17 18:28 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton

On Wed, 17 Sep 2008 10:47:30 -0700
Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Minor faults are easier; if the page already exists in memory, we should
> just create mappings to it.  If neighbouring pages are also already
> present, then we can cheaply create mappings for them too.

This is especially true for mmaped files, where we do not have to
allocate anything to create the mapping.

Populating multiple PTEs at a time is questionable for anonymous
memory, where we'd have to allocate extra pages.

-- 
All rights reversed.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-17 17:47 Populating multiple ptes at fault time Jeremy Fitzhardinge
  2008-09-17 18:28 ` Rik van Riel
@ 2008-09-17 20:02 ` Chris Snook
  2008-09-17 21:45   ` Jeremy Fitzhardinge
  2008-09-17 22:02 ` Avi Kivity
  2008-09-17 23:50 ` MinChan Kim
  3 siblings, 1 reply; 40+ messages in thread
From: Chris Snook @ 2008-09-17 20:02 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

Jeremy Fitzhardinge wrote:
> Avi and I were discussing whether we should populate multiple ptes at
> pagefault time, rather than one at a time as we do now.
> 
> When Linux is operating as a virtual guest, pte population will
> generally involve some kind of trap to the hypervisor, either to
> validate the pte contents (in Xen's case) or to update the shadow
> pagetable (kvm).  This is relatively expensive, and it would be good to
> amortise the cost by populating multiple ptes at once.

Is it still expensive when you're using nested page tables?

> Xen and kvm already batch pte updates where multiple ptes are explicitly
> updated at once (mprotect and unmap, mostly), but in practice that's
> relatively rare.  Most pages are demand faulted into a process one at a
> time.
> 
> It seems to me there are two cases: major faults, and minor faults:
> 
> Major faults: the page in question is physically missing, and so the
> fault invokes IO.  If we blindly pull in a lot of extra pages that are
> never used, then we'll end up wasting a lot of memory.  However, page at
> a time IO is pretty bad performance-wise too, so I guess we do clustered
> fault-time IO?  If we can distinguish between random and linear fault
> patterns, then we can use that as a basis for deciding how much
> speculative mapping to do.  Certainly, we should create mappings for any
> nearby page which does become physically present.

We already have rather well-tested code in the VM to detect fault patterns, 
complete with userspace hints to set readahead policy.  It seems to me that if 
we're going to read nearby pages into pagecache, we might as well actually map 
them at the same time.  Duplicating the readahead code is probably a bad idea.
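
(For reference, the userspace hints in question are the madvise()
family; a minimal, self-contained example of steering readahead for a
mapping, using an arbitrary example file:)

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/usr/share/dict/words";  /* any large file */
        int fd = open(path, O_RDONLY);
        struct stat st;
        char *p;
        long i, sum = 0;

        if (fd < 0 || fstat(fd, &st) < 0)
                return 1;
        p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        madvise(p, st.st_size, MADV_SEQUENTIAL);  /* hint: linear access */
        madvise(p, st.st_size, MADV_WILLNEED);    /* hint: read ahead now */

        for (i = 0; i < st.st_size; i++)          /* faults hit pagecache */
                sum += p[i];
        printf("summed %ld bytes: %ld\n", (long)st.st_size, sum);

        munmap(p, st.st_size);
        close(fd);
        return 0;
}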

> Minor faults are easier; if the page already exists in memory, we should
> just create mappings to it.  If neighbouring pages are also already
> present, then we can cheaply create mappings for them too.

If we're mapping pagecache, then sure, this is really cheap, but speculatively 
allocating anonymous pages will hurt, badly, on many workloads.

> This seems like an obvious idea, so I'm wondering if someone has
> prototyped it already to see what effects there are.  In the native
> case, pte updates are much cheaper, so perhaps it doesn't help much
> there, though it would potentially reduce the number of faults needed. 
> But I think there's scope for measurable benefits in the virtual case.

Sounds like something we might want to enable conditionally on the use of pv_ops 
features.

-- Chris


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-17 20:02 ` Chris Snook
@ 2008-09-17 21:45   ` Jeremy Fitzhardinge
  2008-09-18 18:16     ` Christoph Lameter
  0 siblings, 1 reply; 40+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-17 21:45 UTC (permalink / raw)
  To: Chris Snook
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

Chris Snook wrote:
> Is it still expensive when you're using nested page tables?

No, nested pagetables are the same as native to update, so the main
benefit in that case is the reduction of faults.

> We already have rather well-tested code in the VM to detect fault
> patterns, complete with userspace hints to set readahead policy.  It
> seems to me that if we're going to read nearby pages into pagecache,
> we might as well actually map them at the same time.  Duplicating the
> readahead code is probably a bad idea.

Right, that was my point.  I'm assuming that that machinery already
exists and would be available for use in this case.

>> Minor faults are easier; if the page already exists in memory, we should
>> just create mappings to it.  If neighbouring pages are also already
>> present, then we can cheaply create mappings for them too.
>
> If we're mapping pagecache, then sure, this is really cheap, but
> speculatively allocating anonymous pages will hurt, badly, on many
> workloads.

OK, makes sense.  Does the access pattern detecting code measure access
patterns to anonymous mappings?

>> This seems like an obvious idea, so I'm wondering if someone has
>> prototyped it already to see what effects there are.  In the native
>> case, pte updates are much cheaper, so perhaps it doesn't help much
>> there, though it would potentially reduce the number of faults
>> needed. But I think there's scope for measurable benefits in the
>> virtual case.
>
> Sounds like something we might want to enable conditionally on the use
> of pv_ops features.

Perhaps, but I'd rather avoid it.  I'm hoping this is something we could
do that has - at worst - no effect on the native case, while improving
the virtual case.  The test matrix is already large enough without
adding another stateful switch.  After all, any side effect which makes
it a bad idea for the native case will probably be bad enough to
overwhelm any benefit in the virtual case.

    J


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-17 18:28 ` Rik van Riel
@ 2008-09-17 21:47   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 40+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-17 21:47 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton

Rik van Riel wrote:
> On Wed, 17 Sep 2008 10:47:30 -0700
> Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> Minor faults are easier; if the page already exists in memory, we should
>> just create mappings to it.  If neighbouring pages are also already
>> present, then we can cheaply create mappings for them too.
>>     
>
> This is especially true for mmaped files, where we do not have to
> allocate anything to create the mapping.
>   

Yes, that was the case I particularly had in mind.

> Populating multiple PTEs at a time is questionable for anonymous
> memory, where we'd have to allocate extra pages.
>   

It might be worthwhile if the memory access pattern to anonymous memory
is linear.  I agree that speculatively allocating pages on a random
access region would be a bad idea.

    J


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-17 17:47 Populating multiple ptes at fault time Jeremy Fitzhardinge
  2008-09-17 18:28 ` Rik van Riel
  2008-09-17 20:02 ` Chris Snook
@ 2008-09-17 22:02 ` Avi Kivity
  2008-09-17 22:30   ` Jeremy Fitzhardinge
  2008-09-19 17:45   ` Benjamin Herrenschmidt
  2008-09-17 23:50 ` MinChan Kim
  3 siblings, 2 replies; 40+ messages in thread
From: Avi Kivity @ 2008-09-17 22:02 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

Jeremy Fitzhardinge wrote:
> Minor faults are easier; if the page already exists in memory, we should
> just create mappings to it.  If neighbouring pages are also already
> present, then we can cheaply create mappings for them too.
>
>   

One problem is the accessed bit.  If it's unset, the shadow code cannot 
make the pte present (since it has to trap in order to set the accessed 
bit); if it's set, we're lying to the vm.

This doesn't affect Xen, only kvm.
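
To spell the constraint out as a sketch (not kvm code; the helper is
invented, while pte_present() and pte_young() are the real accessors):

/* The shadow may only create a translation for a guest pte whose
 * accessed bit is already set; otherwise the shadow entry must stay
 * not-present so the first real access traps and A gets set honestly. */
static bool guest_pte_shadowable(pte_t gpte)
{
        if (!pte_present(gpte))
                return false;
        if (!pte_young(gpte))   /* A clear: must trap to set it */
                return false;
        return true;
}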

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-17 22:02 ` Avi Kivity
@ 2008-09-17 22:30   ` Jeremy Fitzhardinge
  2008-09-17 22:47     ` Avi Kivity
  2008-09-19 17:45   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 40+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-17 22:30 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

Avi Kivity wrote:
> Jeremy Fitzhardinge wrote:
>> Minor faults are easier; if the page already exists in memory, we should
>> just create mappings to it.  If neighbouring pages are also already
>> present, then we can cheaply create mappings for them too.
>>

(Just to clarify an ambiguity here: by "present" I mean "exists in
memory" not "a present pte".)

> One problem is the accessed bit.  If it's unset, the shadow code
> cannot make the pte present (since it has to trap in order to set the
> accessed bit); if it's set, we're lying to the vm.

So even if the guest pte were present but non-accessed, the shadow pte
would have to be non-present and you'd end up taking the fault anyway?

Hm, that does undermine the benefits.  Does that mean that when the vm
clears the access bit, you always have to make the shadow non-present? 
I guess so.  And similarly with dirty and writable shadow.

The counter-argument is that something has gone wrong if we start
populating ptes that aren't going to be used in the near future anyway -
if they're never used then any effort taken to populate them is wasted. 
Therefore, setting accessed on them from the outset isn't terribly bad.

(I'm not very convinced by that argument either, and it makes the
potential for bad side-effects much worse if the apparent RSS of a
process is multiplied by some factor.)

    J


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-17 22:30   ` Jeremy Fitzhardinge
@ 2008-09-17 22:47     ` Avi Kivity
  2008-09-17 23:02       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 40+ messages in thread
From: Avi Kivity @ 2008-09-17 22:47 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

Jeremy Fitzhardinge wrote:
>> One problem is the accessed bit.  If it's unset, the shadow code
>> cannot make the pte present (since it has to trap in order to set the
>> accessed bit); if it's set, we're lying to the vm.
>>     
>
> So even if the guest pte were present but non-accessed, the shadow pte
> would have to be non-present and you'd end up taking the fault anyway?
>
>   

Yes.

> Hm, that does undermine the benefits.  Does that mean that when the vm
> clears the access bit, you always have to make the shadow non-present? 
> I guess so.  And similarly with dirty and writable shadow.
>
>   

Yes.

> The counter-argument is that something has gone wrong if we start
> populating ptes that aren't going to be used in the near future anyway -
> if they're never used then any effort taken to populate them is wasted. 
> Therefore, setting accessed on them from the outset isn't terribly bad.
>
>   

We don't know whether the page will be used or not.  Keeping the 
accessed bit clear allows the vm to reclaim it early, and in preference 
to the pages it actually used.

We could work around it by having a hypercall to read and clear accessed 
bits.  If we know the guest will only do that via the hypercall, we can 
keep the accessed (and dirty) bits in the host, and not update them in 
the guest at all.  Given good batching, there's potential for a large 
win there.

(If the host throws away a shadow page, it could sync the bits back into 
the guest pte for safekeeping)

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-17 22:47     ` Avi Kivity
@ 2008-09-17 23:02       ` Jeremy Fitzhardinge
  2008-09-18 20:26         ` Avi Kivity
  0 siblings, 1 reply; 40+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-17 23:02 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

Avi Kivity wrote:
> We could work around it by having a hypercall to read and clear
> accessed bits.  If we know the guest will only do that via the
> hypercall, we can keep the accessed (and dirty) bits in the host, and
> not update them in the guest at all.  Given good batching, there's
> potential for a large win there.

We added a hypercall to update just the AD bits, though it was primarily
to update D without losing the hardware-set A bit.

I don't think it would be practical to add a hypercall to read the A
bit.  There's too much code which just assumes it can grab a pte and
test the bit state.  There's no pv_op for reading a pte in general, and
even if there were you'd need to have a specialized pv-op for
specifically reading the A bit to avoid unnecessary hypercalls.

Setting/clearing the A bit could be done via the normal set_pte pv_op,
so that's not a big deal.

Do you need to set the A bit synchronously?  What happens if you install
the guest and shadow pte with A clear, and then lazily transfer the A
bit state from the shadow to guest pte?  Maybe at some significant event
like  a tlb flush or:

> (If the host throws away a shadow page, it could sync the bits back
> into the guest pte for safekeeping)


    J


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-17 17:47 Populating multiple ptes at fault time Jeremy Fitzhardinge
                   ` (2 preceding siblings ...)
  2008-09-17 22:02 ` Avi Kivity
@ 2008-09-17 23:50 ` MinChan Kim
  2008-09-18  6:58   ` KOSAKI Motohiro
  2008-09-18  7:26   ` KAMEZAWA Hiroyuki
  3 siblings, 2 replies; 40+ messages in thread
From: MinChan Kim @ 2008-09-17 23:50 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Rik van Riel, Andrew Morton
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity

Hi, all

I have been thinking about this idea for the native case.
I hadn't considered it for minor page faults.
As you know, they are much cheaper than major faults.
However, page faults are one of the big bottlenecks on a demand-paging system.
I think major faults might be a rather big overhead on a many-core system.

What do you think about this idea for the native case?
Do you really think this idea wouldn't help much natively?

If I implement it for the native case, what kind of benchmark do I need?
Could you recommend any benchmarks?


On Thu, Sep 18, 2008 at 2:47 AM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> Avi and I were discussing whether we should populate multiple ptes at
> pagefault time, rather than one at a time as we do now.
>
> When Linux is operating as a virtual guest, pte population will
> generally involve some kind of trap to the hypervisor, either to
> validate the pte contents (in Xen's case) or to update the shadow
> pagetable (kvm).  This is relatively expensive, and it would be good to
> amortise the cost by populating multiple ptes at once.
>
> Xen and kvm already batch pte updates where multiple ptes are explicitly
> updated at once (mprotect and unmap, mostly), but in practise that's
> relatively rare.  Most pages are demand faulted into a process one at a
> time.
>
> It seems to me there are two cases: major faults, and minor faults:
>
> Major faults: the page in question is physically missing, and so the
> fault invokes IO.  If we blindly pull in a lot of extra pages that are
> never used, then we'll end up wasting a lot of memory.  However, page at
> a time IO is pretty bad performance-wise too, so I guess we do clustered
> fault-time IO?  If we can distinguish between random and linear fault
> patterns, then we can use that as a basis for deciding how much
> speculative mapping to do.  Certainly, we should create mappings for any
> nearby page which does become physically present.
>
> Minor faults are easier; if the page already exists in memory, we should
> just create mappings to it.  If neighbouring pages are also already
> present, then we can cheaply create mappings for them too.
>
>
> This seems like an obvious idea, so I'm wondering if someone has
> prototyped it already to see what effects there are.  In the native
> case, pte updates are much cheaper, so perhaps it doesn't help much
> there, though it would potentially reduce the number of faults needed.
> But I think there's scope for measurable benefits in the virtual case.
>
> Thanks,
>    J



-- 
Kind regards,
MinChan Kim


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-17 23:50 ` MinChan Kim
@ 2008-09-18  6:58   ` KOSAKI Motohiro
  2008-09-18  7:26   ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 40+ messages in thread
From: KOSAKI Motohiro @ 2008-09-18  6:58 UTC (permalink / raw)
  To: MinChan Kim
  Cc: kosaki.motohiro, Jeremy Fitzhardinge, Rik van Riel,
	Andrew Morton, Nick Piggin, Hugh Dickens,
	Linux Memory Management List, Linux Kernel Mailing List,
	Avi Kivity

> Hi, all
> 
> I have been thinking about this idea for the native case.
> I hadn't considered it for minor page faults.
> As you know, they are much cheaper than major faults.
> However, page faults are one of the big bottlenecks on a demand-paging system.
> I think major faults might be a rather big overhead on a many-core system.
> 
> What do you think about this idea for the native case?
> Do you really think this idea wouldn't help much natively?
> 
> If I implement it for the native case, what kind of benchmark do I need?
> Could you recommend any benchmarks?

I guess it is also useful for the native case.
Then, if you post a patch and benchmark results, I'll review them with pleasure.





^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-17 23:50 ` MinChan Kim
  2008-09-18  6:58   ` KOSAKI Motohiro
@ 2008-09-18  7:26   ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 40+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-18  7:26 UTC (permalink / raw)
  To: MinChan Kim
  Cc: Jeremy Fitzhardinge, Rik van Riel, Andrew Morton, Nick Piggin,
	Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity

On Thu, 18 Sep 2008 08:50:05 +0900
"MinChan Kim" <minchan.kim@gmail.com> wrote:

> Hi, all
> 
> I have been thinking about this idea for the native case.
> I hadn't considered it for minor page faults.
> As you know, they are much cheaper than major faults.
> However, page faults are one of the big bottlenecks on a demand-paging system.
> I think major faults might be a rather big overhead on a many-core system.
> 
> What do you think about this idea for the native case?
> Do you really think this idea wouldn't help much natively?
> 
Hmm, is enlarging the page size for anonymous pages more difficult?
(Maybe, yes.)

> If I implement it for the native case, what kind of benchmark do I need?
> Could you recommend any benchmarks?
> 

Testing some kinds of scripts (shell/perl, etc.) would be a candidate.

I use unixbench's exec/shell tests to see the charge/uncharge overhead of the
memory resource controller, which happens at major page faults.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-17 21:45   ` Jeremy Fitzhardinge
@ 2008-09-18 18:16     ` Christoph Lameter
  2008-09-18 18:53       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2008-09-18 18:16 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Snook, Nick Piggin, Hugh Dickens,
	Linux Memory Management List, Linux Kernel Mailing List,
	Avi Kivity, Andrew Morton, Rik van Riel

I had a patch like that a couple of years back but it was not accepted.

http://www.kernel.org/pub/linux/kernel/people/christoph/prefault/

http://readlist.com/lists/vger.kernel.org/linux-kernel/14/70942.html

http://www.ussg.iu.edu/hypermail/linux/kernel/0503.1/1292.html



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 18:16     ` Christoph Lameter
@ 2008-09-18 18:53       ` Jeremy Fitzhardinge
  2008-09-18 19:39         ` Christoph Lameter
  2008-09-18 20:52         ` Martin Bligh
  0 siblings, 2 replies; 40+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-18 18:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chris Snook, Nick Piggin, Hugh Dickens,
	Linux Memory Management List, Linux Kernel Mailing List,
	Avi Kivity, Andrew Morton, Rik van Riel, Martin J. Bligh

Christoph Lameter wrote:
> I had a patch like that a couple of years back but it was not accepted.
>
> http://www.kernel.org/pub/linux/kernel/people/christoph/prefault/
>
> http://readlist.com/lists/vger.kernel.org/linux-kernel/14/70942.html
>
> http://www.ussg.iu.edu/hypermail/linux/kernel/0503.1/1292.html
>
>   

Thanks, that was exactly what I was hoping to see.  I didn't see any
definitive statements against the patch set, other than a concern that
it could make things worse.  Was the upshot that no consensus was
reached about how to detect when it's beneficial to preallocate anonymous
pages?

Martin, in that thread you mentioned that you had tried pre-populating
file-backed mappings as well, but "Mmmm ... we tried doing this before
for filebacked pages by sniffing the
pagecache, but it crippled forky workloads (like kernel compile) with the
extra cost in zap_pte_range, etc. ".

Could you describe, or have a pointer to, what you tried and how it
turned out?  Did you end up populating so many (unused) ptes that
zap_pte_range needed to do lots more work?

Christoph (and others): do you think vm changes in the last 4 years
would have changed the outcome of these results?


Thanks,
    J


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 18:53       ` Jeremy Fitzhardinge
@ 2008-09-18 19:39         ` Christoph Lameter
  2008-09-18 22:21           ` KOSAKI Motohiro
  2008-09-18 20:52         ` Martin Bligh
  1 sibling, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2008-09-18 19:39 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Snook, Nick Piggin, Hugh Dickens,
	Linux Memory Management List, Linux Kernel Mailing List,
	Avi Kivity, Andrew Morton, Rik van Riel, Martin J. Bligh

Jeremy Fitzhardinge wrote:
> Thanks, that was exactly what I was hoping to see.  I didn't see any
> definitive statements against the patch set, other than a concern that
> it could make things worse.  Was the upshot that no consensus was
> reached about how to detect when it's beneficial to preallocate anonymous
> pages?

There were multiple discussions on the subject. The consensus was that it was
difficult to generalize this and it would only work on special loads. Plus it
would add some overhead to the general case.

> Christoph (and others): do you think vm changes in the last 4 years
> would have changed the outcome of these results?

Seems that the code today is similar. So it would still work.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-17 23:02       ` Jeremy Fitzhardinge
@ 2008-09-18 20:26         ` Avi Kivity
  2008-09-18 22:18           ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 40+ messages in thread
From: Avi Kivity @ 2008-09-18 20:26 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel, Marcelo Tosatti

(potential victim cc'ed)

Jeremy Fitzhardinge wrote:
> Avi Kivity wrote:
>   
>> We could work around it by having a hypercall to read and clear
>> accessed bits.  If we know the guest will only do that via the
>> hypercall, we can keep the accessed (and dirty) bits in the host, and
>> not update them in the guest at all.  Given good batching, there's
>> potential for a large win there.
>>     
>
> We added a hypercall to update just the AD bits, though it was primarily
> to update D without losing the hardware-set A bit.
>
> I don't think it would be practical to add a hypercall to read the A
> bit.  There's too much code which just assumes it can grab a pte and
> test the bit state.  There's no pv_op for reading a pte in general, and
> even if there were you'd need to have a specialized pv-op for
> specifically reading the A bit to avoid unnecessary hypercalls.
>
>   

I didn't think so much code would be interested in the accessed bit.  I 
can think of

 - pte teardown (to mark the page accessed)
 - scanning the active list
 - fork (which copies ptes)

> Setting/clearing the A bit could be done via the normal set_pte pv_op,
> so that's not a big deal.
>
> Do you need to set the A bit synchronously?  

Yes, of course (if no guest cooperation).

> What happens if you install
> the guest and shadow pte with A clear, and then lazily transfer the A
> bit state from the shadow to guest pte?  Maybe at some significant event
> like  a tlb flush or:
>
>   
>> (If the host throws away a shadow page, it could sync the bits back
>> into the guest pte for safekeeping)
>>     

I'll fail my own unit tests.

If we add an async mode for guests that can cope, maybe this is 
workable.  I guess this is what you're suggesting.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 18:53       ` Jeremy Fitzhardinge
  2008-09-18 19:39         ` Christoph Lameter
@ 2008-09-18 20:52         ` Martin Bligh
  2008-09-18 20:53           ` Chris Snook
  1 sibling, 1 reply; 40+ messages in thread
From: Martin Bligh @ 2008-09-18 20:52 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Christoph Lameter, Chris Snook, Nick Piggin, Hugh Dickens,
	Linux Memory Management List, Linux Kernel Mailing List,
	Avi Kivity, Andrew Morton, Rik van Riel

>
> Thanks, that was exactly what I was hoping to see.  I didn't see any
> definitive statements against the patch set, other than a concern that
> it could make things worse.  Was the upshot that no consensus was
> reached about how to detect when it's beneficial to preallocate anonymous
> pages?
>
> Martin, in that thread you mentioned that you had tried pre-populating
> file-backed mappings as well, but "Mmmm ... we tried doing this before
> for filebacked pages by sniffing the
> pagecache, but it crippled forky workloads (like kernel compile) with the
> extra cost in zap_pte_range, etc. ".
>
> Could you describe, or have a pointer to, what you tried and how it
> turned out?

Don't have the patches still, but it was fairly simple - just faulted in
the next 3 pages whenever we took a fault, if the pages were already
in pagecache. I would have thought that was pretty lightweight and
non-invasive, but turns out it slowed things down.

> Did you end up populating so many (unused) ptes that
> zap_pte_range needed to do lots more work?

Yup, basically you're assuming good locality of reference, but it turns
out that (as davej would say) "userspace sucks".


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 20:52         ` Martin Bligh
@ 2008-09-18 20:53           ` Chris Snook
  2008-09-18 21:11             ` Martin Bligh
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Snook @ 2008-09-18 20:53 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Jeremy Fitzhardinge, Christoph Lameter, Nick Piggin,
	Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

Martin Bligh wrote:
>> Thanks, that was exactly what I was hoping to see.  I didn't see any
>> definitive statements against the patch set, other than a concern that
>> it could make things worse.  Was the upshot that no consensus was
>> reached about how to detect when it's beneficial to preallocate anonymous
>> pages?
>>
>> Martin, in that thread you mentioned that you had tried pre-populating
>> file-backed mappings as well, but "Mmmm ... we tried doing this before
>> for filebacked pages by sniffing the
>> pagecache, but it crippled forky workloads (like kernel compile) with the
>> extra cost in zap_pte_range, etc. ".
>>
>> Could you describe, or have a pointer to, what you tried and how it
>> turned out?
> 
> Don't have the patches still, but it was fairly simple - just faulted in
> the next 3 pages whenever we took a fault, if the pages were already
> in pagecache. I would have thought that was pretty lightweight and
> non-invasive, but turns out it slowed things down.
> 
>> Did you end up populating so many (unused) ptes that
>> zap_pte_range needed to do lots more work?
> 
> Yup, basically you're assuming good locality of reference, but it turns
> out that (as davej would say) "userspace sucks".

Well, *most* userspace sucks.  It might still be worthwhile to do this when 
userspace is using madvise().

-- Chris


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 20:53           ` Chris Snook
@ 2008-09-18 21:11             ` Martin Bligh
  2008-09-18 21:13               ` Christoph Lameter
  0 siblings, 1 reply; 40+ messages in thread
From: Martin Bligh @ 2008-09-18 21:11 UTC (permalink / raw)
  To: Chris Snook
  Cc: Jeremy Fitzhardinge, Christoph Lameter, Nick Piggin,
	Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

>> Yup, basically you're assuming good locality of reference, but it turns
>> out that (as davej would say) "userspace sucks".
>
> Well, *most* userspace sucks.  It might still be worthwhile to do this when
> userspace is using madvise().

Quite possibly true ... something to benchmark.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 21:11             ` Martin Bligh
@ 2008-09-18 21:13               ` Christoph Lameter
  2008-09-18 21:21                 ` Martin Bligh
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2008-09-18 21:13 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Chris Snook, Jeremy Fitzhardinge, Nick Piggin, Hugh Dickens,
	Linux Memory Management List, Linux Kernel Mailing List,
	Avi Kivity, Andrew Morton, Rik van Riel

Martin Bligh wrote:
>>> Yup, basically you're assuming good locality of reference, but it turns
>>> out that (as davej would say) "userspace sucks".
>> Well, *most* userspace sucks.  It might still be worthwhile to do this when
>> userspace is using madvise().
> 
> Quite possibly true ... something to benchmark.

Well, I guess we need a new binary format that allows one to execute binaries
in kernel address space with full powers.



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 21:13               ` Christoph Lameter
@ 2008-09-18 21:21                 ` Martin Bligh
  2008-09-18 21:32                   ` Christoph Lameter
  0 siblings, 1 reply; 40+ messages in thread
From: Martin Bligh @ 2008-09-18 21:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chris Snook, Jeremy Fitzhardinge, Nick Piggin, Hugh Dickens,
	Linux Memory Management List, Linux Kernel Mailing List,
	Avi Kivity, Andrew Morton, Rik van Riel

>>>> Yup, basically you're assuming good locality of reference, but it turns
>>>> out that (as davej would say) "userspace sucks".
>>> Well, *most* userspace sucks.  It might still be worthwhile to do this when
>>> userspace is using madvise().
>>
>> Quite possibly true ... something to benchmark.
>
> Well, I guess we need a new binary format that allows one to execute binaries
> in kernel address space with full powers.

Seems ... extreme ;-)
Maybe we just do it if we're in readahead? (or similar)


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 21:21                 ` Martin Bligh
@ 2008-09-18 21:32                   ` Christoph Lameter
  2008-09-18 21:49                     ` MinChan Kim
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2008-09-18 21:32 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Chris Snook, Jeremy Fitzhardinge, Nick Piggin, Hugh Dickens,
	Linux Memory Management List, Linux Kernel Mailing List,
	Avi Kivity, Andrew Morton, Rik van Riel

Martin Bligh wrote:

>> Well, I guess we need a new binary format that allows one to execute binaries
>> in kernel address space with full powers.
> 
> Seems ... extreme ;-)

Well yes ....

> Maybe we just do it if we're in readahead? (or similar)

If we are in kernel space then the binary can call the readahead function as
needed ... ;-O

Ok, seriously: Anonymous pages are not subject to readahead so it won't work.



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 21:32                   ` Christoph Lameter
@ 2008-09-18 21:49                     ` MinChan Kim
  2008-09-18 21:58                       ` Christoph Lameter
  0 siblings, 1 reply; 40+ messages in thread
From: MinChan Kim @ 2008-09-18 21:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Martin Bligh, Chris Snook, Jeremy Fitzhardinge, Nick Piggin,
	Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

On Fri, Sep 19, 2008 at 6:32 AM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> Martin Bligh wrote:
>
>>> Well, I guess we need a new binary format that allows one to execute binaries
>>> in kernel address space with full powers.
>>
>> Seems ... extreme ;-)
>
> Well yes ....
>
>> Maybe we just do it if we're in readahead? (or similar)
>
> If we are in kernel space then the binary can call the readahead function as
> needed ... ;-O

In the case of file-mapped pages, shouldn't we just use the kernel's
on-demand readahead mechanism?
If it is inefficient, that means we have to change the on-demand
readahead mechanism itself.
What do you think?

> Ok, seriously: Anonymous pages are not subject to readahead so it won't work.
>
>



-- 
Kind regards,
MinChan Kim


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 21:49                     ` MinChan Kim
@ 2008-09-18 21:58                       ` Christoph Lameter
  2008-09-18 22:08                         ` Martin Bligh
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2008-09-18 21:58 UTC (permalink / raw)
  To: MinChan Kim
  Cc: Martin Bligh, Chris Snook, Jeremy Fitzhardinge, Nick Piggin,
	Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

MinChan Kim wrote:

> In the case of file-mapped pages, shouldn't we just use the kernel's
> on-demand readahead mechanism?

Correct.

> If it is inefficient, that means we have to change the on-demand
> readahead mechanism itself.

Right.

My patches were only for anonymous pages not for file backed because readahead
is available for file backed mappings.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 21:58                       ` Christoph Lameter
@ 2008-09-18 22:08                         ` Martin Bligh
  2008-09-18 22:11                           ` Christoph Lameter
  0 siblings, 1 reply; 40+ messages in thread
From: Martin Bligh @ 2008-09-18 22:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: MinChan Kim, Chris Snook, Jeremy Fitzhardinge, Nick Piggin,
	Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

> My patches were only for anonymous pages not for file backed because readahead
> is available for file backed mappings.

Do we populate the PTEs though? I didn't think that was batched, but I
might well be wrong.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 22:08                         ` Martin Bligh
@ 2008-09-18 22:11                           ` Christoph Lameter
  2008-09-18 22:18                             ` Martin Bligh
  2008-09-18 22:23                             ` Chris Snook
  0 siblings, 2 replies; 40+ messages in thread
From: Christoph Lameter @ 2008-09-18 22:11 UTC (permalink / raw)
  To: Martin Bligh
  Cc: MinChan Kim, Chris Snook, Jeremy Fitzhardinge, Nick Piggin,
	Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

Martin Bligh wrote:
>> My patches were only for anonymous pages not for file backed because readahead
>> is available for file backed mappings.
> 
> Do we populate the PTEs though? I didn't think that was batched, but I
> might well be wrong.

We do not populate the PTEs and AFAICT PTE population was assumed not to be
performance critical since the backing media is comparatively slow.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 22:11                           ` Christoph Lameter
@ 2008-09-18 22:18                             ` Martin Bligh
  2008-09-18 22:22                               ` Jeremy Fitzhardinge
  2008-09-18 22:23                             ` Chris Snook
  1 sibling, 1 reply; 40+ messages in thread
From: Martin Bligh @ 2008-09-18 22:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Martin Bligh, MinChan Kim, Chris Snook, Jeremy Fitzhardinge,
	Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

>>> My patches were only for anonymous pages not for file backed because readahead
>>> is available for file backed mappings.
>>
>> Do we populate the PTEs though? I didn't think that was batched, but I
>> might well be wrong.
>
> We do not populate the PTEs and AFAICT PTE population was assumed not to be
> performance critical since the backing media is comparatively slow.

I think the times when this matters are things like glibc, which are
heavily shared -
we were only 'prefaulting' when the pagecache was already there. So it's a case
for a "readahead like algorithm", not necessarily a direct hook.

Anonymous pages seem much riskier, as presumably there's no backing page
except in the fork case.

I presume the reason Jeremy is interested is because his pagefaults are more
expensive than most (under virtualization), so he may well find a
different tradeoff
than I did (try running kernbench?)


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 20:26         ` Avi Kivity
@ 2008-09-18 22:18           ` Jeremy Fitzhardinge
  2008-09-18 23:38             ` Avi Kivity
  0 siblings, 1 reply; 40+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-18 22:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel, Marcelo Tosatti

Avi Kivity wrote:
>> Do you need to set the A bit synchronously?  
>
> Yes, of course (if no guest cooperation).

Is the A bit architecturally guaranteed to be synchronously set?  Can
speculative accesses set it?  SDM vol 3 is a bit vague about it.

> I'll fail my own unit tests.
>
> If we add an async mode for guests that can cope, maybe this is
> workable.  I guess this is what you're suggesting.
>

Yes.  At worst Linux would underestimate the process RSS a bit
(depending on how many unsynchronized ptes you leave lying around).  I
bet there's an appropriate pvop hook you could use to force
synchronization just before the kernel actually inspects the bits
(leaving lazy mode sounds good).
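
Something along the lines of the existing lazy-mmu hooks, say.  Sketch
only: the hook names and set_pte_at() are real, the surrounding
function is invented, and the flush-plus-A/D-sync behaviour is the
hypothetical part.

static void install_ptes_lazily(struct mm_struct *mm, unsigned long addr,
                                pte_t *ptep, const pte_t *vals, int count)
{
        int i;

        arch_enter_lazy_mmu_mode();
        for (i = 0; i < count; i++)
                set_pte_at(mm, addr + i * PAGE_SIZE, ptep + i, vals[i]);
        arch_leave_lazy_mmu_mode();  /* backend flushes the whole batch,
                                        and could sync A/D state back here */
}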

    J


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 19:39         ` Christoph Lameter
@ 2008-09-18 22:21           ` KOSAKI Motohiro
  0 siblings, 0 replies; 40+ messages in thread
From: KOSAKI Motohiro @ 2008-09-18 22:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, Jeremy Fitzhardinge, Chris Snook, Nick Piggin,
	Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel, Martin J. Bligh

> Jeremy Fitzhardinge wrote:
> > Thanks, that was exactly what I was hoping to see.  I didn't see any
> > definitive statements against the patch set, other than a concern that
> > it could make things worse.  Was the upshot that no consensus was
> > reached about how to detect when it's beneficial to preallocate anonymous
> > pages?
> 
> There were multiple discussions on the subject. The consensus was that it was
> difficult to generalize this and it would only work on special loads. Plus it
> would add some overhead to the general case.

but at that time, large x86_64 servers didn't exist yet.
I think measuring again is valuable because the typical server environment
has changed since then.




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 22:18                             ` Martin Bligh
@ 2008-09-18 22:22                               ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 40+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-18 22:22 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Christoph Lameter, Martin Bligh, MinChan Kim, Chris Snook,
	Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

Martin Bligh wrote:
>>>> My patches were only for anonymous pages not for file backed because readahead
>>>> is available for file backed mappings.
>>>>         
>>> Do we populate the PTEs though? I didn't think that was batched, but I
>>> might well be wrong.
>>>       
>> We do not populate the PTEs and AFAICT PTE population was assumed not to be
>> performance critical since the backing media is comparatively slow.
>>     
>
> I think the times when this matters are things like glibc, which are
> heavily shared -
> we were only 'prefaulting' when the pagecache was already there. So it's a case
> for a "readahead like algorithm", not necessarily a direct hook.
>   

Yes.  My thought was that there should be very little cost to
opportunistically populating the pte for a page which is already
resident anyway.

> Anonymous pages seem much riskier, as presumably there's no backing page
> except in the fork case.
>
> I presume the reason Jeremy is interested is because his pagefaults are more
> expensive than most (under virtualization), so he may well find a
> different tradeoff
> than I did (try running kernbench?)
>   

Right.  The faults themselves are more or less the same as the native
case, but setting a pte requires a hypercall compared to a memory write
in the native case.  But I can set any number of ptes in one hypercall,
so batching would amortize the cost.

    J


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 22:11                           ` Christoph Lameter
  2008-09-18 22:18                             ` Martin Bligh
@ 2008-09-18 22:23                             ` Chris Snook
  2008-09-18 23:16                               ` MinChan Kim
  1 sibling, 1 reply; 40+ messages in thread
From: Chris Snook @ 2008-09-18 22:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Martin Bligh, MinChan Kim, Jeremy Fitzhardinge, Nick Piggin,
	Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

Christoph Lameter wrote:
> Martin Bligh wrote:
>>> My patches were only for anonymous pages not for file backed because readahead
>>> is available for file backed mappings.
>> Do we populate the PTEs though? I didn't think that was batched, but I
>> might well be wrong.
> 
> We do not populate the PTEs and AFAICT PTE population was assumed not to be
> performance critical since the backing media is comparatively slow.
> 

Perhaps we should.  In a virtual guest, the backing media is often an emulated 
IDE device, or something similarly inefficient, such that the bottleneck is the CPU.

-- Chris


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 22:23                             ` Chris Snook
@ 2008-09-18 23:16                               ` MinChan Kim
  0 siblings, 0 replies; 40+ messages in thread
From: MinChan Kim @ 2008-09-18 23:16 UTC (permalink / raw)
  To: Chris Snook
  Cc: Christoph Lameter, Martin Bligh, Jeremy Fitzhardinge,
	Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel

On Fri, Sep 19, 2008 at 7:23 AM, Chris Snook <csnook@redhat.com> wrote:
> Christoph Lameter wrote:
>>
>> Martin Bligh wrote:
>>>>
>>>> My patches were only for anonymous pages not for file backed because
>>>> readahead
>>>> is available for file backed mappings.
>>>
>>> Do we populate the PTEs though? I didn't think that was batched, but I
>>> might well be wrong.
>>
>> We do not populate the PTEs and AFAICT PTE population was assumed not to
>> be
>> performance critical since the backing media is comparatively slow.
>>
>
> Perhaps we should.  In a virtual guest, the backing media is often an
> emulated IDE device, or something similarly inefficient, such that the
> bottleneck is the CPU.

In embedded environments, many people use NAND-like devices as storage.
The read cost of a NAND-like device is lower than that of an IDE disk.
Also, embedded systems are moving to multi-core step by step.
So pte population becomes more and more important.


> -- Chris
>



-- 
Kind regards,
MinChan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 22:18           ` Jeremy Fitzhardinge
@ 2008-09-18 23:38             ` Avi Kivity
  2008-09-19  0:00               ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 40+ messages in thread
From: Avi Kivity @ 2008-09-18 23:38 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel, Marcelo Tosatti

Jeremy Fitzhardinge wrote:
> Avi Kivity wrote:
>   
>>> Do you need to set the A bit synchronously?  
>>>       
>> Yes, of course (if no guest cooperation).
>>     
>
> Is the A bit architecturally guaranteed to be synchronously set?  

I believe so.  The cpu won't cache tlb entries with the A bit clear 
(much like the shadow code), and will rmw the pte on first access.

> Can
> speculative accesses set it?  

Yes, but don't abuse this.

>> If we add an async mode for guests that can cope, maybe this is
>> workable.  I guess this is what you're suggesting.
>>
>>     
>
> Yes.  At worst Linux would underestimate the process RSS a bit
> (depending on how many unsynchronized ptes you leave lying around).  I
>   

Not the RSS (that's pte.present pages) but the working set (aka active 
list).

> bet there's an appropriate pvop hook you could use to force
> synchronization just before the kernel actually inspects the bits
> (leaving lazy mode sounds good).
>   

It would have to be a new lazy mode, not the existing one, I think.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-18 23:38             ` Avi Kivity
@ 2008-09-19  0:00               ` Jeremy Fitzhardinge
  2008-09-19  0:20                 ` Avi Kivity
  0 siblings, 1 reply; 40+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-19  0:00 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Avi Kivity, Andrew Morton,
	Rik van Riel, Marcelo Tosatti

Avi Kivity wrote:
>> Yes.  At worst Linux would underestimate the process RSS a bit
>> (depending on how many unsynchronized ptes you leave lying around).  I
>>   
>
> Not the RSS (that's pte.present pages) but the working set (aka active
> list).

Yep.

>> bet there's an appropriate pvop hook you could use to force
>> synchronization just before the kernel actually inspects the bits
>> (leaving lazy mode sounds good).
>>   
>
> It would have to be a new lazy mode, not the existing one, I think.

The only direct use of pte_young() is in zap_pte_range, within a
mmu_lazy region.  So syncing the A bit state on entering lazy mmu mode
would work fine there.

The call via page_referenced_one() doesn't seem to have a very
convenient hook though.  Perhaps putting something in
page_check_address() would do the job.
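
Something along these lines, purely as a sketch (arch_sync_pte_young()
is an invented hook, not an existing pvop; natively it is a no-op, while
Xen/kvm could use it to push any deferred A-bit state back into the pte
before rmap looks at it):

#include <linux/mm.h>
#include <asm/pgtable.h>

#ifndef arch_sync_pte_young
/* native case: the hardware sets the A bit synchronously, nothing to do */
static inline void arch_sync_pte_young(struct mm_struct *mm,
				       unsigned long addr, pte_t *ptep)
{
}
#endif

/* what page_check_address() (or its callers) could use instead of
 * reading pte_young() straight out of the pagetable */
static inline int pte_young_synced(struct mm_struct *mm,
				   unsigned long addr, pte_t *ptep)
{
	arch_sync_pte_young(mm, addr, ptep);	/* hypercall under Xen/kvm */
	return pte_young(*ptep);
}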

    J

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-19  0:00               ` Jeremy Fitzhardinge
@ 2008-09-19  0:20                 ` Avi Kivity
  2008-09-19  0:42                   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 40+ messages in thread
From: Avi Kivity @ 2008-09-19  0:20 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Rik van Riel,
	Marcelo Tosatti

Jeremy Fitzhardinge wrote:
>
>>> bet there's an appropriate pvop hook you could use to force
>>> synchronization just before the kernel actually inspects the bits
>>> (leaving lazy mode sounds good).
>>>   
>>>       
>> It would have to be a new lazy mode, not the existing one, I think.
>>     
>
> The only direct use of pte_young() is in zap_pte_range, within a
> mmu_lazy region.  So syncing the A bit state on entering lazy mmu mode
> would work fine there.
>
>   

Ugh, leaving lazy pte.a mode when entering lazy mmu mode?


> The call via page_referenced_one() doesn't seem to have a very
> convenient hook though.  Perhaps putting something in
> page_check_address() would do the job.
>
>   

Why there?

Why not explicitly in the callers?  We need more than to exit lazy pte.a
mode; we also need to enter it again later.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-19  0:20                 ` Avi Kivity
@ 2008-09-19  0:42                   ` Jeremy Fitzhardinge
  2008-09-24 12:31                     ` Avi Kivity
  0 siblings, 1 reply; 40+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-19  0:42 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Rik van Riel,
	Marcelo Tosatti

Avi Kivity wrote:
>>
>> The only direct use of pte_young() is in zap_pte_range, within a
>> mmu_lazy region.  So syncing the A bit state on entering lazy mmu mode
>> would work fine there.
>>
>>   
>
> Ugh, leaving lazy pte.a mode when entering lazy mmu mode?

Well, sort of but not quite.  The kernel's announcing it's about to start
processing a batch of ptes, so the hypervisor can take the opportunity
to update their state before processing.  "Lazy-mode" is from the
perspective of the kernel lazily updating some state the hypervisor
might care about, and the sync happens when leaving mode.

The flip-side is when the hypervisor is lazily updating some state the
kernel cares about, so it makes sense that the sync when the kernel
enters its lazy mode.  But the analogy isn't very good because we don't
really have an explicit notion of "hypervisor lazy mode", or a formal
handoff of shared state between the kernel and hypervisor.  But in this
case the behaviour isn't too bad.

>> The call via page_referenced_one() doesn't seem to have a very
>> convenient hook though.  Perhaps putting something in
>> page_check_address() would do the job.
>>
>>   
>
> Why there?
>
> Why not explicitly in the callers?  We need more than to exit lazy
> pte.a mode, we also need to enter it again later.
>

Because that's the code that actually walks the pagetable and has the
address of the pte; it just returns a pte_t, not a pte_t *.  It depends
on whether you want to fetch the A bit via ptep or vaddr (in general we
pass mm, ptep and vaddr to ops which operate on the current pagetable).

    J

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-17 22:02 ` Avi Kivity
  2008-09-17 22:30   ` Jeremy Fitzhardinge
@ 2008-09-19 17:45   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 40+ messages in thread
From: Benjamin Herrenschmidt @ 2008-09-19 17:45 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jeremy Fitzhardinge, Nick Piggin, Hugh Dickens,
	Linux Memory Management List, Linux Kernel Mailing List,
	Avi Kivity, Andrew Morton, Rik van Riel

On Wed, 2008-09-17 at 15:02 -0700, Avi Kivity wrote:
> Jeremy Fitzhardinge wrote:
> > Minor faults are easier; if the page already exists in memory, we should
> > just create mappings to it.  If neighbouring pages are also already
> > present, then we can can cheaply create mappings for them too.
> >
> >   
> 
> One problem is the accessed bit.  If it's unset, the shadow code cannot 
> make the pte present (since it has to trap in order to set the accessed 
> bit); if it's set, we're lying to the vm.
> 
> This doesn't affect Xen, only kvm.

Other archs too. On powerpc, !accessed -> not hashed (or not in the TLB
for SW loaded TLB platforms). 

Ben.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-19  0:42                   ` Jeremy Fitzhardinge
@ 2008-09-24 12:31                     ` Avi Kivity
  2008-09-25 18:32                       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 40+ messages in thread
From: Avi Kivity @ 2008-09-24 12:31 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Rik van Riel,
	Marcelo Tosatti

Jeremy Fitzhardinge wrote:
> Avi Kivity wrote:
>   
>>> The only direct use of pte_young() is in zap_pte_range, within a
>>> mmu_lazy region.  So syncing the A bit state on entering lazy mmu mode
>>> would work fine there.
>>>
>>>   
>>>       
>> Ugh, leaving lazy pte.a mode when entering lazy mmu mode?
>>     
>
> Well, sort of but not quite.  The kernel's announcing it's about to start
> processing a batch of ptes, so the hypervisor can take the opportunity
> to update their state before processing.  "Lazy-mode" is from the
> perspective of the kernel lazily updating some state the hypervisor
> might care about, and the sync happens when leaving mode.
>
> The flip-side is when the hypervisor is lazily updating some state the
> kernel cares about, so it makes sense that the sync happens when the kernel
> enters its lazy mode.  But the analogy isn't very good because we don't
> really have an explicit notion of "hypervisor lazy mode", or a formal
> handoff of shared state between the kernel and hypervisor.  But in this
> case the behaviour isn't too bad.
>
>   

Handwavy.  I think the two notions are separate <insert handwavy 
counter-arguments>.

>>> The call via page_referenced_one() doesn't seem to have a very
>>> convenient hook though.  Perhaps putting something in
>>> page_check_address() would do the job.
>>>
>>>   
>>>       
>> Why there?
>>
>> Why not explicitly in the callers?  We need more than to exit lazy
>> pte.a mode, we also need to enter it again later.
>>
>>     
>
> Because that's the code that actually walks the pagetable and has the
> address of the pte; it just returns a pte_t, not a pte_t *.  It depends
> on whether you want to fetch the A bit via ptep or vaddr (in general we
> pass mm, ptep and vaddr to ops which operate on the current pagetable).
>   

pte_clear_flush_young_notify_etc() seems even closer.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-24 12:31                     ` Avi Kivity
@ 2008-09-25 18:32                       ` Jeremy Fitzhardinge
  2008-09-26 10:26                         ` Martin Schwidefsky
  0 siblings, 1 reply; 40+ messages in thread
From: Jeremy Fitzhardinge @ 2008-09-25 18:32 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Hugh Dickens, Linux Memory Management List,
	Linux Kernel Mailing List, Andrew Morton, Rik van Riel,
	Marcelo Tosatti, Benjamin Herrenschmidt, Martin Schwidefsky

Avi Kivity wrote:
> Jeremy Fitzhardinge wrote:
>> Avi Kivity wrote:
>>  
>>>> The only direct use of pte_young() is in zap_pte_range, within a
>>>> mmu_lazy region.  So syncing the A bit state on entering lazy mmu mode
>>>> would work fine there.
>>>>
>>>>         
>>> Ugh, leaving lazy pte.a mode when entering lazy mmu mode?
>>>     
>>
>> Well, sort of but not quite.  The kernel's announcing it's about to start
>> processing a batch of ptes, so the hypervisor can take the opportunity
>> to update their state before processing.  "Lazy-mode" is from the
>> perspective of the kernel lazily updating some state the hypervisor
>> might care about, and the sync happens when leaving mode.
>>
>> The flip-side is when the hypervisor is lazily updating some state the
>> kernel cares about, so it makes sense that the sync happens when the kernel
>> enters its lazy mode.  But the analogy isn't very good because we don't
>> really have an explicit notion of "hypervisor lazy mode", or a formal
>> handoff of shared state between the kernel and hypervisor.  But in this
>> case the behaviour isn't too bad.
>>
>>   
>
> Handwavy.  I think the two notions are separate <insert handwavy
> counter-arguments>.

Perhaps this helps:

Context switches between guest<->hypervisor are relatively expensive. 
The more work we can make each context switch perform the better,
because we can amortize the cost.  Rather than synchronously switching
between the two every time one wants to express a state change to the
other, we batch those changes up and only sync when it's important.
While there are batched outstanding changes in one, the other will have
a somewhat out of date view of the state.  At this level, the idea of
batching is completely symmetrical.

One of the ways we amortize the cost of guest->hypervisor transitions is
by batching multiple pagetable updates together.  This works at two
levels: within explicit arch_enter/leave_lazy_mmu lazy regions, and also
because it is analogous to the architectural requirement that you must
flush the tlb before updates "really" happen.
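
Seen from the generic code, that batching looks roughly like this
(remap_range() is a made-up caller; arch_enter/leave_lazy_mmu_mode()
and set_pte_at() are the real interfaces):

#include <linux/mm.h>
#include <asm/pgtable.h>

static void remap_range(struct mm_struct *mm, unsigned long addr,
			pte_t *ptep, const pte_t *newvals, int nr)
{
	int i;

	arch_enter_lazy_mmu_mode();	/* start queueing pte updates */
	for (i = 0; i < nr; i++)
		set_pte_at(mm, addr + i * PAGE_SIZE, ptep + i, newvals[i]);
	arch_leave_lazy_mmu_mode();	/* one flush/hypercall for the batch */
}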

KVM - and other shadow pagetable implementations - have the additional
problem of transmitting A/D state updates from the shadow pagetable into
the guest pagetable.  Doing this synchronously has the costs we've been
discussing in this thread (namely, extra faults we would like to
avoid).  Doing this in a deferred or batched way is awkward because
there's no analogous architectural asynchrony in updating these pte
flags, and we don't have any existing mechanisms or hooks to support
this kind of deferred update.

However, given that we're talking about cleaning up the pagetable api
anyway, there's no reason we couldn't incorporate this kind of deferred
update in a more formal way.  It definitely makes sense when you have
shadow pagetables, and it probably makes sense on other architectures too.

Very few places actually care about the state of the A/D bits; would it
be expensive to make those places explicitly ask for synchronization
before testing the bits (or, alternatively, to have an explicit query
operation rather than just poking about in the ptes)?  Martin, does this
help with s390's per-page (vs per-pte) A/D state?
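
As a strawman for the query operation (ptep_query_young() is an invented
name; the native fallback just reads the A bit, while a shadow-pagetable
or storage-key implementation could go and ask wherever the state really
lives):

#include <linux/mm.h>
#include <asm/pgtable.h>

#ifndef ptep_query_young
#define ptep_query_young(vma, addr, ptep)	pte_young(*(ptep))
#endif

/* page_referenced_one() would then do something along the lines of: */
static int page_was_referenced(struct vm_area_struct *vma,
			       unsigned long address, pte_t *ptep)
{
	return ptep_query_young(vma, address, ptep) ? 1 : 0;
}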

    J

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Populating multiple ptes at fault time
  2008-09-25 18:32                       ` Jeremy Fitzhardinge
@ 2008-09-26 10:26                         ` Martin Schwidefsky
  0 siblings, 0 replies; 40+ messages in thread
From: Martin Schwidefsky @ 2008-09-26 10:26 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Avi Kivity, Nick Piggin, Hugh Dickens,
	Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Rik van Riel, Marcelo Tosatti,
	Benjamin Herrenschmidt

On Thu, 2008-09-25 at 11:32 -0700, Jeremy Fitzhardinge wrote:
> Very few places actually care about the state of the A/D bits; would it
> be expensive to make those places explicitly ask for synchronization
> before testing the bits (or alternatively, have an explicit query
> operation rather than just poking about in the ptes).  Martin, does this
> help with s390's per-page (vs per-pte) A/D state?

With the kvm support, the situation on s390 has recently grown a tad more
complicated.  We now have dirty bits in the per-page storage key and in
the pgste (page table entry extension) for the kvm guests.  For the A/D
bits in the storage key the new pte operations won't help; for the
kvm-related bits they could make a difference.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2008-09-26 10:43 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-09-17 17:47 Populating multiple ptes at fault time Jeremy Fitzhardinge
2008-09-17 18:28 ` Rik van Riel
2008-09-17 21:47   ` Jeremy Fitzhardinge
2008-09-17 20:02 ` Chris Snook
2008-09-17 21:45   ` Jeremy Fitzhardinge
2008-09-18 18:16     ` Christoph Lameter
2008-09-18 18:53       ` Jeremy Fitzhardinge
2008-09-18 19:39         ` Christoph Lameter
2008-09-18 22:21           ` KOSAKI Motohiro
2008-09-18 20:52         ` Martin Bligh
2008-09-18 20:53           ` Chris Snook
2008-09-18 21:11             ` Martin Bligh
2008-09-18 21:13               ` Christoph Lameter
2008-09-18 21:21                 ` Martin Bligh
2008-09-18 21:32                   ` Christoph Lameter
2008-09-18 21:49                     ` MinChan Kim
2008-09-18 21:58                       ` Christoph Lameter
2008-09-18 22:08                         ` Martin Bligh
2008-09-18 22:11                           ` Christoph Lameter
2008-09-18 22:18                             ` Martin Bligh
2008-09-18 22:22                               ` Jeremy Fitzhardinge
2008-09-18 22:23                             ` Chris Snook
2008-09-18 23:16                               ` MinChan Kim
2008-09-17 22:02 ` Avi Kivity
2008-09-17 22:30   ` Jeremy Fitzhardinge
2008-09-17 22:47     ` Avi Kivity
2008-09-17 23:02       ` Jeremy Fitzhardinge
2008-09-18 20:26         ` Avi Kivity
2008-09-18 22:18           ` Jeremy Fitzhardinge
2008-09-18 23:38             ` Avi Kivity
2008-09-19  0:00               ` Jeremy Fitzhardinge
2008-09-19  0:20                 ` Avi Kivity
2008-09-19  0:42                   ` Jeremy Fitzhardinge
2008-09-24 12:31                     ` Avi Kivity
2008-09-25 18:32                       ` Jeremy Fitzhardinge
2008-09-26 10:26                         ` Martin Schwidefsky
2008-09-19 17:45   ` Benjamin Herrenschmidt
2008-09-17 23:50 ` MinChan Kim
2008-09-18  6:58   ` KOSAKI Motohiro
2008-09-18  7:26   ` KAMEZAWA Hiroyuki
