* Re: How can we make page replacement smarter (was: swap-prefetch)
  [not found] ` <fa.0CL7DLsw6U7akTkW79pdCM5NPRk@ifi.uio.no>
@ 2007-07-28 16:32   ` Robert Hancock
  0 siblings, 0 replies; 6+ messages in thread
From: Robert Hancock @ 2007-07-28 16:32 UTC (permalink / raw)
  To: Al Boldi; +Cc: Chris Snook, linux-kernel, linux-mm

Al Boldi wrote:
> Chris Snook wrote:
>> Al Boldi wrote:
>>> Because it is hard to quantify the expected swap-in speed for random
>>> pages, let's first tackle the swap-in of consecutive pages, which should
>>> be at least as fast as swap-out. So again, why is swap-in so slow?
>>
>> If I'm writing 20 pages to swap, I can find a suitable chunk of swap and
>> write them all in one place. If I'm reading 20 pages from swap, they
>> could be anywhere. Also, writes get buffered at one or more layers of
>> hardware.
>
> Ok, this explains swap-in of random pages. Makes sense, but it doesn't
> explain the awful tmpfs performance degradation of consecutive read-in
> runs from swap, which should have at least stayed constant.
>
>> At best, reads can be read-ahead and cached, which is why
>> sequential swap-in sucks less. On-demand reads are as expensive as I/O
>> can get.
>
> Which means that it should be at least as fast as swap-out, even faster
> because write to disk is usually slower than read on modern disks. But
> Linux currently shows a distinct 2x slowdown for sequential swap-in wrt
> swap-out. And to prove this point, just try suspend to disk, where you
> can see sequential swap-out being reported at about twice the speed of
> sequential swap-in on resume. Why is that?

That depends on whether swap-in is doing any read-ahead. If it's reading
one page at a time in from the disk, then the performance will definitely
suck because of all the overhead from the tiny I/Os. With random swap-in
you then pay the horrible seek penalty for all the reads as well.
--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread
[parent not found: <200707272243.02336.a1426z@gawab.com>]
* Re: swap-prefetch: A smart way to make good use of idle resources (was: updatedb)
  [not found] <200707272243.02336.a1426z@gawab.com>
@ 2007-07-28  1:56 ` Chris Snook
  2007-07-28  4:17   ` How can we make page replacement smarter (was: swap-prefetch) Al Boldi
  0 siblings, 1 reply; 6+ messages in thread
From: Chris Snook @ 2007-07-28 1:56 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel, linux-mm

Al Boldi wrote:
> People wrote:
>>>> I believe the users who say their apps really do get paged back in
>>>> though, so suspect that's not the case.
>>>
>>> Stopping the bush-circumference beating, I do not. -ck (and gentoo) have
>>> this massive Calimero thing going among their users where people are
>>> much less interested in technology than in how the nasty big kernel
>>> meanies are keeping them down (*).
>>
>> I think the problem is elsewhere. Users don't say: "My apps get paged
>> back in." They say: "My system is more responsive." They really don't
>> care *why* the reaction to a mouse click that takes three seconds with
>> a mainline kernel is instantaneous with -ck. Nasty big kernel meanies,
>> OTOH, want to understand *why* a patch helps in order to decide whether
>> it is really a good idea to merge it. So you've got a bunch of patches
>> (aka -ck) which visibly improve the overall responsiveness of a desktop
>> system, but apparently no one can conclusively explain why or how they
>> achieve that, and therefore they cannot be merged into mainline.
>>
>> I don't have a solution to that dilemma either.
>
> IMHO, what everybody agrees on is that swap-prefetch has a positive effect
> in some cases, and nobody can prove an adverse effect (excluding power
> consumption). The reason for this positive effect is also crystal clear:
> it prefetches from swap on idle into free memory, i.e. it doesn't force
> anybody out, and the prefetched pages are the first to be dropped without
> further swap-out, which sounds really smart.
> Conclusion: Either prove swap-prefetch is broken, or get this merged
> quick.

If you can't prove why it helps and doesn't hurt, then it's a hack, by
definition. Behind any performance hack is some fundamental truth that can
be exploited to greater effect if we reason about it. So let's reason
about it. I'll start.

Resource size has been outpacing processing latency since the dawn of
time. Disks get bigger much faster than seek times shrink. Main memory
and cache keep growing, while single-threaded processing speed has nearly
ground to a halt.

In the old days, it made lots of sense to manage resource allocation in
pages and blocks. In the past few years, we started reserving blocks in
ext3 automatically because it saves more in seek time than it costs in
disk space. Now we're taking preallocation and antifragmentation to the
next level with extent-based allocation in ext4.

Well, we're still using bitmap-style allocation for pages, and the
prefetch-less swap mechanism adheres to this design as well. Maybe it's
time to start thinking about memory in a somewhat more extent-like
fashion.

With swap prefetch, we're only optimizing the case when the box isn't
loaded and there's RAM free, but we're not optimizing the case when the
box is heavily loaded and we need RAM to be free. This is a complete
reversal of sane development priorities. If swap batching is an
optimization at all (and we have empirical evidence that it is), then it
should also be an optimization to swap out chunks of pages when we need
to free memory.

So, how do we go about this grouping? I suggest that if we keep per-VMA
reference/fault/dirty statistics, we can tell which logically distinct
chunks of memory are being regularly used. This would also allow us to
apply different page replacement policies to chunks of memory that are
being used in different fashions.

With such statistics, we could then page out VMAs in 2MB chunks when
we're under memory pressure, also giving us the option of transparently
paging them back in to hugepages when we have the memory free, once
anonymous hugepage support is in place.

I'm inclined to view swap prefetch as a successful scientific experiment,
and to use that data to inform a more reasoned engineering effort. If we
can design something intelligent which happens to behave more or less
like swap prefetch does under the circumstances where swap prefetch
helps, and does something else smart under the circumstances where swap
prefetch makes no discernible difference, it'll be a much bigger
improvement.

Because we cannot prove why the existing patch helps, we cannot say what
impact it will have when things like virtualization and solid state
drives radically change the coefficients of the equation we have not
solved. Providing a sysctl to turn off a misbehaving feature is a poor
substitute for doing it right the first time, and leaving it off by
default will ensure that it only gets used by the handful of people who
know enough to rebuild with the patch anyway.

Let's talk about how we can make page replacement smarter, so it
naturally accomplishes what swap prefetch accomplishes, as part of a
design we can reason about.

CC-ing linux-mm, since that's where I think we should take this next.

-- Chris

^ permalink raw reply	[flat|nested] 6+ messages in thread
* How can we make page replacement smarter (was: swap-prefetch)
  2007-07-28  1:56 ` swap-prefetch: A smart way to make good use of idle resources (was: updatedb) Chris Snook
@ 2007-07-28  4:17   ` Al Boldi
  2007-07-28  7:27     ` Chris Snook
  0 siblings, 1 reply; 6+ messages in thread
From: Al Boldi @ 2007-07-28 4:17 UTC (permalink / raw)
  To: Chris Snook; +Cc: linux-kernel, linux-mm

Chris Snook wrote:
> Resource size has been outpacing processing latency since the dawn of
> time. Disks get bigger much faster than seek times shrink. Main memory
> and cache keep growing, while single-threaded processing speed has
> nearly ground to a halt.
>
> In the old days, it made lots of sense to manage resource allocation in
> pages and blocks. In the past few years, we started reserving blocks in
> ext3 automatically because it saves more in seek time than it costs in
> disk space. Now we're taking preallocation and antifragmentation to the
> next level with extent-based allocation in ext4.
>
> Well, we're still using bitmap-style allocation for pages, and the
> prefetch-less swap mechanism adheres to this design as well. Maybe it's
> time to start thinking about memory in a somewhat more extent-like
> fashion.
>
> With swap prefetch, we're only optimizing the case when the box isn't
> loaded and there's RAM free, but we're not optimizing the case when the
> box is heavily loaded and we need RAM to be free. This is a complete
> reversal of sane development priorities. If swap batching is an
> optimization at all (and we have empirical evidence that it is), then it
> should also be an optimization to swap out chunks of pages when we need
> to free memory.
>
> So, how do we go about this grouping? I suggest that if we keep per-VMA
> reference/fault/dirty statistics, we can tell which logically distinct
> chunks of memory are being regularly used. This would also allow us to
> apply different page replacement policies to chunks of memory that are
> being used in different fashions.
>
> With such statistics, we could then page out VMAs in 2MB chunks when
> we're under memory pressure, also giving us the option of transparently
> paging them back in to hugepages when we have the memory free, once
> anonymous hugepage support is in place.
>
> I'm inclined to view swap prefetch as a successful scientific
> experiment, and to use that data to inform a more reasoned engineering
> effort. If we can design something intelligent which happens to behave
> more or less like swap prefetch does under the circumstances where swap
> prefetch helps, and does something else smart under the circumstances
> where swap prefetch makes no discernible difference, it'll be a much
> bigger improvement.
>
> Because we cannot prove why the existing patch helps, we cannot say what
> impact it will have when things like virtualization and solid state
> drives radically change the coefficients of the equation we have not
> solved. Providing a sysctl to turn off a misbehaving feature is a poor
> substitute for doing it right the first time, and leaving it off by
> default will ensure that it only gets used by the handful of people who
> know enough to rebuild with the patch anyway.
>
> Let's talk about how we can make page replacement smarter, so it
> naturally accomplishes what swap prefetch accomplishes, as part of a
> design we can reason about.
>
> CC-ing linux-mm, since that's where I think we should take this next.

Good idea, but unless we understand the problems involved, we are bound
to repeat them. So my first question would be: why is swap-in so slow?

As I have posted in other threads, swap-in of consecutive pages suffers a
2x slowdown wrt swap-out, whereas swap-in of random pages suffers over a
6x slowdown.

Because it is hard to quantify the expected swap-in speed for random
pages, let's first tackle the swap-in of consecutive pages, which should
be at least as fast as swap-out. So again, why is swap-in so slow?

Once we understand this problem, we may be able to suggest a smart
improvement.

Thanks!

-- Al

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: How can we make page replacement smarter (was: swap-prefetch)
  2007-07-28  4:17 ` How can we make page replacement smarter (was: swap-prefetch) Al Boldi
@ 2007-07-28  7:27   ` Chris Snook
  2007-07-28 11:11     ` Al Boldi
  0 siblings, 1 reply; 6+ messages in thread
From: Chris Snook @ 2007-07-28 7:27 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel, linux-mm

Al Boldi wrote:
> Chris Snook wrote:
>> Resource size has been outpacing processing latency since the dawn of
>> time. Disks get bigger much faster than seek times shrink. Main memory
>> and cache keep growing, while single-threaded processing speed has
>> nearly ground to a halt.
>>
>> In the old days, it made lots of sense to manage resource allocation in
>> pages and blocks. In the past few years, we started reserving blocks in
>> ext3 automatically because it saves more in seek time than it costs in
>> disk space. Now we're taking preallocation and antifragmentation to the
>> next level with extent-based allocation in ext4.
>>
>> Well, we're still using bitmap-style allocation for pages, and the
>> prefetch-less swap mechanism adheres to this design as well. Maybe it's
>> time to start thinking about memory in a somewhat more extent-like
>> fashion.
>>
>> With swap prefetch, we're only optimizing the case when the box isn't
>> loaded and there's RAM free, but we're not optimizing the case when the
>> box is heavily loaded and we need RAM to be free. This is a complete
>> reversal of sane development priorities. If swap batching is an
>> optimization at all (and we have empirical evidence that it is), then
>> it should also be an optimization to swap out chunks of pages when we
>> need to free memory.
>>
>> So, how do we go about this grouping? I suggest that if we keep per-VMA
>> reference/fault/dirty statistics, we can tell which logically distinct
>> chunks of memory are being regularly used. This would also allow us to
>> apply different page replacement policies to chunks of memory that are
>> being used in different fashions.
>>
>> With such statistics, we could then page out VMAs in 2MB chunks when
>> we're under memory pressure, also giving us the option of transparently
>> paging them back in to hugepages when we have the memory free, once
>> anonymous hugepage support is in place.
>>
>> I'm inclined to view swap prefetch as a successful scientific
>> experiment, and to use that data to inform a more reasoned engineering
>> effort. If we can design something intelligent which happens to behave
>> more or less like swap prefetch does under the circumstances where swap
>> prefetch helps, and does something else smart under the circumstances
>> where swap prefetch makes no discernible difference, it'll be a much
>> bigger improvement.
>>
>> Because we cannot prove why the existing patch helps, we cannot say
>> what impact it will have when things like virtualization and solid
>> state drives radically change the coefficients of the equation we have
>> not solved. Providing a sysctl to turn off a misbehaving feature is a
>> poor substitute for doing it right the first time, and leaving it off
>> by default will ensure that it only gets used by the handful of people
>> who know enough to rebuild with the patch anyway.
>>
>> Let's talk about how we can make page replacement smarter, so it
>> naturally accomplishes what swap prefetch accomplishes, as part of a
>> design we can reason about.
>>
>> CC-ing linux-mm, since that's where I think we should take this next.
>
> Good idea, but unless we understand the problems involved, we are bound
> to repeat them. So my first question would be: why is swap-in so slow?
>
> As I have posted in other threads, swap-in of consecutive pages suffers
> a 2x slowdown wrt swap-out, whereas swap-in of random pages suffers over
> a 6x slowdown.
>
> Because it is hard to quantify the expected swap-in speed for random
> pages, let's first tackle the swap-in of consecutive pages, which should
> be at least as fast as swap-out. So again, why is swap-in so slow?

If I'm writing 20 pages to swap, I can find a suitable chunk of swap and
write them all in one place. If I'm reading 20 pages from swap, they
could be anywhere. Also, writes get buffered at one or more layers of
hardware. At best, reads can be read-ahead and cached, which is why
sequential swap-in sucks less. On-demand reads are as expensive as I/O
can get.

> Once we understand this problem, we may be able to suggest a smart
> improvement.

There are lots of page replacement schemes that optimize for different
access patterns, and they all suck at certain other access patterns. We
tweak our behavior slightly based on fadvise and madvise hints, but most
of the memory we're managing is an opaque mass. With more statistics, we
could do a better job of managing chunks of unhinted memory with
disparate access patterns. Of course, this imposes overhead. I suggested
VMA granularity because a VMA represents a logically distinct piece of
address space, though this may not be suitable for shared mappings.

-- Chris

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: How can we make page replacement smarter (was: swap-prefetch)
  2007-07-28  7:27 ` Chris Snook
@ 2007-07-28 11:11   ` Al Boldi
  2007-07-29  4:07     ` Rik van Riel
  0 siblings, 1 reply; 6+ messages in thread
From: Al Boldi @ 2007-07-28 11:11 UTC (permalink / raw)
  To: Chris Snook; +Cc: linux-kernel, linux-mm

Chris Snook wrote:
> Al Boldi wrote:
>> Because it is hard to quantify the expected swap-in speed for random
>> pages, let's first tackle the swap-in of consecutive pages, which
>> should be at least as fast as swap-out. So again, why is swap-in so
>> slow?
>
> If I'm writing 20 pages to swap, I can find a suitable chunk of swap and
> write them all in one place. If I'm reading 20 pages from swap, they
> could be anywhere. Also, writes get buffered at one or more layers of
> hardware.

Ok, this explains swap-in of random pages. Makes sense, but it doesn't
explain the awful tmpfs performance degradation of consecutive read-in
runs from swap, which should have at least stayed constant.

> At best, reads can be read-ahead and cached, which is why
> sequential swap-in sucks less. On-demand reads are as expensive as I/O
> can get.

Which means that it should be at least as fast as swap-out, even faster
because write to disk is usually slower than read on modern disks. But
Linux currently shows a distinct 2x slowdown for sequential swap-in wrt
swap-out. And to prove this point, just try suspend to disk, where you
can see sequential swap-out being reported at about twice the speed of
sequential swap-in on resume. Why is that?

Thanks!

-- Al

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: How can we make page replacement smarter (was: swap-prefetch)
  2007-07-28 11:11 ` Al Boldi
@ 2007-07-29  4:07   ` Rik van Riel
  2007-07-29  6:40     ` Erblichs
  0 siblings, 1 reply; 6+ messages in thread
From: Rik van Riel @ 2007-07-29 4:07 UTC (permalink / raw)
  To: Al Boldi; +Cc: Chris Snook, linux-kernel, linux-mm

Al Boldi wrote:
> Chris Snook wrote:
>> At best, reads can be read-ahead and cached, which is why
>> sequential swap-in sucks less. On-demand reads are as expensive as I/O
>> can get.
>
> Which means that it should be at least as fast as swap-out, even faster
> because write to disk is usually slower than read on modern disks. But
> linux currently shows a distinct 2x slowdown for sequential swap-in wrt
> swap-out.

That's because writes are faster than reads in moderate quantities.

The disk caches writes, allowing the OS to write a whole bunch of data
into the disk cache, and the disk can optimize the IO a bit internally.

The same optimization is not possible for reads.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: How can we make page replacement smarter (was: swap-prefetch)
  2007-07-29  4:07 ` Rik van Riel
@ 2007-07-29  6:40   ` Erblichs
  0 siblings, 0 replies; 6+ messages in thread
From: Erblichs @ 2007-07-29 6:40 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Al Boldi, Chris Snook, linux-kernel, linux-mm

Inline..

	Mitchell Erblich

Rik van Riel wrote:
>
> Al Boldi wrote:
> > Chris Snook wrote:
>
> >> At best, reads can be read-ahead and cached, which is why
> >> sequential swap-in sucks less. On-demand reads are as expensive as I/O
> >> can get.
> >
> > Which means that it should be at least as fast as swap-out, even faster
> > because write to disk is usually slower than read on modern disks. But
> > linux currently shows a distinct 2x slowdown for sequential swap-in wrt
> > swap-out.
>
> That's because writes are faster than reads in moderate
> quantities.

Assuming that the write is not a partial write that first requires doing a
read. Yes, a COW FS minimizes this condition (e.g. ZFS).

However, since writes are mostly asynchronous in nature, most writers
shouldn't care when the write is actually committed, just that the data is
stable at some point in the future. Thus, who would care (as long as we
are not waiting for the write to complete) if the write was slower?

IMO, it would make sense to ALMOST always accumulate a certain amount of
writable data before the write is issued, so that the write is laid out as
sequentially as possible on the disk and any later reads need minimal
seeks.

> The disk caches writes, allowing the OS to write a whole
> bunch of data into the disk cache and the disk can optimize
> the IO a bit internally.
>
> The same optimization is not possible for reads.
>
> --
> Politics is the struggle between those who want to make their country
> the best in the world, and those who believe it already is. Each group
> calls the other unpatriotic.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 6+ messages in thread
end of thread, other threads:[~2007-07-29 6:40 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <fa.RQO1FPcnWSV7f0LbL9tuLuh/fYY@ifi.uio.no>
[not found] ` <fa.FI89MRq1q0M+6SmmYNPsXQv2gC8@ifi.uio.no>
[not found] ` <fa./S2LBynIjozRhHfPsYxB9mQDpKE@ifi.uio.no>
[not found] ` <fa.0CL7DLsw6U7akTkW79pdCM5NPRk@ifi.uio.no>
2007-07-28 16:32 ` How can we make page replacement smarter (was: swap-prefetch) Robert Hancock
[not found] <200707272243.02336.a1426z@gawab.com>
2007-07-28 1:56 ` swap-prefetch: A smart way to make good use of idle resources (was: updatedb) Chris Snook
2007-07-28 4:17 ` How can we make page replacement smarter (was: swap-prefetch) Al Boldi
2007-07-28 7:27 ` Chris Snook
2007-07-28 11:11 ` Al Boldi
2007-07-29 4:07 ` Rik van Riel
2007-07-29 6:40 ` Erblichs