* page_add/remove_rmap costs
@ 2002-07-24 6:33 Andrew Morton
2002-07-24 6:48 ` William Lee Irwin III
` (3 more replies)
0 siblings, 4 replies; 20+ messages in thread
From: Andrew Morton @ 2002-07-24 6:33 UTC (permalink / raw)
To: linux-mm
Been taking a look at the page_add_rmap/page_remove_rmap cost in 2.5.27
on the quad pIII. The workload is ten instances of this script running
concurrently:
#!/bin/sh
doit()
{
	( cat $1 | wc -l )
}
count=0
while [ $count != 500 ]
do
	doit foo > /dev/null
	count=$(expr $count + 1)
done
echo done
It's just a ton of forking and exiting.
With rmap oprofile says:
./doitlots.sh 10 41.67s user 95.04s system 398% cpu 34.338 total
c0133030 317 1.07963 __free_pages_ok
c0131428 375 1.27716 kmem_cache_free
c01342d0 432 1.47129 free_page_and_swap_cache
c01281d0 461 1.57006 clear_page_tables
c013118c 462 1.57346 kmem_cache_alloc
c012a08c 470 1.60071 handle_mm_fault
c0113e50 504 1.7165 pte_alloc_one
c0107b68 512 1.74375 page_fault
c012c718 583 1.98556 find_get_page
c01332cc 650 2.21375 rmqueue
c0129bc4 807 2.74845 do_anonymous_page
c013396c 851 2.8983 page_cache_release
c0129db0 1124 3.82808 do_no_page
c0128750 1164 3.96431 zap_pte_range
c01284f8 1374 4.67952 copy_page_range
c013a994 1590 5.41516 page_add_rmap
c013aa5c 3739 12.7341 page_remove_rmap
c01293bc 5106 17.3898 do_wp_page
And without rmap it says:
./doitlots.sh 10 43.01s user 76.19s system 394% cpu 30.222 total
c013074c 238 1.20592 lru_cache_add
c0144e90 251 1.27179 link_path_walk
c0112a64 252 1.27685 do_page_fault
c0132b0c 252 1.27685 free_page_and_swap_cache
c01388b4 261 1.32246 do_page_cache_readahead
c01e0700 296 1.4998 radix_tree_lookup
c01263e0 300 1.52006 clear_page_tables
c01319b0 302 1.5302 __free_pages_ok
c01079f8 395 2.00142 page_fault
c012a8a8 396 2.00649 find_get_page
c01127d4 401 2.03182 pte_alloc_one
c0131ca0 451 2.28516 rmqueue
c0127cc8 774 3.92177 do_anonymous_page
c013230c 933 4.7274 page_cache_release
c0126880 964 4.88448 zap_pte_range
c012662c 1013 5.13275 copy_page_range
c0127e70 1138 5.76611 do_no_page
c012750c 4485 22.725 do_wp_page
So that's a ton of CPU time lost playing with pte chains.
I instrumented it all up with the `debug.patch' from
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.27/
./doitlots.sh 5
(gdb) p rmap_stats
$3 = {page_add_rmap = 4106673, page_add_rmap_nope = 20009, page_add_rmap_1st = 479482,
page_add_rmap_2nd = 357745, page_add_rmap_3rd = 3618421, page_remove_rmap = 4119825,
page_remove_rmap_1st = 479263, page_remove_rmap_nope = 20001, add_put_dirty_page = 8742,
add_copy_page_range = 2774954, add_do_wp_page = 272151, add_do_swap_page = 0, add_do_anonymous_page = 93689,
add_do_no_page = 1029498, add_copy_one_pte = 0, remove_zap_pte_range = 3863194, remove_do_wp_page = 272880,
remove_copy_one_pte = 0, do_no_page = 1119244, do_swap_page = 0, do_wp_page = 423034,
nr_copy_page_ranges = 174521, nr_forks = 12477}
What we see here is:
- We did 12477 forks
- those forks called copy_page_range() 174,521 times in total
- Of the 4,106,673 calls to page_add_rmap, 2,774,954 came from
copy_page_range and 1,029,498 came from do_no_page.
- Of the 4,119,825 calls to page_remove_rmap(), 3,863,194 came
from zap_page_range().
So it's pretty much all happening in fork() and exit().
Instruction-level profiling of page_add_rmap shows:
c013ab24 4074 8.63026 0 0 page_add_rmap
c013ab24 11 0.270005 0 0
c013ab25 67 1.64458 0 0
c013ab28 12 0.294551 0 0
c013ab29 12 0.294551 0 0
c013ab2b 1 0.0245459 0 0
c013ab2d 8 0.196367 0 0
c013ab38 313 7.68287 0 0
c013ab3e 7 0.171821 0 0
c013ab40 1 0.0245459 0 0
c013ab43 6 0.147275 0 0
c013ab46 1 0.0245459 0 0
c013ab4e 5 0.12273 0 0
c013ab53 5 0.12273 0 0
c013ab58 7 0.171821 0 0
c013ab5d 1364 33.4806 0 0 (pte_chain_lock)
c013ab61 4 0.0981836 0 0
c013ab63 13 0.319097 0 0
c013ab66 14 0.343643 0 0
c013ab69 1 0.0245459 0 0
c013ab6b 17 0.41728 0 0
c013ab70 2 0.0490918 0 0
c013ab73 41 1.00638 0 0
c013ab78 8 0.196367 0 0
c013ab7a 1 0.0245459 0 0
c013ab7f 4 0.0981836 0 0
c013ab84 16 0.392734 0 0
c013ab87 1 0.0245459 0 0
c013ab9a 102 2.50368 0 0
c013aba0 33 0.810015 0 0
c013aba4 33 0.810015 0 0
c013aba6 3 0.0736377 0 0
c013abab 13 0.319097 0 0
c013abb0 8 0.196367 0 0
c013abb3 2 0.0490918 0 0
c013abb5 7 0.171821 0 0
c013abb8 2 0.0490918 0 0
c013abbe 247 6.06284 0 0
c013abc0 1 0.0245459 0 0
c013abc3 6 0.147275 0 0
c013abcd 55 1.35002 0 0
c013abd3 39 0.95729 0 0
c013abd8 1 0.0245459 0 0
c013abdd 1468 36.0334 0 0 (pte_chain_unlock)
c013abe7 46 1.12911 0 0
c013abea 9 0.220913 0 0
c013abf0 42 1.03093 0 0
c013abf4 4 0.0981836 0 0
c013abf5 3 0.0736377 0 0
c013abf8 8 0.196367 0 0
And page_remove_rmap():
c013abfc 6600 13.9813 0 0 page_remove_rmap
c013abfc 5 0.0757576 0 0
c013abfd 10 0.151515 0 0
c013ac00 1 0.0151515 0 0
c013ac01 23 0.348485 0 0
c013ac02 21 0.318182 0 0
c013ac06 2 0.030303 0 0
c013ac08 1 0.0151515 0 0
c013ac11 339 5.13636 0 0
c013ac17 9 0.136364 0 0
c013ac20 1 0.0151515 0 0
c013ac26 5 0.0757576 0 0
c013ac2b 5 0.0757576 0 0
c013ac36 1 0.0151515 0 0
c013ac40 20 0.30303 0 0
c013ac45 3 0.0454545 0 0
c013ac4a 2399 36.3485 0 0 (The pte_chain_lock)
c013ac50 18 0.272727 0 0
c013ac53 13 0.19697 0 0
c013ac58 15 0.227273 0 0
c013ac60 3 0.0454545 0 0
c013ac63 50 0.757576 0 0
c013ac68 28 0.424242 0 0
c013ac6d 6 0.0909091 0 0
c013ac80 32 0.484848 0 0
c013ac86 42 0.636364 0 0
c013ac94 3 0.0454545 0 0
c013ac97 11 0.166667 0 0
c013ac99 3 0.0454545 0 0
c013ac9b 1 0.0151515 0 0
c013aca0 10 0.151515 0 0 (The `for (pc = page->pte.chain)' loop)
c013aca3 2633 39.8939 0 0
c013aca5 5 0.0757576 0 0
c013aca6 23 0.348485 0 0
c013aca7 2 0.030303 0 0
c013aca8 2 0.030303 0 0
c013acad 15 0.227273 0 0
c013acb0 29 0.439394 0 0
c013acb6 218 3.30303 0 0
c013acbb 2 0.030303 0 0
c013acbe 3 0.0454545 0 0
c013acc3 20 0.30303 0 0
c013accd 1 0.0151515 0 0
c013acd0 6 0.0909091 0 0
c013acd2 2 0.030303 0 0
c013acd4 2 0.030303 0 0
c013acd6 12 0.181818 0 0
c013ace0 7 0.106061 0 0
c013ace5 1 0.0151515 0 0
c013acea 6 0.0909091 0 0
c013acf3 34 0.515152 0 0
c013acf8 1 0.0151515 0 0
c013acfd 411 6.22727 0 0 (Probably the pte_chain_unlock)
c013ad03 4 0.0606061 0 0
c013ad04 57 0.863636 0 0
c013ad05 10 0.151515 0 0
c013ad06 6 0.0909091 0 0
c013ad09 8 0.121212 0 0
The page_add_rmap() one is interesting - the pte_chain_unlock() is as expensive
as the pte_chain_lock(). Which would tend to indicate either that the page->flags
has expired from cache or some other CPU has stolen it.
It is interesting to note that the length of the pte_chain is not a big
factor in all of this. So changing the singly-linked list to something
else probably won't help much.
Instrumentation of pte_chain_lock() shows:
nr_chain_locks = 8152300
nr_chain_lock_contends = 22436
nr_chain_lock_spins = 1946858
So the lock is only contended 0.3% of the time. And when it _is_
contended, the waiting CPU spins an average of 87 loops.
Which leads one to conclude that the page->flags has just been
naturally evicted from cache. So the next obvious step is to move a
lot of code out of the locked regions.
debug.patch moves the kmem_cache_alloc() and kmem_cache_free() calls
outside the locked region. But it doesn't help.
So I don't know why the pte_chain_unlock() is so expensive in there.
But even if it could be fixed, we're still too slow.
My gut feel here is that this will be hard to tweak - some algorithmic
change will be needed.
The pte_chains are doing precisely zilch but chew CPU cycles with this
workload. The machine has 2G of memory free. The rmap is pure overhead.
Would it be possible to not build the pte_chain _at all_ until it is
actually needed? Do it lazily? So in the page reclaim code, if the
page has no rmap chain we go off and build it then? This would require
something like a pfn->pte lookup function at the vma level, and a
page->vmas_which_own_me lookup.
Nice thing about this is that a) we already have page->flags
exclusively owned at that time, so the pte_chain_lock() _should_ be
cheap. And b) if the rmap chain is built in this way, all the
pte_chain structures against a page will have good
locality-of-reference, so the chain walk will involve far fewer cache
misses.
Then again, if the per-vma pfn->pte lookup is feasible, we may not need
the pte_chain at all...
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
* Re: page_add/remove_rmap costs
From: William Lee Irwin III @ 2002-07-24 6:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
On Tue, Jul 23, 2002 at 11:33:20PM -0700, Andrew Morton wrote:
> So I don't know why the pte_chain_unlock() is so expensive in there.
> But even if it could be fixed, we're still too slow.
> My gut feel here is that this will be hard to tweak - some algorithmic
> change will be needed.
Atomic operation on a cold/unowned/falsely shared cache line. The
operation needs to be avoided when possible.
On Tue, Jul 23, 2002 at 11:33:20PM -0700, Andrew Morton wrote:
> The pte_chains are doing precisely zilch but chew CPU cycles with this
> workload. The machine has 2G of memory free. The rmap is pure overhead.
> Would it be possible to not build the pte_chain _at all_ until it is
> actually needed? Do it lazily? So in the page reclaim code, if the
> page has no rmap chain we go off and build it then? This would require
> something like a pfn->pte lookup function at the vma level, and a
> page->vmas_which_own_me lookup.
The space overhead of keeping them up to date can be mitigated, but
this time overhead can't be circumvented so long as strict per-pte
updates are required. I'm uncertain about lazy construction of them; I
suspect it will raise OOM issues (allocating in order to evict) and
that the chains will often be constructed only never to be used again,
but I'm not sure.
On Tue, Jul 23, 2002 at 11:33:20PM -0700, Andrew Morton wrote:
> Nice thing about this is that a) we already have page->flags
> exclusively owned at that time, so the pte_chain_lock() _should_ be
> cheap. And b) if the rmap chain is built in this way, all the
> pte_chain structures against a page will have good
> locality-of-reference, so the chain walk will involve far fewer cache
> misses.
This is a less invasive proposal than various others that have been
going around, and could probably be tried and tested quickly.
Cheers,
Bill
* Re: page_add/remove_rmap costs
From: Rik van Riel @ 2002-07-24 16:24 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
On Tue, 23 Jul 2002, Andrew Morton wrote:
> It's just a ton of forking and exitting.
And exec()ing ...
> What we see here is:
>
> - We did 12477 forks
> - those forks called copy_page_range() 174,521 times in total
> - Of the 4,106,673 calls to page_add_rmap, 2,774,954 came from
> copy_page_range and 1,029,498 came from do_no_page.
> - Of the 4,119,825 calls to page_remove_rmap(), 3,863,194 came
> from zap_page_range().
>
> So it's pretty much all happening in fork() and exit().
And exec() ... In fact, I suspect that about half of the calls
to page_remove_rmap() are coming via exec().
> The page_add_rmap() one is interesting - the pte_chain_unlock() is as
> expensive as the pte_chain_lock(). Which would tend to indicate either
> that the page->flags has expired from cache or some other CPU has stolen
> it.
>
> It is interesting to note that the length of the pte_chain is not a big
> factor in all of this. So changing the singly-linked list to something
> else probably won't help much.
This is more disturbing ... ;)
> My gut feel here is that this will be hard to tweak - some algorithmic
> change will be needed.
>
> The pte_chains are doing precisely zilch but chew CPU cycles with this
> workload. The machine has 2G of memory free. The rmap is pure overhead.
>
> Would it be possible to not build the pte_chain _at all_ until it is
> actually needed? Do it lazily? So in the page reclaim code, if the
> page has no rmap chain we go off and build it then? This would require
> something like a pfn->pte lookup function at the vma level, and a
> page->vmas_which_own_me lookup.
> Then again, if the per-vma pfn->pte lookup is feasible, we may not need
> the pte_chain at all...
It is feasible, both davem and bcrl made code to this effect. The
only problem with that code is that it gets ugly quick after mremap.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
* Re: page_add/remove_rmap costs
From: Andrew Morton @ 2002-07-24 20:15 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm
Rik van Riel wrote:
>
> ...
> > It is interesting to note that the length of the pte_chain is not a big
> > factor in all of this. So changing the singly-linked list to something
> > else probably won't help much.
>
> This is more disturbing ... ;)
Well yes. It may well indicate that my test is mostly LIFO
on the pte chains. So FIFO workloads would be worse.
> > My gut feel here is that this will be hard to tweak - some algorithmic
> > change will be needed.
> >
> > The pte_chains are doing precisely zilch but chew CPU cycles with this
> > workload. The machine has 2G of memory free. The rmap is pure overhead.
> >
> > Would it be possible to not build the pte_chain _at all_ until it is
> > actually needed? Do it lazily? So in the page reclaim code, if the
> > page has no rmap chain we go off and build it then? This would require
> > something like a pfn->pte lookup function at the vma level, and a
> > page->vmas_which_own_me lookup.
>
> > Then again, if the per-vma pfn->pte lookup is feasible, we may not need
> > the pte_chain at all...
>
> It is feasible, both davem and bcrl made code to this effect. The
> only problem with that code is that it gets ugly quick after mremap.
So.. who's going to do it?
It's early days yet - although this looks bad on benchmarks we really
need a better understanding of _why_ it's so bad, and of whether it
really matters for real workloads.
For example: given that copy_page_range performs atomic ops against
page->count, how come page_add_rmap()'s atomic op against page->flags
is more of a problem?
* Re: page_add/remove_rmap costs
From: Rik van Riel @ 2002-07-24 20:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
On Wed, 24 Jul 2002, Andrew Morton wrote:
> > > Then again, if the per-vma pfn->pte lookup is feasible, we may not need
> > > the pte_chain at all...
> >
> > It is feasible, both davem and bcrl made code to this effect. The
> > only problem with that code is that it gets ugly quick after mremap.
>
> So.. who's going to do it?
>
> It's early days yet - although this looks bad on benchmarks we really
> need a better understanding of _why_ it's so bad, and of whether it
> really matters for real workloads.
I guess I'll take a stab at bcrl's and davem's code and will
try to also hide it behind an rmap.c interface ;)
> For example: given that copy_page_range performs atomic ops against
> page->count, how come page_add_rmap()'s atomic op against page->flags
> is more of a problem?
Could it have something to do with cpu_relax() delaying
things ?
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
* Re: page_add/remove_rmap costs
From: Andrew Morton @ 2002-07-24 20:28 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm
Rik van Riel wrote:
>
> On Wed, 24 Jul 2002, Andrew Morton wrote:
>
> > > > Then again, if the per-vma pfn->pte lookup is feasible, we may not need
> > > > the pte_chain at all...
> > >
> > > It is feasible, both davem and bcrl made code to this effect. The
> > > only problem with that code is that it gets ugly quick after mremap.
> >
> > So.. who's going to do it?
> >
> > It's early days yet - although this looks bad on benchmarks we really
> > need a better understanding of _why_ it's so bad, and of whether it
> > really matters for real workloads.
>
> I guess I'll take a stab at bcrl's and davem's code and will
> try to also hide it behind an rmap.c interface ;)
hmm, OK. Big job...
> > For example: given that copy_page_range performs atomic ops against
> > page->count, how come page_add_rmap()'s atomic op against page->flags
> > is more of a problem?
>
> Could it have something to do with cpu_relax() delaying
> things ?
Don't think so. That's only executed on the contended case, which
is 0.3% of the time. But hey, it's easy enough to remove it and retest.
I shall do that.
* Re: page_add/remove_rmap costs
From: Rik van Riel @ 2002-07-25 2:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
On Wed, 24 Jul 2002, Andrew Morton wrote:
> Rik van Riel wrote:
> > On Wed, 24 Jul 2002, Andrew Morton wrote:
> >
> > I guess I'll take a stab at bcrl's and davem's code and will
> > try to also hide it behind an rmap.c interface ;)
>
> hmm, OK. Big job...
Absolutely, not a short term thing. In the short term
I'll split out the remainder of Craig Kulesa's big patch
and will send you bits and pieces.
> > > For example: given that copy_page_range performs atomic ops against
> > > page->count, how come page_add_rmap()'s atomic op against page->flags
> > > is more of a problem?
> >
> > Could it have something to do with cpu_relax() delaying
> > things ?
>
> Don't think so. That's only executed on the contended case,
You're right.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
* Re: page_add/remove_rmap costs
From: William Lee Irwin III @ 2002-07-25 2:45 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, linux-mm
On Tue, 23 Jul 2002, Andrew Morton wrote:
>> Then again, if the per-vma pfn->pte lookup is feasible, we may not need
>> the pte_chain at all...
On Wed, Jul 24, 2002 at 01:24:13PM -0300, Rik van Riel wrote:
> It is feasible, both davem and bcrl made code to this effect. The
> only problem with that code is that it gets ugly quick after mremap.
I actually took an axe to mremap recently, and although the pieces
never came back together into working code, it's clear that it's far
from optimal. It's doing a virtual sweep over the region and repeating
the pgd -> pmd -> pte traversals for each pte. So invading that territory
may well be justifiable on more grounds than rmap itself.
I may revisit mremap.c at some point in the distant future if that
tweaking alone is considered valuable.
Cheers,
Bill
* Re: page_add/remove_rmap costs
From: William Lee Irwin III @ 2002-07-25 3:08 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, linux-mm
On Wed, Jul 24, 2002 at 01:15:10PM -0700, Andrew Morton wrote:
> So.. who's going to do it?
> It's early days yet - although this looks bad on benchmarks we really
> need a better understanding of _why_ it's so bad, and of whether it
> really matters for real workloads.
> For example: given that copy_page_range performs atomic ops against
> page->count, how come page_add_rmap()'s atomic op against page->flags
> is more of a problem?
Hmm. It probably isn't harming more than benchmarks, but the loop is
pure bloat on UP. #ifdef that out someday. (Heck, don't even touch the
bit for UP except for debugging.)
Hypothesis:
There are too many cachelines to gain exclusive ownership of. It's not
the aggregate arrival rate, it's the aggregate cacheline-claiming
bandwidth needed to get exclusive ownership of all the pages' ->flags.
Experiment 1:
Group pages into blocks of say 2 or 4 for locality, and then hash each
pageblock to a lock. The worst case wrt. claiming cachelines is then
the size of the hash table divided by the size of the lock, but the
potential for cacheline contention exists.
Experiment 2:
Move ->flags to be adjacent to ->count and align struct page to a
divisor of the cacheline size or play tricks to get it down to 32B. =)
Experiment 3:
Compare magic oprofile perfcounter stuff between 2.5.26 and 2.5.27
and do divination based on whatever the cache counters say.
Cheers,
Bill
* Re: page_add/remove_rmap costs
From: Martin J. Bligh @ 2002-07-25 3:14 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton; +Cc: Rik van Riel, linux-mm
> On Wed, Jul 24, 2002 at 01:15:10PM -0700, Andrew Morton wrote:
>> So.. who's going to do it?
>> It's early days yet - although this looks bad on benchmarks we really
>> need a better understanding of _why_ it's so bad, and of whether it
>> really matters for real workloads.
>> For example: given that copy_page_range performs atomic ops against
>> page->count, how come page_add_rmap()'s atomic op against page->flags
>> is more of a problem?
If it's bouncing the lock cacheline around that's suspected to be
the problem, might it be faster to take a per-zone lock, rather
than a per-page lock, and batch the work up? Maybe we used a little
too much explosive when we broke up the global lock?
M.
* Re: page_add/remove_rmap costs
From: Andrew Morton @ 2002-07-25 4:21 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Rik van Riel, linux-mm
William Lee Irwin III wrote:
>
> On Wed, Jul 24, 2002 at 01:15:10PM -0700, Andrew Morton wrote:
> > So.. who's going to do it?
> > It's early days yet - although this looks bad on benchmarks we really
> > need a better understanding of _why_ it's so bad, and of whether it
> > really matters for real workloads.
> > For example: given that copy_page_range performs atomic ops against
> > page->count, how come page_add_rmap()'s atomic op against page->flags
> > is more of a problem?
>
> Hmm. It probably isn't harming more than benchmarks, but the loop is
> pure bloat on UP. #ifdef that out someday. (Heck, don't even touch the
> bit for UP except for debugging.)
>
> Hypothesis:
> There are too many cachelines to gain exclusive ownership of. It's not
> the aggregate arrival rate, it's the aggregate cacheline-claiming
> bandwidth needed to get exclusive ownership of all the pages' ->flags.
Yup. But one would expect the access to lighten a subsequent
access to the page frame, so the aggregate cost would
be small. It's odd.
It'd be nice to see some hard numbers from a P4, or a PPC64
or something. I'm still wondering why the cost of the pte_chain_unlock()
is so high in page_remove_rmap(). That line should have still been
exclusively owned, but the PIII is going off-chip for some reason.
Is this general, or a peculiarity?
> Experiment 1:
> Group pages into blocks of say 2 or 4 for locality, and then hash each
> pageblock to a lock. The worst case wrt. claiming cachelines is then
> the size of the hash table divided by the size of the lock, but the
> potential for cacheline contention exists.
We could afford to do that. It'd take a bit of reorganising to hold a lock
across multiple page_add_rmap() calls though.
> Experiment 2:
> Move ->flags to be adjacent to ->count and align struct page to a
> divisor of the cacheline size or play tricks to get it down to 32B. =)
Oh crap. I thought I'd done that ages ago.
Whee. Moving page->flags to the zeroth offset shrunk linux
by 110 bytes!
> Experiment 3:
> Compare magic oprofile perfcounter stuff between 2.5.26 and 2.5.27
> and do divination based on whatever the cache counters say.
Using divine intervention is cheating.
* Re: page_add/remove_rmap costs
From: William Lee Irwin III @ 2002-07-25 4:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
On Tue, Jul 23, 2002 at 11:33:20PM -0700, Andrew Morton wrote:
> Been taking a look at the page_add_rmap/page_remove_rmap cost in 2.5.27
> on the quad pIII. The workload is ten instances of this script running
> concurrently:
The workload is 16 instances of the same script running on a 16 cpu NUMA-Q
with 16GB of RAM. oprofile results attached.
Cheers,
Bill
c0105340 3309367 51.0125 default_idle /boot/vmlinux-2.5.28-3
c0135667 1095488 16.8865 .text.lock.page_alloc /boot/vmlinux-2.5.28-3
00000fec 475107 7.32358 dump_one /lib/modules/2.5.28-3/kern
el/arch/i386/oprofile/oprofile.o
c0129c10 349662 5.3899 do_anonymous_page /boot/vmlinux-2.5.28-3
c01353e4 236718 3.64891 get_page_state /boot/vmlinux-2.5.28-3
c0112a84 213189 3.28622 load_balance /boot/vmlinux-2.5.28-3
c013d31c 71599 1.10367 page_add_rmap /boot/vmlinux-2.5.28-3
c013d3cc 67122 1.03466 page_remove_rmap /boot/vmlinux-2.5.28-3
c013d85a 64302 0.991189 .text.lock.rmap /boot/vmlinux-2.5.28-3
c013b2a0 49197 0.758352 blk_queue_bounce /boot/vmlinux-2.5.28-3
c0129eb4 47963 0.73933 do_no_page /boot/vmlinux-2.5.28-3
c010fa60 41383 0.637902 smp_apic_timer_interrupt /boot/vmlinux-2.5.28-3
c0134ba0 39661 0.611358 rmqueue /boot/vmlinux-2.5.28-3
c019f890 32770 0.505136 serial_in /boot/vmlinux-2.5.28-3
c0133714 28913 0.445682 lru_cache_add /boot/vmlinux-2.5.28-3
c0134840 27436 0.422915 __free_pages_ok /boot/vmlinux-2.5.28-3
c013d71c 25028 0.385796 pte_chain_alloc /boot/vmlinux-2.5.28-3
c01963c0 24580 0.378891 __generic_copy_to_user /boot/vmlinux-2.5.28-3
c0196408 19186 0.295744 __generic_copy_from_user /boot/vmlinux-2.5.28-3
c013d7b4 18805 0.289871 pte_chain_free /boot/vmlinux-2.5.28-3
c01338cb 16173 0.2493 .text.lock.swap /boot/vmlinux-2.5.28-3
c01281e0 12649 0.194979 zap_pte_range /boot/vmlinux-2.5.28-3
c012d30c 11764 0.181337 file_read_actor /boot/vmlinux-2.5.28-3
c012a230 11738 0.180936 handle_mm_fault /boot/vmlinux-2.5.28-3
c0135c10 11062 0.170516 free_page_and_swap_cache /boot/vmlinux-2.5.28-3
c0112f4c 10638 0.16398 scheduler_tick /boot/vmlinux-2.5.28-3
c0129164 10439 0.160913 do_wp_page /boot/vmlinux-2.5.28-3
c010d0f0 9551 0.147225 timer_interrupt /boot/vmlinux-2.5.28-3
c0133864 8974 0.138331 lru_cache_del /boot/vmlinux-2.5.28-3
c01350b8 6378 0.0983143 __alloc_pages /boot/vmlinux-2.5.28-3
c0140430 6215 0.0958017 get_empty_filp /boot/vmlinux-2.5.28-3
c01352b8 5968 0.0919943 page_cache_release /boot/vmlinux-2.5.28-3
c0135324 5119 0.0789073 nr_free_pages /boot/vmlinux-2.5.28-3
c012ce18 4951 0.0763177 find_get_page /boot/vmlinux-2.5.28-3
c01406c0 4217 0.0650033 __fput /boot/vmlinux-2.5.28-3
c014aaf8 4188 0.0645563 link_path_walk /boot/vmlinux-2.5.28-3
c014e828 3276 0.0504982 kill_fasync /boot/vmlinux-2.5.28-3
c013ab78 3085 0.047554 kmap_high /boot/vmlinux-2.5.28-3
c014da9d 3012 0.0464288 .text.lock.namei /boot/vmlinux-2.5.28-3
c013b573 2976 0.0458738 .text.lock.highmem /boot/vmlinux-2.5.28-3
c0120620 2886 0.0444865 update_one_process /boot/vmlinux-2.5.28-3
c0127e88 2847 0.0438853 copy_page_range /boot/vmlinux-2.5.28-3
c012a6b0 2721 0.0419431 vm_enough_memory /boot/vmlinux-2.5.28-3
c0110f34 2678 0.0412803 pgd_alloc /boot/vmlinux-2.5.28-3
c0154608 2394 0.0369025 __d_lookup /boot/vmlinux-2.5.28-3
c0107d50 2384 0.0367484 page_fault /boot/vmlinux-2.5.28-3
c0107b98 2236 0.034467 apic_timer_interrupt /boot/vmlinux-2.5.28-3
c0111080 2205 0.0339892 pte_alloc_one /boot/vmlinux-2.5.28-3
c013ead0 2170 0.0334497 dentry_open /boot/vmlinux-2.5.28-3
c0196650 2074 0.0319699 atomic_dec_and_lock /boot/vmlinux-2.5.28-3
* Re: page_add/remove_rmap costs
From: Andrew Morton @ 2002-07-25 5:14 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-mm
William Lee Irwin III wrote:
>
> On Tue, Jul 23, 2002 at 11:33:20PM -0700, Andrew Morton wrote:
> > Been taking a look at the page_add_rmap/page_remove_rmap cost in 2.5.27
> > on the quad pIII. The workload is ten instances of this script running
> > concurrently:
>
> The workload is 16 instances of the same script running on a 16 cpu NUMA-Q
> with 16GB of RAM. oprofile results attached.
These results look funny.
What I do is:
1) rm -rf /var/opd
2) start test
3) op_start --map-file=/boot/System.map --vmlinux=/boot/vmlinux --ctr0-event=CPU_CLK_UNHALTED --ctr0-count=600000
4) sleep 20
5) op_stop
6) oprofpp -l -i /boot/vmlinux
>
> c0105340 3309367 51.0125 default_idle /boot/vmlinux-2.5.28-3
How come?
> c0135667 1095488 16.8865 .text.lock.page_alloc /boot/vmlinux-2.5.28-3
zone->lock?
> 00000fec 475107 7.32358 dump_one /lib/modules/2.5.28-3/kern
that's part of oprofile.
> el/arch/i386/oprofile/oprofile.o
> c0129c10 349662 5.3899 do_anonymous_page /boot/vmlinux-2.5.28-3
OK.
> c01353e4 236718 3.64891 get_page_state /boot/vmlinux-2.5.28-3
whoa. Who's calling that so often? Any patches applied there?
> c0112a84 213189 3.28622 load_balance /boot/vmlinux-2.5.28-3
I thought you'd disabled this?
> c013d31c 71599 1.10367 page_add_rmap /boot/vmlinux-2.5.28-3
> c013d3cc 67122 1.03466 page_remove_rmap /boot/vmlinux-2.5.28-3
page_add_rmap is more expensive than page_remove_rmap.
So again, the list length isn't the #1 problem.
> c013d85a 64302 0.991189 .text.lock.rmap /boot/vmlinux-2.5.28-3
pte_chain_freelist_lock?
* Re: page_add/remove_rmap costs
2002-07-25 5:14 ` Andrew Morton
@ 2002-07-25 5:15 ` John Levon
2002-07-25 5:30 ` William Lee Irwin III
2002-07-25 5:47 ` Andrew Morton
0 siblings, 2 replies; 20+ messages in thread
From: John Levon @ 2002-07-25 5:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, linux-mm
On Wed, Jul 24, 2002 at 10:14:37PM -0700, Andrew Morton wrote:
> > c0135667 1095488 16.8865 .text.lock.page_alloc /boot/vmlinux-2.5.28-3
>
> zone->lock?
I wrote a patch some time ago to remove all this guesswork on lock call
sites :
http://marc.theaimsgroup.com/?l=linux-kernel&m=101586797421268&w=2
It seemed to work quite well with my limited testing on my 2-way ...
(pity it macrofies stuff)
> > c0112a84 213189 3.28622 load_balance /boot/vmlinux-2.5.28-3
>
> I thought you'd disabled this?
Maybe wli used "op_session", and this was from a previous run. oprofile
< 0.3 had a bug where the vmlinux samples file wasn't moved.
regards
john
--
"Hungarian notation is the tactical nuclear weapon of source code obfuscation
techniques."
- Roedy Green
* Re: page_add/remove_rmap costs
2002-07-25 5:15 ` John Levon
@ 2002-07-25 5:30 ` William Lee Irwin III
2002-07-25 5:47 ` Andrew Morton
1 sibling, 0 replies; 20+ messages in thread
From: William Lee Irwin III @ 2002-07-25 5:30 UTC (permalink / raw)
To: John Levon; +Cc: Andrew Morton, linux-mm
On Thu, Jul 25, 2002 at 06:15:52AM +0100, John Levon wrote:
> Maybe wli used "op_session", and this was from a previous run. oprofile
> < 0.3 had a bug where the vmlinux samples file wasn't moved.
I used an explicit session file argument.
Cheers,
Bill
* Re: page_add/remove_rmap costs
2002-07-25 5:47 ` Andrew Morton
@ 2002-07-25 5:42 ` William Lee Irwin III
2002-07-25 5:59 ` Andrew Morton
0 siblings, 1 reply; 20+ messages in thread
From: William Lee Irwin III @ 2002-07-25 5:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: John Levon, linux-mm
John Levon wrote:
>> I wrote a patch some time ago to remove all this guesswork on lock call
>> sites :
>>
On Wed, Jul 24, 2002 at 10:47:47PM -0700, Andrew Morton wrote:
> Me too, but I just killed all the out-of-line gunk, so the cost
> is shown at the actual callsite.
It will be applied shortly. I've also been building with -g, so addr2line
will resolve the rest given appropriate dumping formats.
What's the op_time / oprofpp command that gives per-EIP sample frequencies?
Cheers,
Bill
* Re: page_add/remove_rmap costs
2002-07-25 5:15 ` John Levon
2002-07-25 5:30 ` William Lee Irwin III
@ 2002-07-25 5:47 ` Andrew Morton
2002-07-25 5:42 ` William Lee Irwin III
1 sibling, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2002-07-25 5:47 UTC (permalink / raw)
To: John Levon; +Cc: William Lee Irwin III, linux-mm
John Levon wrote:
>
> On Wed, Jul 24, 2002 at 10:14:37PM -0700, Andrew Morton wrote:
>
> > > c0135667 1095488 16.8865 .text.lock.page_alloc /boot/vmlinux-2.5.28-3
> >
> > zone->lock?
>
> I wrote a patch some time ago to remove all this guesswork on lock call
> sites :
>
Me too, but I just killed all the out-of-line gunk, so the cost
is shown at the actual callsite.
--- 2.5.24/include/asm-i386/spinlock.h~spinlock-inline Fri Jun 21 13:12:01 2002
+++ 2.5.24-akpm/include/asm-i386/spinlock.h Fri Jun 21 13:18:12 2002
@@ -46,13 +46,13 @@ typedef struct {
"\n1:\t" \
"lock ; decb %0\n\t" \
"js 2f\n" \
- LOCK_SECTION_START("") \
+ "jmp 3f\n" \
"2:\t" \
"cmpb $0,%0\n\t" \
"rep;nop\n\t" \
"jle 2b\n\t" \
"jmp 1b\n" \
- LOCK_SECTION_END
+ "3:\t" \
/*
* This works. Despite all the confusion.
--- 2.5.24/include/asm-i386/rwlock.h~spinlock-inline Fri Jun 21 13:18:33 2002
+++ 2.5.24-akpm/include/asm-i386/rwlock.h Fri Jun 21 13:22:09 2002
@@ -22,25 +22,19 @@
#define __build_read_lock_ptr(rw, helper) \
asm volatile(LOCK "subl $1,(%0)\n\t" \
- "js 2f\n" \
- "1:\n" \
- LOCK_SECTION_START("") \
- "2:\tcall " helper "\n\t" \
- "jmp 1b\n" \
- LOCK_SECTION_END \
+ "jns 1f\n\t" \
+ "call " helper "\n\t" \
+ "1:\t" \
::"a" (rw) : "memory")
#define __build_read_lock_const(rw, helper) \
asm volatile(LOCK "subl $1,%0\n\t" \
- "js 2f\n" \
- "1:\n" \
- LOCK_SECTION_START("") \
- "2:\tpushl %%eax\n\t" \
+ "jns 1f\n\t" \
+ "pushl %%eax\n\t" \
"leal %0,%%eax\n\t" \
"call " helper "\n\t" \
"popl %%eax\n\t" \
- "jmp 1b\n" \
- LOCK_SECTION_END \
+ "1:\t" \
:"=m" (*(volatile int *)rw) : : "memory")
#define __build_read_lock(rw, helper) do { \
@@ -52,25 +46,19 @@
#define __build_write_lock_ptr(rw, helper) \
asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",(%0)\n\t" \
- "jnz 2f\n" \
+ "jz 1f\n\t" \
+ "call " helper "\n\t" \
"1:\n" \
- LOCK_SECTION_START("") \
- "2:\tcall " helper "\n\t" \
- "jmp 1b\n" \
- LOCK_SECTION_END \
::"a" (rw) : "memory")
#define __build_write_lock_const(rw, helper) \
asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",%0\n\t" \
- "jnz 2f\n" \
- "1:\n" \
- LOCK_SECTION_START("") \
- "2:\tpushl %%eax\n\t" \
+ "jz 1f\n\t" \
+ "pushl %%eax\n\t" \
"leal %0,%%eax\n\t" \
"call " helper "\n\t" \
"popl %%eax\n\t" \
- "jmp 1b\n" \
- LOCK_SECTION_END \
+ "1:\n" \
:"=m" (*(volatile int *)rw) : : "memory")
#define __build_write_lock(rw, helper) do { \
-
* Re: page_add/remove_rmap costs
2002-07-25 5:42 ` William Lee Irwin III
@ 2002-07-25 5:59 ` Andrew Morton
0 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2002-07-25 5:59 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: John Levon, linux-mm
William Lee Irwin III wrote:
>
> John Levon wrote:
> >> I wrote a patch some time ago to remove all this guesswork on lock call
> >> sites :
> >>
>
> On Wed, Jul 24, 2002 at 10:47:47PM -0700, Andrew Morton wrote:
> > Me too, but I just killed all the out-of-line gunk, so the cost
> > is shown at the actual callsite.
>
> It will be applied shortly. I've also been building with -g, so addr2line
> will resolve the rest given appropriate dumping formats.
Hope it still works.
> What's the op_time / oprofpp command that gives per-EIP sample frequencies?
I use
oprofpp -L -i /boot/vmlinux
oprofile can also allegedly do eip->file-n-line resolution,
but I'm not sure how that works when you're cross-building.
And generally I doubt it's useful for kernel stuff, because
the EIP usually resolves to something like test_and_set_bit().
So I just fire up gdb on vmlinux and walk up and down a few bytes until
the address->line resolution falls out of the inline function and
into the caller.
-
* Re: page_add/remove_rmap costs
2002-07-25 4:50 ` William Lee Irwin III
2002-07-25 5:14 ` Andrew Morton
@ 2002-07-25 7:09 ` Andrew Morton
1 sibling, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2002-07-25 7:09 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-mm
well I tried a few things.
- Disable the pte_chain_lock stuff for uniprocessor
builds.
- Disable the cpu_relax()
- shuffle struct page to put ->flags and ->count next
to each other.
Uniprocessor:
c01c9138 122 0.96649 strnlen_user
c0145860 162 1.28337 __d_lookup
c012c2c4 179 1.41805 rmqueue
c010ba68 180 1.42597 timer_interrupt
c01120ec 190 1.50519 do_page_fault
c013d910 191 1.51311 link_path_walk
c01052c8 219 1.73493 poll_idle
c0132e44 227 1.7983 page_add_rmap
c0122e00 237 1.87753 clear_page_tables
c0111e40 264 2.09142 pte_alloc_one
c0123018 287 2.27363 copy_page_range
c0124324 296 2.34493 do_anonymous_page
c012aa70 471 3.73128 kmem_cache_alloc
c0123224 483 3.82635 zap_pte_range
c01077c4 484 3.83427 page_fault
c0124490 547 4.33336 do_no_page
c012ac5c 560 4.43635 kmem_cache_free
c0132f1c 940 7.44672 page_remove_rmap
c0123cb0 2581 20.4468 do_wp_page
So page_add_rmap went away.
page_remove_rmap:
c0132f8a 1 0.106383 0 0
c0132f8d 1 0.106383 0 0
c0132f93 1 0.106383 0 0
c0132fa4 3 0.319149 0 0
c0132fa7 56 5.95745 0 0 the `for' loop
c0132fa9 2 0.212766 0 0
c0132fab 4 0.425532 0 0
c0132fb0 13 1.38298 0 0
c0132fb3 574 61.0638 0 0 if (pc->ptep == ptep)
c0132fb5 1 0.106383 0 0
c0132fb6 13 1.38298 0 0
c0132fb9 2 0.212766 0 0
c0132fba 4 0.425532 0 0
And the page_remove_rmap cost is now in the list walk.
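The walk in question looks roughly like this (a user-space sketch; the struct and field names only approximate the 2.5 rmap code, and the kernel allocates pte_chain nodes from a slab, not malloc):

```c
#include <stdlib.h>

/* Simplified model of the 2.5-era rmap pte_chain: each struct page
 * carries a singly linked list of pte_chain entries, one per pte
 * that currently maps the page. */
typedef unsigned long pte_t;

struct pte_chain {
	struct pte_chain *next;
	pte_t *ptep;
};

struct page {
	struct pte_chain *pte_chain;
};

/* The hot path the profile points at: scan the chain until
 * pc->ptep == ptep, then unlink and free that entry.  Cost is
 * linear in the number of mappings of the page. */
static void page_remove_rmap(struct page *page, pte_t *ptep)
{
	struct pte_chain **pcp;

	for (pcp = &page->pte_chain; *pcp; pcp = &(*pcp)->next) {
		if ((*pcp)->ptep == ptep) {	/* the 61% comparison */
			struct pte_chain *pc = *pcp;
			*pcp = pc->next;
			free(pc);
			return;
		}
	}
}
```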
But the SMP performance is unaltered by these changes.
c0129818 1329 2.42966 do_anonymous_page
c01338dc 1501 2.74411 page_cache_release
c0129a10 2157 3.9434 do_no_page
c0128128 2581 4.71855 copy_page_range
c0128390 2655 4.85384 zap_pte_range
c013a944 4356 7.96358 page_add_rmap
c013aaa0 8423 15.3988 page_remove_rmap
c0128ff8 8457 15.461 do_wp_page
For page_remove_rmap, 32% is the pte_chain_lock, 35%
is the list walk and 12% is the pte_chain_unlock.
* Re: page_add/remove_rmap costs
2002-07-24 6:33 page_add/remove_rmap costs Andrew Morton
` (2 preceding siblings ...)
2002-07-25 4:50 ` William Lee Irwin III
@ 2002-07-26 7:33 ` Daniel Phillips
3 siblings, 0 replies; 20+ messages in thread
From: Daniel Phillips @ 2002-07-26 7:33 UTC (permalink / raw)
To: Andrew Morton, linux-mm; +Cc: Paul Mackerras
On Wednesday 24 July 2002 08:33, Andrew Morton wrote:
> With rmap oprofile says:
>
> ./doitlots.sh 10 41.67s user 95.04s system 398% cpu 34.338 total
>
> [...]
>
> And without rmap it says:
>
> ./doitlots.sh 10 43.01s user 76.19s system 394% cpu 30.222 total
>
> [...]
>
> What we see here is:
>
> - We did 12477 forks
> - those forks called copy_page_range() 174,521 times in total
> - Of the 4,106,673 calls to page_add_rmap, 2,774,954 came from
> copy_page_range and 1,029,498 came from do_no_page.
> - Of the 4,119,825 calls to page_remove_rmap(), 3,863,194 came
> from zap_page_range().
>
> [...]
>
> So it's pretty much all happening in fork() and exit().
> My gut feel here is that this will be hard to tweak - some algorithmic
> change will be needed.
Indeed. This is why I developed the refcount-based page table sharing technique
earlier this year: to eliminate the cost of setting up and tearing down
pte_chains that never get called upon to do anything useful. There's still
some work to do on the patch:
nl.linux.org/~phillips/patches/ptab-2.4.17-3
But the interesting part works.
I now know how to do the tlb invalidate on unmap efficiently; in fact Linus
knew right away at the time, but I had to work my way through some basics to
understand what he was going on about. In short, we need to chain the pte
pages to the mms they belong to. Each mm (already) carries a bitmap of
processors the mm is active on, so we just OR all those bitmaps together and
call the flavor of interprocessor tlb invalidate that operates on the
resulting bitmap.
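A user-space sketch of that scheme (the cpu_vm_mask and share-chain field names are illustrative, not taken from the actual patch):

```c
/* Each mm already carries a bitmap of CPUs it is active on.  For a
 * shared pte page, OR together the bitmaps of every mm the pte page
 * is chained to, and send a single cross-CPU tlb invalidate to the
 * resulting mask instead of one per mm. */
typedef unsigned long cpumask_t;

struct mm {
	cpumask_t cpu_vm_mask;	/* CPUs this mm is active on */
	struct mm *next_share;	/* chain of mms sharing the pte page */
};

static cpumask_t tlb_mask_for_shared_ptepage(struct mm *mm_chain)
{
	cpumask_t mask = 0;

	for (; mm_chain; mm_chain = mm_chain->next_share)
		mask |= mm_chain->cpu_vm_mask;
	return mask;	/* hand this to the IPI-based flush */
}
```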
The same optimization as for pte_chains applies: if the pte page isn't
shared, we can set a bit in the page->flags and point directly at the mm, or
we can use the vma, which is conveniently hanging around when needed. I
prefer the former because it's more forward looking: we should be able to
dispense entirely with looking up the vma in some common situations. Also,
it's more symmetric with the existing page pte_chain code.
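One way to encode that unshared case, sketched in user space with a low pointer tag standing in for the proposed page->flags bit (all names are illustrative; the pointee's word alignment is what frees the low bit):

```c
#include <stdint.h>

/* When the pte page is mapped by only one mm, avoid allocating a
 * chain node: tag the stored value as "direct" and point straight
 * at the mm.  This mirrors rmap's existing direct-pte optimization. */
#define PTEPAGE_DIRECT 1UL

static uintptr_t make_direct(void *mm)
{
	return (uintptr_t)mm | PTEPAGE_DIRECT;
}

static int is_direct(uintptr_t v)
{
	return (v & PTEPAGE_DIRECT) != 0;
}

static void *direct_mm(uintptr_t v)
{
	return (void *)(v & ~PTEPAGE_DIRECT);
}
```

An untagged value would then be interpreted as a pointer to the head of a share chain, exactly as the page pte_chain code distinguishes its direct case.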
There is also Linus's suggestion for eliminating most (all?) of the locking
in my patch, which has the side effect of doing early reclaim of page tables.
Paul Mackerras did some work on this patch and was easily able to produce a
functional version of it, though I wouldn't call it the most elegant thing in
the world: he unshares the page tables on swap-out (ugh) and swapoff (who
cares). But I don't think he really knew the relationship between page table
sharing and rmap. It should be clear now.
--
Daniel
end of thread, other threads:[~2002-07-26 7:33 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-07-24 6:33 page_add/remove_rmap costs Andrew Morton
2002-07-24 6:48 ` William Lee Irwin III
2002-07-24 16:24 ` Rik van Riel
2002-07-24 20:15 ` Andrew Morton
2002-07-24 20:21 ` Rik van Riel
2002-07-24 20:28 ` Andrew Morton
2002-07-25 2:35 ` Rik van Riel
2002-07-25 3:08 ` William Lee Irwin III
2002-07-25 3:14 ` Martin J. Bligh
2002-07-25 4:21 ` Andrew Morton
2002-07-25 2:45 ` William Lee Irwin III
2002-07-25 4:50 ` William Lee Irwin III
2002-07-25 5:14 ` Andrew Morton
2002-07-25 5:15 ` John Levon
2002-07-25 5:30 ` William Lee Irwin III
2002-07-25 5:47 ` Andrew Morton
2002-07-25 5:42 ` William Lee Irwin III
2002-07-25 5:59 ` Andrew Morton
2002-07-25 7:09 ` Andrew Morton
2002-07-26 7:33 ` Daniel Phillips