* page_add/remove_rmap costs
@ 2002-07-24 6:33 Andrew Morton
2002-07-24 6:48 ` William Lee Irwin III
` (3 more replies)
0 siblings, 4 replies; 20+ messages in thread
From: Andrew Morton @ 2002-07-24 6:33 UTC (permalink / raw)
To: linux-mm
Been taking a look at the page_add_rmap/page_remove_rmap cost in 2.5.27
on the quad pIII. The workload is ten instances of this script running
concurrently:
#!/bin/sh
doit()
{
	( cat $1 | wc -l )
}
count=0
while [ $count != 500 ]
do
	doit foo > /dev/null
	count=$(expr $count + 1)
done
echo done
It's just a ton of forking and exiting.
With rmap oprofile says:
./doitlots.sh 10 41.67s user 95.04s system 398% cpu 34.338 total
c0133030 317 1.07963 __free_pages_ok
c0131428 375 1.27716 kmem_cache_free
c01342d0 432 1.47129 free_page_and_swap_cache
c01281d0 461 1.57006 clear_page_tables
c013118c 462 1.57346 kmem_cache_alloc
c012a08c 470 1.60071 handle_mm_fault
c0113e50 504 1.7165 pte_alloc_one
c0107b68 512 1.74375 page_fault
c012c718 583 1.98556 find_get_page
c01332cc 650 2.21375 rmqueue
c0129bc4 807 2.74845 do_anonymous_page
c013396c 851 2.8983 page_cache_release
c0129db0 1124 3.82808 do_no_page
c0128750 1164 3.96431 zap_pte_range
c01284f8 1374 4.67952 copy_page_range
c013a994 1590 5.41516 page_add_rmap
c013aa5c 3739 12.7341 page_remove_rmap
c01293bc 5106 17.3898 do_wp_page
And without rmap it says:
./doitlots.sh 10 43.01s user 76.19s system 394% cpu 30.222 total
c013074c 238 1.20592 lru_cache_add
c0144e90 251 1.27179 link_path_walk
c0112a64 252 1.27685 do_page_fault
c0132b0c 252 1.27685 free_page_and_swap_cache
c01388b4 261 1.32246 do_page_cache_readahead
c01e0700 296 1.4998 radix_tree_lookup
c01263e0 300 1.52006 clear_page_tables
c01319b0 302 1.5302 __free_pages_ok
c01079f8 395 2.00142 page_fault
c012a8a8 396 2.00649 find_get_page
c01127d4 401 2.03182 pte_alloc_one
c0131ca0 451 2.28516 rmqueue
c0127cc8 774 3.92177 do_anonymous_page
c013230c 933 4.7274 page_cache_release
c0126880 964 4.88448 zap_pte_range
c012662c 1013 5.13275 copy_page_range
c0127e70 1138 5.76611 do_no_page
c012750c 4485 22.725 do_wp_page
So that's a ton of CPU time lost playing with pte chains.
I instrumented it all up with the `debug.patch' from
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.27/
./doitlots.sh 5
(gdb) p rmap_stats
$3 = {page_add_rmap = 4106673, page_add_rmap_nope = 20009, page_add_rmap_1st = 479482,
page_add_rmap_2nd = 357745, page_add_rmap_3rd = 3618421, page_remove_rmap = 4119825,
page_remove_rmap_1st = 479263, page_remove_rmap_nope = 20001, add_put_dirty_page = 8742,
add_copy_page_range = 2774954, add_do_wp_page = 272151, add_do_swap_page = 0, add_do_anonymous_page = 93689,
add_do_no_page = 1029498, add_copy_one_pte = 0, remove_zap_pte_range = 3863194, remove_do_wp_page = 272880,
remove_copy_one_pte = 0, do_no_page = 1119244, do_swap_page = 0, do_wp_page = 423034,
nr_copy_page_ranges = 174521, nr_forks = 12477}
What we see here is:
- We did 12477 forks
- those forks called copy_page_range() 174,521 times in total
- Of the 4,106,673 calls to page_add_rmap, 2,774,954 came from
copy_page_range and 1,029,498 came from do_no_page.
- Of the 4,119,825 calls to page_remove_rmap(), 3,863,194 came
from zap_page_range().
So it's pretty much all happening in fork() and exit().
Instruction-level profiling of page_add_rmap shows:
c013ab24 4074 8.63026 0 0 page_add_rmap
c013ab24 11 0.270005 0 0
c013ab25 67 1.64458 0 0
c013ab28 12 0.294551 0 0
c013ab29 12 0.294551 0 0
c013ab2b 1 0.0245459 0 0
c013ab2d 8 0.196367 0 0
c013ab38 313 7.68287 0 0
c013ab3e 7 0.171821 0 0
c013ab40 1 0.0245459 0 0
c013ab43 6 0.147275 0 0
c013ab46 1 0.0245459 0 0
c013ab4e 5 0.12273 0 0
c013ab53 5 0.12273 0 0
c013ab58 7 0.171821 0 0
c013ab5d 1364 33.4806 0 0 (pte_chain_lock)
c013ab61 4 0.0981836 0 0
c013ab63 13 0.319097 0 0
c013ab66 14 0.343643 0 0
c013ab69 1 0.0245459 0 0
c013ab6b 17 0.41728 0 0
c013ab70 2 0.0490918 0 0
c013ab73 41 1.00638 0 0
c013ab78 8 0.196367 0 0
c013ab7a 1 0.0245459 0 0
c013ab7f 4 0.0981836 0 0
c013ab84 16 0.392734 0 0
c013ab87 1 0.0245459 0 0
c013ab9a 102 2.50368 0 0
c013aba0 33 0.810015 0 0
c013aba4 33 0.810015 0 0
c013aba6 3 0.0736377 0 0
c013abab 13 0.319097 0 0
c013abb0 8 0.196367 0 0
c013abb3 2 0.0490918 0 0
c013abb5 7 0.171821 0 0
c013abb8 2 0.0490918 0 0
c013abbe 247 6.06284 0 0
c013abc0 1 0.0245459 0 0
c013abc3 6 0.147275 0 0
c013abcd 55 1.35002 0 0
c013abd3 39 0.95729 0 0
c013abd8 1 0.0245459 0 0
c013abdd 1468 36.0334 0 0 (pte_chain_unlock)
c013abe7 46 1.12911 0 0
c013abea 9 0.220913 0 0
c013abf0 42 1.03093 0 0
c013abf4 4 0.0981836 0 0
c013abf5 3 0.0736377 0 0
c013abf8 8 0.196367 0 0
And page_remove_rmap():
c013abfc 6600 13.9813 0 0 page_remove_rmap
c013abfc 5 0.0757576 0 0
c013abfd 10 0.151515 0 0
c013ac00 1 0.0151515 0 0
c013ac01 23 0.348485 0 0
c013ac02 21 0.318182 0 0
c013ac06 2 0.030303 0 0
c013ac08 1 0.0151515 0 0
c013ac11 339 5.13636 0 0
c013ac17 9 0.136364 0 0
c013ac20 1 0.0151515 0 0
c013ac26 5 0.0757576 0 0
c013ac2b 5 0.0757576 0 0
c013ac36 1 0.0151515 0 0
c013ac40 20 0.30303 0 0
c013ac45 3 0.0454545 0 0
c013ac4a 2399 36.3485 0 0 (The pte_chain_lock)
c013ac50 18 0.272727 0 0
c013ac53 13 0.19697 0 0
c013ac58 15 0.227273 0 0
c013ac60 3 0.0454545 0 0
c013ac63 50 0.757576 0 0
c013ac68 28 0.424242 0 0
c013ac6d 6 0.0909091 0 0
c013ac80 32 0.484848 0 0
c013ac86 42 0.636364 0 0
c013ac94 3 0.0454545 0 0
c013ac97 11 0.166667 0 0
c013ac99 3 0.0454545 0 0
c013ac9b 1 0.0151515 0 0
c013aca0 10 0.151515 0 0 (The `for (pc = page->pte.chain)' loop)
c013aca3 2633 39.8939 0 0
c013aca5 5 0.0757576 0 0
c013aca6 23 0.348485 0 0
c013aca7 2 0.030303 0 0
c013aca8 2 0.030303 0 0
c013acad 15 0.227273 0 0
c013acb0 29 0.439394 0 0
c013acb6 218 3.30303 0 0
c013acbb 2 0.030303 0 0
c013acbe 3 0.0454545 0 0
c013acc3 20 0.30303 0 0
c013accd 1 0.0151515 0 0
c013acd0 6 0.0909091 0 0
c013acd2 2 0.030303 0 0
c013acd4 2 0.030303 0 0
c013acd6 12 0.181818 0 0
c013ace0 7 0.106061 0 0
c013ace5 1 0.0151515 0 0
c013acea 6 0.0909091 0 0
c013acf3 34 0.515152 0 0
c013acf8 1 0.0151515 0 0
c013acfd 411 6.22727 0 0 (Probably the pte_chain_unlock)
c013ad03 4 0.0606061 0 0
c013ad04 57 0.863636 0 0
c013ad05 10 0.151515 0 0
c013ad06 6 0.0909091 0 0
c013ad09 8 0.121212 0 0
The page_add_rmap() one is interesting - the pte_chain_unlock() is as expensive
as the pte_chain_lock(). Which would tend to indicate either that the page->flags
has expired from cache or some other CPU has stolen it.
It is interesting to note that the length of the pte_chain is not a big
factor in all of this. So changing the singly-linked list to something
else probably won't help much.
Instrumentation of pte_chain_lock() shows:
nr_chain_locks = 8152300
nr_chain_lock_contends = 22436
nr_chain_lock_spins = 1946858
So the lock is only contended 0.3% of the time. And when it _is_
contended, the waiting CPU spins an average of 87 loops.
Which leads one to conclude that the page->flags has just been
naturally evicted from cache. So the next obvious step is to move a
lot of code out of the locked regions.
debug.patch moves the kmem_cache_alloc() and kmem_cache_free() calls
outside the locked region. But it doesn't help.
So I don't know why the pte_chain_unlock() is so expensive in there.
But even if it could be fixed, we're still too slow.
My gut feel here is that this will be hard to tweak - some algorithmic
change will be needed.
The pte_chains are doing precisely zilch but chew CPU cycles with this
workload. The machine has 2G of memory free. The rmap is pure overhead.
Would it be possible to not build the pte_chain _at all_ until it is
actually needed? Do it lazily? So in the page reclaim code, if the
page has no rmap chain we go off and build it then? This would require
something like a pfn->pte lookup function at the vma level, and a
page->vmas_which_own_me lookup.
Nice thing about this is that a) we already have page->flags
exclusively owned at that time, so the pte_chain_lock() _should_ be
cheap. And b) if the rmap chain is built in this way, all the
pte_chain structures against a page will have good
locality-of-reference, so the chain walk will involve far fewer cache
misses.
Then again, if the per-vma pfn->pte lookup is feasible, we may not need
the pte_chain at all...
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
* Re: page_add/remove_rmap costs
From: William Lee Irwin III @ 2002-07-24 6:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
On Tue, Jul 23, 2002 at 11:33:20PM -0700, Andrew Morton wrote:
> So I don't know why the pte_chain_unlock() is so expensive in there.
> But even if it could be fixed, we're still too slow.
> My gut feel here is that this will be hard to tweak - some algorithmic
> change will be needed.
Atomic operation on a cold/unowned/falsely shared cache line. The
operation needs to be avoided when possible.
On Tue, Jul 23, 2002 at 11:33:20PM -0700, Andrew Morton wrote:
> The pte_chains are doing precisely zilch but chew CPU cycles with this
> workload. The machine has 2G of memory free. The rmap is pure overhead.
> Would it be possible to not build the pte_chain _at all_ until it is
> actually needed? Do it lazily? So in the page reclaim code, if the
> page has no rmap chain we go off and build it then? This would require
> something like a pfn->pte lookup function at the vma level, and a
> page->vmas_which_own_me lookup.
The space overhead of keeping them up to date can be mitigated, but
this time overhead can't be circumvented so long as strict per-pte
updates are required. I'm uncertain about lazy construction of them; I
suspect it will raise OOM issues (allocating in order to evict) and
that the chains will often be constructed only never to be used again,
but I'm not sure.
On Tue, Jul 23, 2002 at 11:33:20PM -0700, Andrew Morton wrote:
> Nice thing about this is that a) we already have page->flags
> exclusively owned at that time, so the pte_chain_lock() _should_ be
> cheap. And b) if the rmap chain is built in this way, all the
> pte_chain structures against a page will have good
> locality-of-reference, so the chain walk will involve far fewer cache
> misses.
This is a less invasive proposal than various others that have been
going around, and could probably be tried and tested quickly.
Cheers,
Bill
* Re: page_add/remove_rmap costs
From: Rik van Riel @ 2002-07-24 16:24 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
On Tue, 23 Jul 2002, Andrew Morton wrote:
> It's just a ton of forking and exitting.
And exec()ing ...
> What we see here is:
>
> - We did 12477 forks
> - those forks called copy_page_range() 174,521 times in total
> - Of the 4,106,673 calls to page_add_rmap, 2,774,954 came from
> copy_page_range and 1,029,498 came from do_no_page.
> - Of the 4,119,825 calls to page_remove_rmap(), 3,863,194 came
> from zap_page_range().
>
> So it's pretty much all happening in fork() and exit().
And exec() ... In fact, I suspect that about half of the calls
to page_remove_rmap() are coming via exec().
> The page_add_rmap() one is interesting - the pte_chain_unlock() is as
> expensive as the pte_chain_lock(). Which would tend to indicate either
> that the page->flags has expired from cache or some other CPU has stolen
> it.
>
> It is interesting to note that the length of the pte_chain is not a big
> factor in all of this. So changing the singly-linked list to something
> else probably won't help much.
This is more disturbing ... ;)
> My gut feel here is that this will be hard to tweak - some algorithmic
> change will be needed.
>
> The pte_chains are doing precisely zilch but chew CPU cycles with this
> workload. The machine has 2G of memory free. The rmap is pure overhead.
>
> Would it be possible to not build the pte_chain _at all_ until it is
> actually needed? Do it lazily? So in the page reclaim code, if the
> page has no rmap chain we go off and build it then? This would require
> something like a pfn->pte lookup function at the vma level, and a
> page->vmas_which_own_me lookup.
> Then again, if the per-vma pfn->pte lookup is feasible, we may not need
> the pte_chain at all...
It is feasible, both davem and bcrl made code to this effect. The
only problem with that code is that it gets ugly quick after mremap.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
* Re: page_add/remove_rmap costs
From: Andrew Morton @ 2002-07-24 20:15 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm
Rik van Riel wrote:
>
> ...
> > It is interesting to note that the length of the pte_chain is not a big
> > factor in all of this. So changing the singly-linked list to something
> > else probably won't help much.
>
> This is more disturbing ... ;)
Well yes. It may well indicate that my test is mostly LIFO
on the pte chains. So FIFO workloads would be worse.
> > My gut feel here is that this will be hard to tweak - some algorithmic
> > change will be needed.
> >
> > The pte_chains are doing precisely zilch but chew CPU cycles with this
> > workload. The machine has 2G of memory free. The rmap is pure overhead.
> >
> > Would it be possible to not build the pte_chain _at all_ until it is
> > actually needed? Do it lazily? So in the page reclaim code, if the
> > page has no rmap chain we go off and build it then? This would require
> > something like a pfn->pte lookup function at the vma level, and a
> > page->vmas_which_own_me lookup.
>
> > Then again, if the per-vma pfn->pte lookup is feasible, we may not need
> > the pte_chain at all...
>
> It is feasible, both davem and bcrl made code to this effect. The
> only problem with that code is that it gets ugly quick after mremap.
So.. who's going to do it?
It's early days yet - although this looks bad on benchmarks we really
need a better understanding of _why_ it's so bad, and of whether it
really matters for real workloads.
For example: given that copy_page_range performs atomic ops against
page->count, how come page_add_rmap()'s atomic op against page->flags
is more of a problem?
* Re: page_add/remove_rmap costs
From: Rik van Riel @ 2002-07-24 20:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
On Wed, 24 Jul 2002, Andrew Morton wrote:
> > > Then again, if the per-vma pfn->pte lookup is feasible, we may not need
> > > the pte_chain at all...
> >
> > It is feasible, both davem and bcrl made code to this effect. The
> > only problem with that code is that it gets ugly quick after mremap.
>
> So.. who's going to do it?
>
> It's early days yet - although this looks bad on benchmarks we really
> need a better understanding of _why_ it's so bad, and of whether it
> really matters for real workloads.
I guess I'll take a stab at bcrl's and davem's code and will
try to also hide it behind an rmap.c interface ;)
> For example: given that copy_page_range performs atomic ops against
> page->count, how come page_add_rmap()'s atomic op against page->flags
> is more of a problem?
Could it have something to do with cpu_relax() delaying
things ?
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
* Re: page_add/remove_rmap costs
From: Andrew Morton @ 2002-07-24 20:28 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm
Rik van Riel wrote:
>
> On Wed, 24 Jul 2002, Andrew Morton wrote:
>
> > > > Then again, if the per-vma pfn->pte lookup is feasible, we may not need
> > > > the pte_chain at all...
> > >
> > > It is feasible, both davem and bcrl made code to this effect. The
> > > only problem with that code is that it gets ugly quick after mremap.
> >
> > So.. who's going to do it?
> >
> > It's early days yet - although this looks bad on benchmarks we really
> > need a better understanding of _why_ it's so bad, and of whether it
> > really matters for real workloads.
>
> I guess I'll take a stab at bcrl's and davem's code and will
> try to also hide it behind an rmap.c interface ;)
hmm, OK. Big job...
> > For example: given that copy_page_range performs atomic ops against
> > page->count, how come page_add_rmap()'s atomic op against page->flags
> > is more of a problem?
>
> Could it have something to do with cpu_relax() delaying
> things ?
Don't think so. That's only executed on the contended case, which
is 0.3% of the time. But hey, it's easy enough to remove it and retest.
I shall do that.
* Re: page_add/remove_rmap costs
From: Rik van Riel @ 2002-07-25 2:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
On Wed, 24 Jul 2002, Andrew Morton wrote:
> Rik van Riel wrote:
> > On Wed, 24 Jul 2002, Andrew Morton wrote:
> >
> > I guess I'll take a stab at bcrl's and davem's code and will
> > try to also hide it behind an rmap.c interface ;)
>
> hmm, OK. Big job...
Absolutely, not a short term thing. In the short term
I'll split out the remainder of Craig Kulesa's big patch
and will send you bits and pieces.
> > > For example: given that copy_page_range performs atomic ops against
> > > page->count, how come page_add_rmap()'s atomic op against page->flags
> > > is more of a problem?
> >
> > Could it have something to do with cpu_relax() delaying
> > things ?
>
> Don't think so. That's only executed on the contended case,
You're right.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
* Re: page_add/remove_rmap costs
From: William Lee Irwin III @ 2002-07-25 2:45 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, linux-mm
On Tue, 23 Jul 2002, Andrew Morton wrote:
>> Then again, if the per-vma pfn->pte lookup is feasible, we may not need
>> the pte_chain at all...
On Wed, Jul 24, 2002 at 01:24:13PM -0300, Rik van Riel wrote:
> It is feasible, both davem and bcrl made code to this effect. The
> only problem with that code is that it gets ugly quick after mremap.
I actually took an axe to mremap recently, and although the pieces
never came back together into working code, it's clear that it's far
from optimal. It's doing a virtual sweep over the region and repeating
the pgd -> pmd -> pte traversals for each pte. So invading that territory
may well be justifiable on more grounds than rmap itself.
I may revisit mremap.c at some point in the distant future if that
tweaking alone is considered valuable.
Cheers,
Bill
* Re: page_add/remove_rmap costs
From: William Lee Irwin III @ 2002-07-25 3:08 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, linux-mm
On Wed, Jul 24, 2002 at 01:15:10PM -0700, Andrew Morton wrote:
> So.. who's going to do it?
> It's early days yet - although this looks bad on benchmarks we really
> need a better understanding of _why_ it's so bad, and of whether it
> really matters for real workloads.
> For example: given that copy_page_range performs atomic ops against
> page->count, how come page_add_rmap()'s atomic op against page->flags
> is more of a problem?
Hmm. It probably isn't harming more than benchmarks, but the loop is
pure bloat on UP. #ifdef that out someday. (Heck, don't even touch the
bit for UP except for debugging.)
Hypothesis:
There are too many cachelines to gain exclusive ownership of. It's not
the aggregate arrival rate, it's the aggregate cacheline-claiming
bandwidth needed to get exclusive ownership of all the pages' ->flags.
Experiment 1:
Group pages into blocks of say 2 or 4 for locality, and then hash each
pageblock to a lock. The worst case wrt. claiming cachelines is then
the size of the hash table divided by the size of the lock, but the
potential for cacheline contention exists.
Experiment 2:
Move ->flags to be adjacent to ->count and align struct page to a
divisor of the cacheline size or play tricks to get it down to 32B. =)
Experiment 3:
Compare magic oprofile perfcounter stuff between 2.5.26 and 2.5.27
and do divination based on whatever the cache counters say.
Cheers,
Bill
* Re: page_add/remove_rmap costs
From: Martin J. Bligh @ 2002-07-25 3:14 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton; +Cc: Rik van Riel, linux-mm
> On Wed, Jul 24, 2002 at 01:15:10PM -0700, Andrew Morton wrote:
>> So.. who's going to do it?
>> It's early days yet - although this looks bad on benchmarks we really
>> need a better understanding of _why_ it's so bad, and of whether it
>> really matters for real workloads.
>> For example: given that copy_page_range performs atomic ops against
>> page->count, how come page_add_rmap()'s atomic op against page->flags
>> is more of a problem?
If it's bouncing the lock cacheline around that's suspected to be
the problem, might it be faster to take a per-zone lock, rather
than a per-page lock, and batch the work up? Maybe we used a little
too much explosive when we broke up the global lock?
M.
* Re: page_add/remove_rmap costs
From: Andrew Morton @ 2002-07-25 4:21 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Rik van Riel, linux-mm
William Lee Irwin III wrote:
>
> On Wed, Jul 24, 2002 at 01:15:10PM -0700, Andrew Morton wrote:
> > So.. who's going to do it?
> > It's early days yet - although this looks bad on benchmarks we really
> > need a better understanding of _why_ it's so bad, and of whether it
> > really matters for real workloads.
> > For example: given that copy_page_range performs atomic ops against
> > page->count, how come page_add_rmap()'s atomic op against page->flags
> > is more of a problem?
>
> Hmm. It probably isn't harming more than benchmarks, but the loop is
> pure bloat on UP. #ifdef that out someday. (Heck, don't even touch the
> bit for UP except for debugging.)
>
> Hypothesis:
> There are too many cachelines to gain exclusive ownership of. It's not
> the aggregate arrival rate, it's the aggregate cacheline-claiming
> bandwidth needed to get exclusive ownership of all the pages' ->flags.
Yup. But one would expect the access to lighten a subsequent
access to the page frame, so the aggregate cost would
be small. It's odd.
It'd be nice to see some hard numbers from a P4, or a PPC64
or something. I'm still wondering why the cost of the pte_chain_unlock()
is so high in page_remove_rmap(). That line should have still been
exclusively owned, but the PIII is going off-chip for some reason.
Is this general, or a peculiarity?
> Experiment 1:
> Group pages into blocks of say 2 or 4 for locality, and then hash each
> pageblock to a lock. The worst case wrt. claiming cachelines is then
> the size of the hash table divided by the size of the lock, but the
> potential for cacheline contention exists.
We could afford to do that. It'd take a bit of reorganising to hold a lock
across multiple page_add_rmap() calls though.
> Experiment 2:
> Move ->flags to be adjacent to ->count and align struct page to a
> divisor of the cacheline size or play tricks to get it down to 32B. =)
Oh crap. I thought I'd done that ages ago.
Whee. Moving page->flags to the zeroth offset shrunk linux
by 110 bytes!
> Experiment 3:
> Compare magic oprofile perfcounter stuff between 2.5.26 and 2.5.27
> and do divination based on whatever the cache counters say.
Using divine intervention is cheating.
* Re: page_add/remove_rmap costs
From: William Lee Irwin III @ 2002-07-25 4:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
On Tue, Jul 23, 2002 at 11:33:20PM -0700, Andrew Morton wrote:
> Been taking a look at the page_add_rmap/page_remove_rmap cost in 2.5.27
> on the quad pIII. The workload is ten instances of this script running
> concurrently:
The workload is 16 instances of the same script running on a 16 cpu NUMA-Q
with 16GB of RAM. oprofile results attached.
Cheers,
Bill
c0105340 3309367 51.0125 default_idle /boot/vmlinux-2.5.28-3
c0135667 1095488 16.8865 .text.lock.page_alloc /boot/vmlinux-2.5.28-3
00000fec 475107 7.32358 dump_one /lib/modules/2.5.28-3/kern
el/arch/i386/oprofile/oprofile.o
c0129c10 349662 5.3899 do_anonymous_page /boot/vmlinux-2.5.28-3
c01353e4 236718 3.64891 get_page_state /boot/vmlinux-2.5.28-3
c0112a84 213189 3.28622 load_balance /boot/vmlinux-2.5.28-3
c013d31c 71599 1.10367 page_add_rmap /boot/vmlinux-2.5.28-3
c013d3cc 67122 1.03466 page_remove_rmap /boot/vmlinux-2.5.28-3
c013d85a 64302 0.991189 .text.lock.rmap /boot/vmlinux-2.5.28-3
c013b2a0 49197 0.758352 blk_queue_bounce /boot/vmlinux-2.5.28-3
c0129eb4 47963 0.73933 do_no_page /boot/vmlinux-2.5.28-3
c010fa60 41383 0.637902 smp_apic_timer_interrupt /boot/vmlinux-2.5.28-3
c0134ba0 39661 0.611358 rmqueue /boot/vmlinux-2.5.28-3
c019f890 32770 0.505136 serial_in /boot/vmlinux-2.5.28-3
c0133714 28913 0.445682 lru_cache_add /boot/vmlinux-2.5.28-3
c0134840 27436 0.422915 __free_pages_ok /boot/vmlinux-2.5.28-3
c013d71c 25028 0.385796 pte_chain_alloc /boot/vmlinux-2.5.28-3
c01963c0 24580 0.378891 __generic_copy_to_user /boot/vmlinux-2.5.28-3
c0196408 19186 0.295744 __generic_copy_from_user /boot/vmlinux-2.5.28-3
c013d7b4 18805 0.289871 pte_chain_free /boot/vmlinux-2.5.28-3
c01338cb 16173 0.2493 .text.lock.swap /boot/vmlinux-2.5.28-3
c01281e0 12649 0.194979 zap_pte_range /boot/vmlinux-2.5.28-3
c012d30c 11764 0.181337 file_read_actor /boot/vmlinux-2.5.28-3
c012a230 11738 0.180936 handle_mm_fault /boot/vmlinux-2.5.28-3
c0135c10 11062 0.170516 free_page_and_swap_cache /boot/vmlinux-2.5.28-3
c0112f4c 10638 0.16398 scheduler_tick /boot/vmlinux-2.5.28-3
c0129164 10439 0.160913 do_wp_page /boot/vmlinux-2.5.28-3
c010d0f0 9551 0.147225 timer_interrupt /boot/vmlinux-2.5.28-3
c0133864 8974 0.138331 lru_cache_del /boot/vmlinux-2.5.28-3
c01350b8 6378 0.0983143 __alloc_pages /boot/vmlinux-2.5.28-3
c0140430 6215 0.0958017 get_empty_filp /boot/vmlinux-2.5.28-3
c01352b8 5968 0.0919943 page_cache_release /boot/vmlinux-2.5.28-3
c0135324 5119 0.0789073 nr_free_pages /boot/vmlinux-2.5.28-3
c012ce18 4951 0.0763177 find_get_page /boot/vmlinux-2.5.28-3
c01406c0 4217 0.0650033 __fput /boot/vmlinux-2.5.28-3
c014aaf8 4188 0.0645563 link_path_walk /boot/vmlinux-2.5.28-3
c014e828 3276 0.0504982 kill_fasync /boot/vmlinux-2.5.28-3
c013ab78 3085 0.047554 kmap_high /boot/vmlinux-2.5.28-3
c014da9d 3012 0.0464288 .text.lock.namei /boot/vmlinux-2.5.28-3
c013b573 2976 0.0458738 .text.lock.highmem /boot/vmlinux-2.5.28-3
c0120620 2886 0.0444865 update_one_process /boot/vmlinux-2.5.28-3
c0127e88 2847 0.0438853 copy_page_range /boot/vmlinux-2.5.28-3
c012a6b0 2721 0.0419431 vm_enough_memory /boot/vmlinux-2.5.28-3
c0110f34 2678 0.0412803 pgd_alloc /boot/vmlinux-2.5.28-3
c0154608 2394 0.0369025 __d_lookup /boot/vmlinux-2.5.28-3
c0107d50 2384 0.0367484 page_fault /boot/vmlinux-2.5.28-3
c0107b98 2236 0.034467 apic_timer_interrupt /boot/vmlinux-2.5.28-3
c0111080 2205 0.0339892 pte_alloc_one /boot/vmlinux-2.5.28-3
c013ead0 2170 0.0334497 dentry_open /boot/vmlinux-2.5.28-3
c0196650 2074 0.0319699 atomic_dec_and_lock /boot/vmlinux-2.5.28-3
* Re: page_add/remove_rmap costs
From: Andrew Morton @ 2002-07-25 5:14 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-mm
William Lee Irwin III wrote:
>
> On Tue, Jul 23, 2002 at 11:33:20PM -0700, Andrew Morton wrote:
> > Been taking a look at the page_add_rmap/page_remove_rmap cost in 2.5.27
> > on the quad pIII. The workload is ten instances of this script running
> > concurrently:
>
> The workload is 16 instances of the same script running on a 16 cpu NUMA-Q
> with 16GB of RAM. oprofile results attached.
These results look funny.
What I do is:
1) rm -rf /var/opd
2) start test
3) op_start --map-file=/boot/System.map --vmlinux=/boot/vmlinux --ctr0-event=CPU_CLK_UNHALTED --ctr0-count=600000
4) sleep 20
5) op_stop
6) oprofpp -l -i /boot/vmlinux
>
> c0105340 3309367 51.0125 default_idle /boot/vmlinux-2.5.28-3
How come?
> c0135667 1095488 16.8865 .text.lock.page_alloc /boot/vmlinux-2.5.28-3
zone->lock?
> 00000fec 475107 7.32358 dump_one /lib/modules/2.5.28-3/kern
that's part of oprofile.
> el/arch/i386/oprofile/oprofile.o
> c0129c10 349662 5.3899 do_anonymous_page /boot/vmlinux-2.5.28-3
OK.
> c01353e4 236718 3.64891 get_page_state /boot/vmlinux-2.5.28-3
whoa. Who's calling that so often? Any patches applied there?
> c0112a84 213189 3.28622 load_balance /boot/vmlinux-2.5.28-3
I thought you'd disabled this?
> c013d31c 71599 1.10367 page_add_rmap /boot/vmlinux-2.5.28-3
> c013d3cc 67122 1.03466 page_remove_rmap /boot/vmlinux-2.5.28-3
page_add_rmap is more expensive than page_remove_rmap.
So again, the list length isn't the #1 problem.
> c013d85a 64302 0.991189 .text.lock.rmap /boot/vmlinux-2.5.28-3
pte_chain_freelist_lock?
* Re: page_add/remove_rmap costs
2002-07-25 5:14 ` Andrew Morton
@ 2002-07-25 5:15 ` John Levon
2002-07-25 5:30 ` William Lee Irwin III
2002-07-25 5:47 ` Andrew Morton
0 siblings, 2 replies; 20+ messages in thread
From: John Levon @ 2002-07-25 5:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, linux-mm
On Wed, Jul 24, 2002 at 10:14:37PM -0700, Andrew Morton wrote:
> > c0135667 1095488 16.8865 .text.lock.page_alloc /boot/vmlinux-2.5.28-3
>
> zone->lock?
I wrote a patch some time ago to remove all this guesswork on lock call
sites :
http://marc.theaimsgroup.com/?l=linux-kernel&m=101586797421268&w=2
It seemed to work quite well with my limited testing on my 2-way ...
(pity it macrofies stuff)
> > c0112a84 213189 3.28622 load_balance /boot/vmlinux-2.5.28-3
>
> I thought you'd disabled this?
Maybe wli used "op_session", and this was from a previous run. oprofile
< 0.3 had a bug where the vmlinux samples file wasn't moved.
regards
john
--
"Hungarian notation is the tactical nuclear weapon of source code obfuscation
techniques."
- Roedy Green
* Re: page_add/remove_rmap costs
2002-07-25 5:15 ` John Levon
@ 2002-07-25 5:30 ` William Lee Irwin III
2002-07-25 5:47 ` Andrew Morton
1 sibling, 0 replies; 20+ messages in thread
From: William Lee Irwin III @ 2002-07-25 5:30 UTC (permalink / raw)
To: John Levon; +Cc: Andrew Morton, linux-mm
On Thu, Jul 25, 2002 at 06:15:52AM +0100, John Levon wrote:
> Maybe wli used "op_session", and this was from a previous run. oprofile
> < 0.3 had a bug where the vmlinux samples file wasn't moved.
I used an explicit session file argument.
Cheers,
Bill
* Re: page_add/remove_rmap costs
2002-07-25 5:47 ` Andrew Morton
@ 2002-07-25 5:42 ` William Lee Irwin III
2002-07-25 5:59 ` Andrew Morton
0 siblings, 1 reply; 20+ messages in thread
From: William Lee Irwin III @ 2002-07-25 5:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: John Levon, linux-mm
John Levon wrote:
>> I wrote a patch some time ago to remove all this guesswork on lock call
>> sites :
>>
On Wed, Jul 24, 2002 at 10:47:47PM -0700, Andrew Morton wrote:
> Me too, but I just killed all the out-of-line gunk, so the cost
> is shown at the actual callsite.
It will be applied shortly. I've also been building with -g, so addr2line
will resolve the rest given appropriate dumping formats.
What's the op_time / oprofpp command that gives per-EIP sample frequencies?
Cheers,
Bill
* Re: page_add/remove_rmap costs
2002-07-25 5:15 ` John Levon
2002-07-25 5:30 ` William Lee Irwin III
@ 2002-07-25 5:47 ` Andrew Morton
2002-07-25 5:42 ` William Lee Irwin III
1 sibling, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2002-07-25 5:47 UTC (permalink / raw)
To: John Levon; +Cc: William Lee Irwin III, linux-mm
John Levon wrote:
>
> On Wed, Jul 24, 2002 at 10:14:37PM -0700, Andrew Morton wrote:
>
> > > c0135667 1095488 16.8865 .text.lock.page_alloc /boot/vmlinux-2.5.28-3
> >
> > zone->lock?
>
> I wrote a patch some time ago to remove all this guesswork on lock call
> sites :
>
Me too, but I just killed all the out-of-line gunk, so the cost
is shown at the actual callsite.
--- 2.5.24/include/asm-i386/spinlock.h~spinlock-inline Fri Jun 21 13:12:01 2002
+++ 2.5.24-akpm/include/asm-i386/spinlock.h Fri Jun 21 13:18:12 2002
@@ -46,13 +46,13 @@ typedef struct {
"\n1:\t" \
"lock ; decb %0\n\t" \
"js 2f\n" \
- LOCK_SECTION_START("") \
+ "jmp 3f\n" \
"2:\t" \
"cmpb $0,%0\n\t" \
"rep;nop\n\t" \
"jle 2b\n\t" \
"jmp 1b\n" \
- LOCK_SECTION_END
+ "3:\t" \
/*
* This works. Despite all the confusion.
--- 2.5.24/include/asm-i386/rwlock.h~spinlock-inline Fri Jun 21 13:18:33 2002
+++ 2.5.24-akpm/include/asm-i386/rwlock.h Fri Jun 21 13:22:09 2002
@@ -22,25 +22,19 @@
#define __build_read_lock_ptr(rw, helper) \
asm volatile(LOCK "subl $1,(%0)\n\t" \
- "js 2f\n" \
- "1:\n" \
- LOCK_SECTION_START("") \
- "2:\tcall " helper "\n\t" \
- "jmp 1b\n" \
- LOCK_SECTION_END \
+ "jns 1f\n\t" \
+ "call " helper "\n\t" \
+ "1:\t" \
::"a" (rw) : "memory")
#define __build_read_lock_const(rw, helper) \
asm volatile(LOCK "subl $1,%0\n\t" \
- "js 2f\n" \
- "1:\n" \
- LOCK_SECTION_START("") \
- "2:\tpushl %%eax\n\t" \
+ "jns 1f\n\t" \
+ "pushl %%eax\n\t" \
"leal %0,%%eax\n\t" \
"call " helper "\n\t" \
"popl %%eax\n\t" \
- "jmp 1b\n" \
- LOCK_SECTION_END \
+ "1:\t" \
:"=m" (*(volatile int *)rw) : : "memory")
#define __build_read_lock(rw, helper) do { \
@@ -52,25 +46,19 @@
#define __build_write_lock_ptr(rw, helper) \
asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",(%0)\n\t" \
- "jnz 2f\n" \
+ "jz 1f\n\t" \
+ "call " helper "\n\t" \
"1:\n" \
- LOCK_SECTION_START("") \
- "2:\tcall " helper "\n\t" \
- "jmp 1b\n" \
- LOCK_SECTION_END \
::"a" (rw) : "memory")
#define __build_write_lock_const(rw, helper) \
asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",%0\n\t" \
- "jnz 2f\n" \
- "1:\n" \
- LOCK_SECTION_START("") \
- "2:\tpushl %%eax\n\t" \
+ "jz 1f\n\t" \
+ "pushl %%eax\n\t" \
"leal %0,%%eax\n\t" \
"call " helper "\n\t" \
"popl %%eax\n\t" \
- "jmp 1b\n" \
- LOCK_SECTION_END \
+ "1:\n" \
:"=m" (*(volatile int *)rw) : : "memory")
#define __build_write_lock(rw, helper) do { \
-
* Re: page_add/remove_rmap costs
2002-07-25 5:42 ` William Lee Irwin III
@ 2002-07-25 5:59 ` Andrew Morton
0 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2002-07-25 5:59 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: John Levon, linux-mm
William Lee Irwin III wrote:
>
> John Levon wrote:
> >> I wrote a patch some time ago to remove all this guesswork on lock call
> >> sites :
> >>
>
> On Wed, Jul 24, 2002 at 10:47:47PM -0700, Andrew Morton wrote:
> > Me too, but I just killed all the out-of-line gunk, so the cost
> > is shown at the actual callsite.
>
> It will be applied shortly. I've also been building with -g, so addr2line
> will resolve the rest given appropriate dumping formats.
Hope it still works.
> What's the op_time / oprofpp command that gives per-EIP sample frequencies?
I use
oprofpp -L -i /boot/vmlinux
oprofile can also allegedly do eip->file-n-line resolution,
but I'm not sure how that works when you're cross-building.
And generally I doubt it's useful for kernel stuff, because
the EIP usually resolves to something like test_and_set_bit().
So I just fire up gdb on vmlinux and walk up and down a few bytes until
the address->line resolution falls out of the inline function and
into the caller.
-
* Re: page_add/remove_rmap costs
2002-07-25 4:50 ` William Lee Irwin III
2002-07-25 5:14 ` Andrew Morton
@ 2002-07-25 7:09 ` Andrew Morton
1 sibling, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2002-07-25 7:09 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-mm
well I tried a few things.
- Disable the pte_chain_lock stuff for uniprocessor
builds.
- Disable the cpu_relax()
- shuffle struct page to put ->flags and ->count next
to each other.
Uniprocessor:
c01c9138 122 0.96649 strnlen_user
c0145860 162 1.28337 __d_lookup
c012c2c4 179 1.41805 rmqueue
c010ba68 180 1.42597 timer_interrupt
c01120ec 190 1.50519 do_page_fault
c013d910 191 1.51311 link_path_walk
c01052c8 219 1.73493 poll_idle
c0132e44 227 1.7983 page_add_rmap
c0122e00 237 1.87753 clear_page_tables
c0111e40 264 2.09142 pte_alloc_one
c0123018 287 2.27363 copy_page_range
c0124324 296 2.34493 do_anonymous_page
c012aa70 471 3.73128 kmem_cache_alloc
c0123224 483 3.82635 zap_pte_range
c01077c4 484 3.83427 page_fault
c0124490 547 4.33336 do_no_page
c012ac5c 560 4.43635 kmem_cache_free
c0132f1c 940 7.44672 page_remove_rmap
c0123cb0 2581 20.4468 do_wp_page
So page_add_rmap went away.
page_remove_rmap:
c0132f8a 1 0.106383 0 0
c0132f8d 1 0.106383 0 0
c0132f93 1 0.106383 0 0
c0132fa4 3 0.319149 0 0
c0132fa7 56 5.95745 0 0 the `for' loop
c0132fa9 2 0.212766 0 0
c0132fab 4 0.425532 0 0
c0132fb0 13 1.38298 0 0
c0132fb3 574 61.0638 0 0 if (pc->ptep == ptep)
c0132fb5 1 0.106383 0 0
c0132fb6 13 1.38298 0 0
c0132fb9 2 0.212766 0 0
c0132fba 4 0.425532 0 0
And the page_remove_rmap cost is now in the list walk.
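The walk in question looks roughly like this (a user-space sketch; the struct and field names only approximate the 2.5 rmap code, and the kernel allocates pte_chain nodes from a slab, not malloc):

```c
#include <stdlib.h>

/* Simplified model of the 2.5-era rmap pte_chain: each struct page
 * carries a singly linked list of pte_chain entries, one per pte
 * that currently maps the page. */
typedef unsigned long pte_t;

struct pte_chain {
	struct pte_chain *next;
	pte_t *ptep;
};

struct page {
	struct pte_chain *pte_chain;
};

/* The hot path the profile points at: scan the chain until
 * pc->ptep == ptep, then unlink and free that entry.  Cost is
 * linear in the number of mappings of the page. */
static void page_remove_rmap(struct page *page, pte_t *ptep)
{
	struct pte_chain **pcp;

	for (pcp = &page->pte_chain; *pcp; pcp = &(*pcp)->next) {
		if ((*pcp)->ptep == ptep) {	/* the 61% comparison */
			struct pte_chain *pc = *pcp;
			*pcp = pc->next;
			free(pc);
			return;
		}
	}
}
```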
But the SMP performance is unaltered by these changes.
c0129818 1329 2.42966 do_anonymous_page
c01338dc 1501 2.74411 page_cache_release
c0129a10 2157 3.9434 do_no_page
c0128128 2581 4.71855 copy_page_range
c0128390 2655 4.85384 zap_pte_range
c013a944 4356 7.96358 page_add_rmap
c013aaa0 8423 15.3988 page_remove_rmap
c0128ff8 8457 15.461 do_wp_page
For page_remove_rmap, 32% is the pte_chain_lock, 35%
is the list walk and 12% is the pte_chain_unlock.
* Re: page_add/remove_rmap costs
2002-07-24 6:33 page_add/remove_rmap costs Andrew Morton
` (2 preceding siblings ...)
2002-07-25 4:50 ` William Lee Irwin III
@ 2002-07-26 7:33 ` Daniel Phillips
3 siblings, 0 replies; 20+ messages in thread
From: Daniel Phillips @ 2002-07-26 7:33 UTC (permalink / raw)
To: Andrew Morton, linux-mm; +Cc: Paul Mackerras
On Wednesday 24 July 2002 08:33, Andrew Morton wrote:
> With rmap oprofile says:
>
> ./doitlots.sh 10 41.67s user 95.04s system 398% cpu 34.338 total
>
> [...]
>
> And without rmap it says:
>
> ./doitlots.sh 10 43.01s user 76.19s system 394% cpu 30.222 total
>
> [...]
>
> What we see here is:
>
> - We did 12477 forks
> - those forks called copy_page_range() 174,521 times in total
> - Of the 4,106,673 calls to page_add_rmap, 2,774,954 came from
> copy_page_range and 1,029,498 came from do_no_page.
> - Of the 4,119,825 calls to page_remove_rmap(), 3,863,194 came
> from zap_page_range().
>
> [...]
>
> So it's pretty much all happening in fork() and exit().
> My gut feel here is that this will be hard to tweak - some algorithmic
> change will be needed.
Indeed. This is why I developed the refcount-based page table sharing technique
earlier this year: to eliminate the cost of setting up and tearing down
pte_chains that never get called upon to do anything useful. There's still
some work to do on the patch:
nl.linux.org/~phillips/patches/ptab-2.4.17-3
But the interesting part works.
I now know how to do the tlb invalidate on unmap efficiently; in fact Linus
knew right away at the time, but I had to work my way through some basics to
understand what he was going on about. In short, we need to chain the pte
pages to the mms they belong to. Each mm (already) carries a bitmap of
processors the mm is active on, so we just OR all those bitmaps together and
call the flavor of interprocessor tlb invalidate that operates on the
resulting bitmap.
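A user-space sketch of that scheme (the cpu_vm_mask and share-chain field names are illustrative, not taken from the actual patch):

```c
/* Each mm already carries a bitmap of CPUs it is active on.  For a
 * shared pte page, OR together the bitmaps of every mm the pte page
 * is chained to, and send a single cross-CPU tlb invalidate to the
 * resulting mask instead of one per mm. */
typedef unsigned long cpumask_t;

struct mm {
	cpumask_t cpu_vm_mask;	/* CPUs this mm is active on */
	struct mm *next_share;	/* chain of mms sharing the pte page */
};

static cpumask_t tlb_mask_for_shared_ptepage(struct mm *mm_chain)
{
	cpumask_t mask = 0;

	for (; mm_chain; mm_chain = mm_chain->next_share)
		mask |= mm_chain->cpu_vm_mask;
	return mask;	/* hand this to the IPI-based flush */
}
```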
The same optimization as for pte_chains applies: if the pte page isn't
shared, we can set a bit in the page->flags and point directly at the mm, or
we can use the vma, which is conveniently hanging around when needed. I
prefer the former because it's more forward looking: we should be able to
dispense entirely with looking up the vma in some common situations. Also,
it's more symmetric with the existing page pte_chain code.
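One way to encode that unshared case, sketched in user space with a low pointer tag standing in for the proposed page->flags bit (all names are illustrative; the pointee's word alignment is what frees the low bit):

```c
#include <stdint.h>

/* When the pte page is mapped by only one mm, avoid allocating a
 * chain node: tag the stored value as "direct" and point straight
 * at the mm.  This mirrors rmap's existing direct-pte optimization. */
#define PTEPAGE_DIRECT 1UL

static uintptr_t make_direct(void *mm)
{
	return (uintptr_t)mm | PTEPAGE_DIRECT;
}

static int is_direct(uintptr_t v)
{
	return (v & PTEPAGE_DIRECT) != 0;
}

static void *direct_mm(uintptr_t v)
{
	return (void *)(v & ~PTEPAGE_DIRECT);
}
```

An untagged value would then be interpreted as a pointer to the head of a share chain, exactly as the page pte_chain code distinguishes its direct case.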
There is also Linus's suggestion for eliminating most (all?) of the locking
in my patch, which has the side effect of doing early reclaim of page tables.
Paul Mackerras did some work on this patch and was easily able to produce a
functional version of it, though I wouldn't call it the most elegant thing in
the world: he unshares the page tables on swap-out (ugh) and swapoff (who
cares). But I don't think he really knew the relationship between page table
sharing and rmap. It should be clear now.
--
Daniel
end of thread, other threads:[~2002-07-26 7:33 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-07-24 6:33 page_add/remove_rmap costs Andrew Morton
2002-07-24 6:48 ` William Lee Irwin III
2002-07-24 16:24 ` Rik van Riel
2002-07-24 20:15 ` Andrew Morton
2002-07-24 20:21 ` Rik van Riel
2002-07-24 20:28 ` Andrew Morton
2002-07-25 2:35 ` Rik van Riel
2002-07-25 3:08 ` William Lee Irwin III
2002-07-25 3:14 ` Martin J. Bligh
2002-07-25 4:21 ` Andrew Morton
2002-07-25 2:45 ` William Lee Irwin III
2002-07-25 4:50 ` William Lee Irwin III
2002-07-25 5:14 ` Andrew Morton
2002-07-25 5:15 ` John Levon
2002-07-25 5:30 ` William Lee Irwin III
2002-07-25 5:47 ` Andrew Morton
2002-07-25 5:42 ` William Lee Irwin III
2002-07-25 5:59 ` Andrew Morton
2002-07-25 7:09 ` Andrew Morton
2002-07-26 7:33 ` Daniel Phillips