[rfc] lockless pagecache

From: Nick Piggin <nickpiggin@yahoo.com.au>
To: linux-kernel <linux-kernel@vger.kernel.org>,
	Linux Memory Management <linux-mm@kvack.org>
Subject: [rfc] lockless pagecache
Date: Mon, 27 Jun 2005 16:29:37 +1000
Message-ID: <42BF9CD1.2030102@yahoo.com.au>

Hi,

This is going to be a fairly long and probably incoherent post. The
idea and implementation are not completely analysed for holes, and
I wouldn't be surprised if some (even fatal ones) exist.

That said, I wanted something to talk about at Ottawa and I think
this is a promising idea - it is at the stage where it would be good
to have interested parties pick it apart. BTW, this is my main reason
for the PageReserved removal patches, so if this falls apart then
some good will have come from it! :)

OK, so my aim is to remove the requirement to take mapping->tree_lock
when looking up pagecache pages (e.g. for a read/write or nopage fault).
Note that this does not deal with insertion and removal of pages from
pagecache mappings - that is usually a slower path operation associated
with IO or page reclaim or truncate. However if there was interest in
making these paths more scalable, there are possibilities for that too.
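
To make the read side concrete, here is a minimal userspace sketch of
what a lookup without tree_lock could look like. It is not the code
from the patches: struct page, struct mapping, the flat slot array
standing in for the radix tree, and the names page_get_unless_zero,
page_put and lookup_lockless are all hypothetical, and it assumes page
structures are never freed back to the allocator (as with the kernel's
mem_map) and that a remover clears the slot before dropping its own
reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's definitions. */
struct page {
    atomic_int refcount;    /* 0 means free or in the middle of being freed */
    unsigned long index;
};

#define NR_SLOTS 16

struct mapping {
    /* A flat array stands in for the radix tree in this sketch. */
    _Atomic(struct page *) slots[NR_SLOTS];
};

/* Take a reference only if the count has not already dropped to zero. */
static bool page_get_unless_zero(struct page *page)
{
    int old = atomic_load(&page->refcount);

    while (old != 0) {
        /* On failure the CAS reloads 'old' with the current count. */
        if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
            return true;
    }
    return false;
}

static void page_put(struct page *page)
{
    int old = atomic_fetch_sub(&page->refcount, 1);

    assert(old > 0);    /* dropping a reference we never held is a bug */
}

/*
 * Look up a page without taking any lock:
 *   1. read the slot,
 *   2. speculatively take a reference,
 *   3. re-read the slot to check the page was not removed or replaced
 *      between steps 1 and 2; if it was, drop the reference and retry.
 */
static struct page *lookup_lockless(struct mapping *m, unsigned long index)
{
    if (index >= NR_SLOTS)
        return NULL;

    for (;;) {
        struct page *page = atomic_load(&m->slots[index]);

        if (page == NULL)
            return NULL;
        if (!page_get_unless_zero(page))
            continue;   /* page is being freed, re-read the slot */
        if (atomic_load(&m->slots[index]) == page)
            return page;    /* reference held, slot unchanged */
        page_put(page);     /* lost a race with removal, retry */
    }
}
```

The point of the sketch is only that the reader never writes to a
shared lock word; the single shared write is to the refcount of the
page it ends up holding.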

What for? Well there are probably lots of reasons, but suppose you have
a big app with lots of processes all mmapping and playing around with
various parts of the same big file (say, a shared memory file), then
you might start seeing problems if you want to scale this workload up
to say 32+ CPUs.

Now the tree_lock was recently(ish) converted to an rwlock, precisely
for such a workload and that was apparently very successful. However
an rwlock is significantly heavier, and as machines get faster and
bigger, rwlocks (and any locks) will tend to use more and more of Paul
McKenney's toilet paper due to cacheline bouncing.

So in the interest of saving some trees, let's try it without any locks.

First I'll put up some numbers to get you interested - of a 64-way Altix
with 64 processes each read-faulting in their own 512MB part of a 32GB
file that is preloaded in pagecache (with the proper NUMA memory
allocation).

[best of 5 runs]

plain 2.6.12-git4:
  1 proc    0.65u   1.43s 2.09e   99%CPU
 64 proc    0.75u 291.30s 4.92e 5927%CPU

64 proc prof:
3242763 total                                       0.5366
1269413 _read_unlock_irq                        19834.5781
 842042 do_no_page                                355.5921
 779373 cond_resched                             3479.3438
 100667 ia64_pal_call_static                      524.3073
  96469 _spin_lock                               1004.8854
  92857 default_idle                              241.8151
  25572 filemap_nopage                             15.6691
  11981 ia64_load_scratch_fpregs                  187.2031
  11671 ia64_save_scratch_fpregs                  182.3594
   2566 page_fault                                  2.5867

It has slowed by a factor of roughly 2.4x (2.09s to 4.92s elapsed) when
going from serial to 64-way, and it is due to mapping->tree_lock. Serial
is even at the disadvantage of reading from remote memory 62 times out
of 64.

2.6.12-git4-lockless:
  1 proc    0.66u   1.38s 2.04e   99%CPU
 64 proc    0.68u   1.42s 0.12e 1686%CPU

64 proc prof:
  81934 total                                      0.0136
  31108 ia64_pal_call_static                     162.0208
  28394 default_idle                              73.9427
   3796 ia64_save_scratch_fpregs                  59.3125
   3736 ia64_load_scratch_fpregs                  58.3750
   2208 page_fault                                 2.2258
   1380 unmap_vmas                                 0.3292
   1298 __mod_page_state                           8.1125
   1089 do_no_page                                 0.4599
    830 find_get_page                              2.5938
    781 ia64_do_page_fault                         0.2805

So we have increased performance 17x (2.04s to 0.12s elapsed) when going
from 1 to 64 way. However, if you look at the CPU utilisation figure and
the elapsed time, you'll see my test didn't provide enough work to keep
all CPUs busy; for the amount of CPU time used, we appear to have perfect
scalability. In fact, it is slightly superlinear, probably due to remote
memory access on the serial run.
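
For anyone who wants to check the arithmetic, here is a small C sketch
that derives the speedup and the per-busy-CPU efficiency from the four
timing lines above. The struct and function names are made up for
illustration, and "efficiency" here is my shorthand for elapsed-time
speedup divided by the number of CPUs the %CPU column says were
actually kept busy.

```c
#include <assert.h>
#include <stdio.h>

/* One line of the timing tables: user, system, elapsed, %CPU. */
struct run {
    double user_s;
    double sys_s;
    double elapsed_s;
    double cpu_percent;
};

/* Figures copied from the tables in this mail. */
static const struct run plain_1     = { 0.65,   1.43, 2.09,   99.0 };
static const struct run plain_64    = { 0.75, 291.30, 4.92, 5927.0 };
static const struct run lockless_1  = { 0.66,   1.38, 2.04,   99.0 };
static const struct run lockless_64 = { 0.68,   1.42, 0.12, 1686.0 };

/* Elapsed-time speedup of the parallel run over the serial run. */
static double speedup(struct run serial, struct run parallel)
{
    return serial.elapsed_s / parallel.elapsed_s;
}

/* Speedup per CPU actually kept busy; 1.0 would be exactly linear. */
static double efficiency(struct run serial, struct run parallel)
{
    double busy_cpus = parallel.cpu_percent / 100.0;

    return speedup(serial, parallel) / busy_cpus;
}

/* Print one summary line for a serial/parallel pair. */
static void report(const char *name, struct run serial, struct run parallel)
{
    printf("%-10s speedup %6.3f  efficiency %6.4f\n",
           name, speedup(serial, parallel), efficiency(serial, parallel));
}
```

Calling report("plain", plain_1, plain_64) and report("lockless",
lockless_1, lockless_64) gives a speedup of about 0.42 (i.e. a 2.35x
slowdown) for the plain kernel, and a speedup of about 17.0 with an
efficiency just over 1.0 for the lockless one, which is where the
"slightly superlinear" remark comes from.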

I'll reply to this post with the series of commented patches which is
probably the best way to explain how it is done. They are against
2.6.12-git4 + some future iteration of the PageReserved patches. I
can provide the complete rollup privately on request.

Comments, flames, laughing me out of town, etc. are all very welcome.

Nick

-- 
SUSE Labs, Novell Inc.


