* 2.5 page cache improvement idea
@ 2001-02-26 23:46 Ben LaHaise
2001-02-27 0:41 ` Christoph Hellwig
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Ben LaHaise @ 2001-02-26 23:46 UTC (permalink / raw)
To: linux-mm
Hey folks,
Here's an idea I just bounced off of Rik that seems like it would be
pretty useful. Currently the page cache hash is system wide. For 2.5,
I'm suggesting that we make the page cache hash a per-inode structure and
possibly move the page index and mapping into that structure.
Also, for dealing with hash collisions (which are going to happen under
certain well known circumstances), we could move to a b*tree structure
hanging off of the hashes. So we'd have a data structure that looks like
the following:
inode
-> hash table
-> struct page, index, mapping
-> head of b*tree for overflow
page
-> pointer back to hash bucket/b*tree entry
These changes would replace ~20 bytes in struct page with one pointer.
Now, continuing along with making struct page smaller, we can blast away
the wait queue and replace it with either a tiny-waitqueue (4 bytes) or
make use of hashed wait queues (0 bytes per page). That would save
another 8-12 bytes. Now, add in a couple of additional space savers like
making the zone pointer an index, and eliminating the virtual pointer, and
we have a struct page that's less than 32 bytes (we could even leave the
index/mapping in that way).
Tiny waitqueues are an idea based on the fact that we never have more than
~65536 waiters in the system (typically much less -> ~# of tasks). They
replace the whole spinlock/next/prev structure with a single long that
contains the index of the wait structure in a table in the high and low
words. By making use of cmpxchg on x86, one doesn't need spinlocks to
update this structure.
These are just a couple of quick ideas that I'll try to implement at some
point... Let me know of any thoughts on the matter.
-ben
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
* Re: 2.5 page cache improvement idea
2001-02-26 23:46 2.5 page cache improvement idea Ben LaHaise
@ 2001-02-27 0:41 ` Christoph Hellwig
2001-02-27 2:42 ` Chuck Lever
2001-02-27 11:52 ` Stephen C. Tweedie
2 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2001-02-27 0:41 UTC (permalink / raw)
To: Ben LaHaise; +Cc: linux-mm
On Mon, Feb 26, 2001 at 06:46:24PM -0500, Ben LaHaise wrote:
> Hey folks,
>
> Here's an idea I just bounced off of Rik that seems like it would be
> pretty useful. Currently the page cache hash is system wide. For 2.5,
> I'm suggesting that we make the page cache hash a per-inode structure and
> possibly move the page index and mapping into that structure.
> Also, for dealing with hash collisions (which are going to happen under
> certain well known circumstances), we could move to a b*tree structure
> hanging off of the hashes. So we'd have a data structure that looks like
> the following:
>
>
> inode
Shouldn't this be address_space instead?
>
> -> hash table
> -> struct page, index, mapping
> -> head of b*tree for overflow
>
> page
> -> pointer back to hash bucket/b*tree entry
>
> These changes would replace ~20 bytes in struct page with one pointer.
Looks sane - the elimination of a system-wide resource should improve
scalability a lot. If it comes together with a size reduction of a major
structure, only the side effects need some thoughts :P
Christoph
--
Of course it doesn't work. We've performed a software upgrade.
* Re: 2.5 page cache improvement idea
2001-02-26 23:46 2.5 page cache improvement idea Ben LaHaise
2001-02-27 0:41 ` Christoph Hellwig
@ 2001-02-27 2:42 ` Chuck Lever
2001-02-27 2:49 ` Ben LaHaise
2001-02-27 11:52 ` Stephen C. Tweedie
2 siblings, 1 reply; 10+ messages in thread
From: Chuck Lever @ 2001-02-27 2:42 UTC (permalink / raw)
To: Ben LaHaise, linux-mm
didn't Andrea Arcangeli try this with red-black trees a couple of years
ago?
instead of hashing or b*trees, let me recommend self-organizing data
structures. using a splay tree might improve locality of reference and
optimize the tree so that frequently used pages appear close to the root.
----- Original Message -----
From: Ben LaHaise <bcrl@redhat.com>
To: <linux-mm@kvack.org>
Sent: Monday, February 26, 2001 6:46 PM
Subject: 2.5 page cache improvement idea
> Hey folks,
>
> Here's an idea I just bounced off of Rik that seems like it would be
> pretty useful. Currently the page cache hash is system wide. For 2.5,
> I'm suggesting that we make the page cache hash a per-inode structure and
> possibly move the page index and mapping into that structure.
> Also, for dealing with hash collisions (which are going to happen under
> certain well known circumstances), we could move to a b*tree structure
> hanging off of the hashes. So we'd have a data structure that looks like
> the following:
>
>
> inode
> -> hash table
> -> struct page, index, mapping
> -> head of b*tree for overflow
>
> page
> -> pointer back to hash bucket/b*tree entry
>
> These changes would replace ~20 bytes in struct page with one pointer.
> Now, continuing along with making struct page smaller, we can blast away
> the wait queue and replace it with either a tiny-waitqueue (4 bytes) or
> make use of hashed wait queues (0 bytes per page). That would save
> another 8-12 bytes. Now, add in a couple of additional space savers like
> making the zone pointer an index, and eliminating the virtual pointer, and
> we have a struct page that's less than 32 bytes (we could even leave the
> index/mapping in that way).
* Re: 2.5 page cache improvement idea
2001-02-27 2:42 ` Chuck Lever
@ 2001-02-27 2:49 ` Ben LaHaise
2001-02-27 3:26 ` Gerrit Huizenga
0 siblings, 1 reply; 10+ messages in thread
From: Ben LaHaise @ 2001-02-27 2:49 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-mm
On Mon, 26 Feb 2001, Chuck Lever wrote:
> didn't Andrea Arcangeli try this with red-black trees a couple of years
> ago?
Yes. However, keep in mind that red-black trees are not particularly well
suited to the cache design of most modern cpus.
> instead of hashing or b*trees, let me recommend self-organizing data
> structures. using a splay tree might improve locality of reference and
> optimize the tree so that frequently used pages appear close to the
> root.
Splay trees might well be an option; however, keep in mind the benefits of
the hybrid hash/b*tree structure: for the typical case of a small number
of entries (ie a small file), the hash buckets will provide optimal O(1)
performance. Once we move into larger data files, the operation incurs
slightly more overhead (but still tunable by increasing the hash size)
while remaining relatively low (2 cache misses for a two-level b*tree,
which maps a *lot* of data). The problem with something like a splay tree
is that it will generate bus traffic for the common case of reading from
the data file, which is something we don't want for a frequently accessed
mapping like glibc on a NUMA system. B*trees, while originally designed
for use on disks, apply quite well to modern cache hierarchies.
-ben
* Re: 2.5 page cache improvement idea
2001-02-27 2:49 ` Ben LaHaise
@ 2001-02-27 3:26 ` Gerrit Huizenga
2001-02-27 5:47 ` Kanoj Sarcar
2001-02-27 13:42 ` Andrea Arcangeli
0 siblings, 2 replies; 10+ messages in thread
From: Gerrit Huizenga @ 2001-02-27 3:26 UTC (permalink / raw)
To: Ben LaHaise; +Cc: Chuck Lever, linux-mm
If you are considering NUMA architectures as a case of frequently
accessed pages, e.g. glibc or text pages of commonly used executables,
it is probably better to do page replication per node on demand than
to worry about optimizing the page lookups for limited bus traffic.
Most NUMA machines are relatively rich in physical memory, and cross
node traffic is relatively expensive. As a result, wasting a small
number of physical pages on duplicate read-only pages cuts down node
to node traffic in most cases. Many NUMA systems have a cache for
remote memory (some cache only remote pages, some cache local and remote
pages in the same cache - icky but cheaper). As that cache cycles,
it is cheaper to replace read-only text pages from the local node
rather than the remote. So, for things like kernel text (e.g. one of
the SGI patches) and for glibc's text, as well as the text of other
common shared libraries, it winds up being a significant win to replicate
those text pages (on demand!) in local memory.
gerrit
> On Mon, 26 Feb 2001, Chuck Lever wrote:
>
> > didn't Andrea Arcangeli try this with red-black trees a couple of years
> > ago?
>
> Yes. However, keep in mind that red-black trees are not particularly well
> suited to the cache design of most modern cpus.
>
> > instead of hashing or b*trees, let me recommend self-organizing data
> > structures. using a splay tree might improve locality of reference and
> > optimize the tree so that frequently used pages appear close to the
> > root.
>
> Splay trees might well be an option; however, keep in mind the benefits of
> the hybrid hash/b*tree structure: for the typical case of a small number
> of entries (ie a small file), the hash buckets will provide optimal O(1)
> performance. Once we move into larger data files, the operation incurs
> slightly more overhead (but still tunable by increasing the hash size)
> while remaining relatively low (2 cache misses for a two-level b*tree,
> which maps a *lot* of data). The problem with something like a splay tree
> is that it will generate bus traffic for the common case of reading from
> the data file, which is something we don't want for a frequently accessed
> mapping like glibc on a NUMA system. B*trees, while originally designed
> for use on disks, apply quite well to modern cache hierarchies.
>
> -ben
>
* Re: 2.5 page cache improvement idea
2001-02-27 3:26 ` Gerrit Huizenga
@ 2001-02-27 5:47 ` Kanoj Sarcar
2001-02-27 9:05 ` Gerrit Huizenga
2001-02-27 13:42 ` Andrea Arcangeli
1 sibling, 1 reply; 10+ messages in thread
From: Kanoj Sarcar @ 2001-02-27 5:47 UTC (permalink / raw)
To: gerrit; +Cc: Ben LaHaise, Chuck Lever, linux-mm
> node traffic is relatively expensive. As a result, wasting a small
> number of physical pages on duplicate read-only pages cuts down node
> to node traffic in most cases. Many NUMA systems have a cache for
> remote memory (some cache only remote pages, some cache local and remote
> pages in the same cache - icky but cheaper). As that cache cycles,
> it is cheaper to replace read-only text pages from the local node
> rather than the remote. So, for things like kernel text (e.g. one of
> the SGI patches) and for glibc's text, as well as the text of other
The mips64 port for the SGI O2000 uses kernel text replication, which has
been part of 2.3/2.4 for a long time now. Is there another patch you
are talking about here?
Kanoj
* Re: 2.5 page cache improvement idea
2001-02-27 5:47 ` Kanoj Sarcar
@ 2001-02-27 9:05 ` Gerrit Huizenga
2001-02-27 9:21 ` Kanoj Sarcar
0 siblings, 1 reply; 10+ messages in thread
From: Gerrit Huizenga @ 2001-02-27 9:05 UTC (permalink / raw)
To: Kanoj Sarcar; +Cc: Ben LaHaise, Chuck Lever, linux-mm
That change is platform specific, isn't it? I thought there was also
a recent IA-64 patch in progress for the same thing, but I might
be mistaken. I'm thinking that it would be useful if the machine
independent code supported kernel text replication as well as
shared/read-only text replication for user level applications.
gerrit
> > node traffic is relatively expensive. As a result, wasting a small
> > number of physical pages on duplicate read-only pages cuts down node
> > to node traffic in most cases. Many NUMA systems have a cache for
> > remote memory (some cache only remote pages, some cache local and remote
> > pages in the same cache - icky but cheaper). As that cache cycles,
> > it is cheaper to replace read-only text pages from the local node
> > rather than the remote. So, for things like kernel text (e.g. one of
> > the SGI patches) and for glibc's text, as well as the text of other
>
> The mips64 port onto SGI o2000 uses kernel text replication, that has
> been part of 2.3/2.4 for a long time now. Is there another patch you
> are talking about here?
>
> Kanoj
* Re: 2.5 page cache improvement idea
2001-02-27 9:05 ` Gerrit Huizenga
@ 2001-02-27 9:21 ` Kanoj Sarcar
0 siblings, 0 replies; 10+ messages in thread
From: Kanoj Sarcar @ 2001-02-27 9:21 UTC (permalink / raw)
To: gerrit; +Cc: Ben LaHaise, Chuck Lever, linux-mm
> That change is platform specific, isn't it? I thought there was also
Yes, completely.
> a recent IA-64 patch in progress for the same thing, but I might
Yes, a couple of people are planning on attempting that soon, we will
see when someone gets to it.
> be mistaken. I'm thinking that it would be useful if the machine
> independent code supported kernel text replication as well as
There *might* be some changes needed for kernel text replication
(mostly to deal with /proc/kcore types of things), but for the most
part, this is really arch specific. You know, those types of things
that tend to show up mostly in production environments, rather than
at the prototype/testing stages.
> shared/read-only text replication for user level applications.
>
User level apps, now that's a different beast. I *suspect* a lot
of work needs to be put into replication heuristics before you can
reap benefits from it. Currently, this is on my pet NUMA priority list,
but not one of the top few.
Kanoj
> gerrit
>
* Re: 2.5 page cache improvement idea
2001-02-27 3:26 ` Gerrit Huizenga
2001-02-27 5:47 ` Kanoj Sarcar
@ 2001-02-27 13:42 ` Andrea Arcangeli
1 sibling, 0 replies; 10+ messages in thread
From: Andrea Arcangeli @ 2001-02-27 13:42 UTC (permalink / raw)
To: Gerrit Huizenga; +Cc: Ben LaHaise, Chuck Lever, linux-mm
On Mon, Feb 26, 2001 at 07:26:18PM -0800, Gerrit Huizenga wrote:
> If you are considering NUMA architectures as a case of frequently
> accessed pages, e.g. glibc or text pages of commonly used executables,
> it is probably better to do page replication per node on demand than
> to worry about optimizing the page lookups for limited bus traffic.
>
> Most NUMA machines are relatively rich in physical memory, and cross
> node traffic is relatively expensive. As a result, wasting a small
> number of physical pages on duplicate read-only pages cuts down node
> to node traffic in most cases. Many NUMA systems have a cache for
> remote memory (some cache only remote pages, some cache local and remote
> pages in the same cache - icky but cheaper). As that cache cycles,
> it is cheaper to replace read-only text pages from the local node
> rather than the remote. So, for things like kernel text (e.g. one of
> the SGI patches) and for glibc's text, as well as the text of other
> common shared libraries, it winds up being a significant win to replicate
> those text pages (on demand!) in local memory.
Agreed.
Andrea
* Re: 2.5 page cache improvement idea
2001-02-26 23:46 2.5 page cache improvement idea Ben LaHaise
2001-02-27 0:41 ` Christoph Hellwig
2001-02-27 2:42 ` Chuck Lever
@ 2001-02-27 11:52 ` Stephen C. Tweedie
2 siblings, 0 replies; 10+ messages in thread
From: Stephen C. Tweedie @ 2001-02-27 11:52 UTC (permalink / raw)
To: Ben LaHaise; +Cc: linux-mm
Hi,
On Mon, Feb 26, 2001 at 06:46:24PM -0500, Ben LaHaise wrote:
>
> inode
> -> hash table
> -> struct page, index, mapping
> -> head of b*tree for overflow
Isn't this going to bloat the size of the inode itself horribly,
though? You don't know in advance how much data you'll be caching
against any given inode, so I can only see this working if you use
dynamic hashing (in which case the btree overflow goes away).
Cheers,
Stephen
Thread overview: 10+ messages
2001-02-26 23:46 2.5 page cache improvement idea Ben LaHaise
2001-02-27 0:41 ` Christoph Hellwig
2001-02-27 2:42 ` Chuck Lever
2001-02-27 2:49 ` Ben LaHaise
2001-02-27 3:26 ` Gerrit Huizenga
2001-02-27 5:47 ` Kanoj Sarcar
2001-02-27 9:05 ` Gerrit Huizenga
2001-02-27 9:21 ` Kanoj Sarcar
2001-02-27 13:42 ` Andrea Arcangeli
2001-02-27 11:52 ` Stephen C. Tweedie