* [patch] arca-vm-2.2.5
@ 1999-04-01 23:32 Andrea Arcangeli
1999-04-04 21:07 ` Chuck Lever
0 siblings, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-01 23:32 UTC (permalink / raw)
To: linux-kernel, linux-mm; +Cc: Chuck Lever, Stephen C. Tweedie
Well, in the last few days I had some new design ideas for the VM (I mean
shrink_mmap() and friends). I have finished implementing them, and the
results look impressive under heavy VM load.
I would like people who run Linux under high VM load to try out
my new VM code.
ftp://e-mind.com/pub/linux/arca-tree/2.2.5_arca2.gz
If you try it out, please send feedback...
Thanks!
Andrea Arcangeli
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
* Re: [patch] arca-vm-2.2.5
1999-04-01 23:32 [patch] arca-vm-2.2.5 Andrea Arcangeli
@ 1999-04-04 21:07 ` Chuck Lever
1999-04-05 0:22 ` Andrea Arcangeli
0 siblings, 1 reply; 51+ messages in thread
From: Chuck Lever @ 1999-04-04 21:07 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel, linux-mm, Stephen C. Tweedie
On Fri, 2 Apr 1999, Andrea Arcangeli wrote:
> Well, in the last few days I had some new design ideas for the VM (I mean
> shrink_mmap() and friends). I have finished implementing them, and the
> results look impressive under heavy VM load.
>
> I would like people who run Linux under high VM load to try out
> my new VM code.
>
> ftp://e-mind.com/pub/linux/arca-tree/2.2.5_arca2.gz
i noticed a couple of differences with your original modifications, right
off the bat.
first, i notice you've altered the page hash function and quadrupled the
size of the hash table. do you have measurements/benchmarks that show
that the page hash was not working well? can you say how a plain 2.2.5
kernel compares to one that has just the page hash changes without the
rest of your VM modifications? the reason i ask is because i've played
with that hash table, and found most changes to it cause undesirable
increases in system CPU utilization. although, it *is* highly interesting
that the buffer hash table is orders of magnitude larger, yet hashes about
the same number of objects. can someone provide history on the design of
the page hash function?
also, can you tell what improvement you expect from the additional logic
in try_to_free_buffers()?
- Chuck Lever
--
corporate: <chuckl@netscape.com>
personal: <chucklever@netscape.net> or <cel@monkey.org>
The Linux Scalability project:
http://www.citi.umich.edu/projects/citi-netscape/
* Re: [patch] arca-vm-2.2.5
1999-04-04 21:07 ` Chuck Lever
@ 1999-04-05 0:22 ` Andrea Arcangeli
1999-04-05 13:23 ` Mark Hemment
` (2 more replies)
0 siblings, 3 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-05 0:22 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-kernel, linux-mm, Stephen C. Tweedie, Linus Torvalds
On Sun, 4 Apr 1999, Chuck Lever wrote:
>> ftp://e-mind.com/pub/linux/arca-tree/2.2.5_arca2.gz
*snip*
>first, i notice you've altered the page hash function and quadrupled the
The page hash function change is from Stephen (I adopted it here too because
I completely agreed with it). The point is that shm entries use the lower
bits of the pagemap->offset field.
>size of the hash table. do you have measurements/benchmarks that show
>that the page hash was not working well? can you say how a plain 2.2.5
The page_hash change looked to me like a quite obvious improvement while
swapping shm entries in/out (it should improve the swap cache queries), but
see my comment below...
>kernel compares to one that has just the page hash changes without the
>rest of your VM modifications? the reason i ask is because i've played
The reason is that it seemed an obvious improvement. And since the table is
statically allocated (not sized dynamically at boot as a function of the
memory size), a somewhat larger default can be desirable; I can safely spend
a bit more memory (a few tens of kbytes) without harming the
available memory here. Well, as I have said many times, I think someday we'll
need RB-trees instead of fuzzy hashes, but it's not a big issue right now
due to the low number of pages available.
Returning to your question: in my tree I enlarged the hash table to 13 bits.
This means that in the best case I'll be able to address up to 8192 pages
in O(1). Here I have 32752 pages, so at worst I'll have 4 pages chained on
every hash entry. 13 bits of hash depth costs 32k of memory for the
table (really not an issue ;).
In the stock kernel, instead, the hash size is 2^11 = 2048, so in the worst
case I would have 16 pages chained on the same hash entry.
>with that hash table, and found most changes to it cause undesirable
>increases in system CPU utilization. although, it *is* highly interesting
Swapping shm entries in/out is not as frequent a task as doing normal
queries on the page cache. So I am removing that change here. Thanks for the
info; I really hadn't thought about this...
For the record, this is the hash-function change we are talking about:
Index: pagemap.h
===================================================================
RCS file: /var/cvs/linux/include/linux/pagemap.h,v
retrieving revision 1.1.1.2
retrieving revision 1.1.2.12
diff -u -r1.1.1.2 -r1.1.2.12
--- pagemap.h 1999/01/23 16:29:55 1.1.1.2
+++ pagemap.h 1999/04/01 23:12:37 1.1.2.12
@@ -36,7 +35,7 @@
#define i (((unsigned long) inode)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
#define o (offset >> PAGE_SHIFT)
#define s(x) ((x)+((x)>>PAGE_HASH_BITS))
- return s(i+o) & (PAGE_HASH_SIZE-1);
+ return s(i+o+offset) & (PAGE_HASH_SIZE-1);
#undef i
#undef o
#undef s
>that the buffer hash table is orders of magnitude larger, yet hashes about
>the same number of objects. can someone provide history on the design of
>the page hash function?
I can't help you with this, but it looks OK to me ;). If somebody has done
the math on it I'd like to try to understand it.
>also, can you tell what improvement you expect from the additional logic
>in try_to_free_buffers() ?
Eh, my shrink_mmap() is black magic, and it would take long to explain my
reasoning ;). Well, one of the reasons is that ext2 keeps the superblock in
use all the time, so when I reach a used buffer I put it back at the top
of the LRU list; I don't want to fall into swapping just because there are
some unfreeable superblocks that live forever at the end of the pagemap
lru_list.
Note also (you didn't ask about it, but I bet you noticed it ;) that
in my tree I also made every pagemap entry L1 cacheline aligned. I asked the
people who were complaining about page colouring (and I still don't know
exactly what page colouring is; I only have a guess, but I would like to
read something about the implementation details; pointers???) to try out my
patch to see if it made a difference, but I had no feedback :(. I also
made the irq_state entry cacheline aligned (once I understood the
cacheline issue I agreed with it).
Many thanks for commenting on and reading my new experimental code (rock
solid here). I'll now release:
ftp://e-mind.com/pub/linux/arca-tree/2.2.5_arca4.bz2
It will have my latest work from the last few days (flushtime bugfix and
sane sysctl values included too) plus the old hash function, for
the reasons you have pointed out to me now.
If you find some spare time to try out the new patch, let me know the
numbers! ;))
Andrea Arcangeli
* Re: [patch] arca-vm-2.2.5
1999-04-05 0:22 ` Andrea Arcangeli
@ 1999-04-05 13:23 ` Mark Hemment
1999-04-05 15:56 ` Andrea Arcangeli
` (2 more replies)
1999-04-05 21:31 ` Chuck Lever
1999-04-06 14:00 ` Stephen C. Tweedie
2 siblings, 3 replies; 51+ messages in thread
From: Mark Hemment @ 1999-04-05 13:23 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel, linux-mm
Hi Andrea/All,
I'm just throwing some ideas around.....
> The page_hash change looked to me like a quite obvious improvement while
> swapping shm entries in/out (it should improve the swap cache queries), but
> see my comment below...
One worthwhile improvement to the page-hash is to reduce the need to
re-check the hash after a blocking operation (i.e. page allocation).
Changing the page-hash from an array of page ptrs to an array of:
typedef struct page_hash_s {
	struct page *ph_page;
	unsigned int ph_cookie;
} page_hash_t;
page_hash_t page_hash[PAGE_HASH_TABLE];
Whenever a new page is linked into a hash line, the ph_cookie for that
line is incremented.
Before a page allocation, take a copy of the cookie for the hash line
where the page will be inserted. If, after the allocation, the cookie
hasn't changed, then there is no reason to re-search the hash.
This does double the size of the page-hash, and would require profiling
to determine if it is worthwhile.
(Note: If the VM sub-system was threaded, there would be a different
solution than the above.)
> Note also (you didn't ask about it, but I bet you noticed it ;) that
> in my tree I also made every pagemap entry L1 cacheline aligned. I asked the
> people who were complaining about page colouring (and I still don't know
> exactly what page colouring is; I only have a guess, but I would like to
> read something about the implementation details; pointers???) to try out my
> patch to see if it made a difference, but I had no feedback :(. I also
> made the irq_state entry cacheline aligned (once I understood the
> cacheline issue I agreed with it).
Yikes!!!!
The page structure needs to be as small as possible. If its size
happens to L1-align, then that is great, but otherwise it isn't worth the
effort - the extra memory used to store the "padding" is much better used
elsewhere.
Most accesses to the page struct are reads, which means it can live in
the Shared state across multiple L1 caches. The "slightly" common
operation of incrementing the ref-count/changing flag bits doesn't really
come into play often enough to matter.
Keeping the struct small can result in the part of the page struct of
interest being in the L1 cache, along with part of the next one. As it isn't
a heavily modified structure, with no spin locks, "false sharing" isn't a
problem. Besides, the VM isn't threaded, so it isn't going to be playing
ping-pong with the cache lines anyway.
OK, with a structure smaller than a h/w cache line, whose size isn't
a power of 2 (assuming cache-line sizes only come in powers of 2), there may
need to be 2 cache-line fetches to load the required members of a struct.
The smaller the page struct, the less likely this is to happen.
If you want to reduce the page struct size, it is possible to unionize
the "inode" and "buffers" pointers into one member, indicating which is
active via bits in the flags field. (Note: if you make this change, the
flag bit doesn't need to be set for the Swap-Cache state - this is a
special case for anonymous pages - it looks like it drops out nicely.)
Adding an extra member to the page struct, to count I/O write operations on
a page, is useful. When a page is undergoing I/O, it has the PG_locked
bit set. Fine. If the I/O is a write, then in many cases there is no
reason to block waiting for the I/O to finish. E.g. under a fault for
page-not-present, provided the returned mapping for the page is read-only,
why wait for the write operation to finish? There is no need.
Why use a count for I/O write ops? Take the case of a WP fault. This
does a page allocation as its first operation, just so the following
operations occur atomically. It may be possible (I haven't checked it
through fully) to simply I/O-write-lock the faulting page; this (along
with some other code re-arrangement in do_wp_page() and shrink_mmap())
could allow the allocation to occur later - when we are more sure that it
is required.
Page-colouring is a bast*rd. Modifying the page allocator to take a
'suggested' colour into account isn't too difficult. I messed around with
this about 16 months ago (for added performance, I changed the coalescing
from eager to lazy). The problem end is page fragmentation. Some of the
fragmentation can be overcome with a more intelligent page reaper, but
going as far as adding pte-chaining is way, way over the top. Going the
BSD 4.4 route, with linked lists of shadow objects, is possible, and gets
very interesting. :)
Some of the fragmentation problems can be handled better by dropping the
static virt-to-phys mapping and making it dynamic. Allowing the page
allocator to give out virtually contiguous pages (as well as the current
physically contiguous pages) gives it more flexibility. This has a _lot_
of issues, not least with device drivers - either they have strict
requirements on the type of memory they perform I/O on (i.e. they accept
phys-contiguous memory only), or they need to be changed to perform
scatter/gather.
Regards,
Mark
* Re: [patch] arca-vm-2.2.5
1999-04-05 13:23 ` Mark Hemment
@ 1999-04-05 15:56 ` Andrea Arcangeli
1999-04-07 11:28 ` [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5] Andrea Arcangeli
1999-04-05 20:24 ` [patch] arca-vm-2.2.5 Horst von Brand
1999-04-17 11:12 ` Andrea Arcangeli
2 siblings, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-05 15:56 UTC (permalink / raw)
To: Mark Hemment; +Cc: linux-kernel, linux-mm
On Mon, 5 Apr 1999, Mark Hemment wrote:
>Hi Andrea/All,
>
> I'm just throwing some ideas around.....
>
>> The page_hash change looked to me like a quite obvious improvement while
>> swapping shm entries in/out (it should improve the swap cache queries), but
>> see my comment below...
>
> One worthwhile improvement to the page-hash is to reduce the need to
>re-check the hash after a blocking operation (ie. page allocation).
Yes, I just thought about that some time ago.
> Changing the page-hash from an array of page ptrs to an array of;
> typedef struct page_hash_s {
> struct page *ph_page;
> unsigned int ph_cookie;
> } page_hash_t;
>
> page_hash_t page_hash[PAGE_HASH_TABLE];
>
> Whenever a new page is linked into a hash line, the ph_cookie for that
>line is incremented.
> Before a page allocation, take a copy of the cookie for the hash line
>where the page will be inserted. If, after the allocation, the cookie
>hasn't changed, then there is no reason to re-search the hash.
>
> This does double the size of the page-hash, and would require profiling
>to determine if it is worthwhile.
I don't think it's worthwhile, simply because most of the time you'll have
to walk at _worst_ one or two entries of the hash chain, and walking
one or two entries will be far lighter than the new mechanism.
If you have gigabytes of cache you will be hurt performance-wise anyway,
and you'll enlarge the hash size anyway, and so you'll again have no need
of the mechanism.
But maybe I am plain wrong. If you try it, let me know the profiling numbers
(I am very curious). At least the benchmark will be trivial ;). Just do a
'time cat /usr/bin/* >/dev/null' with 100 mbytes of cache and then with 10
mbytes of memory usable for the cache, with the two kernels, and you'll see
if it's worthwhile.
> (Note: If the VM sub-system was threaded, there would be a different
>solution than the above.)
Which is the other solution? And what is the point of the threading issue?
We are only playing these games exactly because we may have to sleep to
allocate memory.
>> Note also (you didn't ask about it, but I bet you noticed it ;) that
>> in my tree I also made every pagemap entry L1 cacheline aligned. I asked the
>> people who were complaining about page colouring (and I still don't know
>> exactly what page colouring is; I only have a guess, but I would like to
>> read something about the implementation details; pointers???) to try out my
>> patch to see if it made a difference, but I had no feedback :(. I also
>> made the irq_state entry cacheline aligned (once I understood the
>> cacheline issue I agreed with it).
>
> Yikes!!!!
> The page structure needs to be as small as possible. If its size
>happens to L1-align, then that is great, but otherwise it isn't worth the
>effort - the extra memory used to store the "padding" is much better used
>elsewhere.
Note that on UP I am not enlarging the pagemap struct size. I will agree
that the padding is much better used elsewhere when somebody who
complains about page colouring shows me no difference in numbers
between L1 cache alignment and zero padding.
> Most accesses to the page struct are reads, which means it can live in
>the Shared state across multiple L1 caches. The "slightly" common
>operation of incrementing the ref-count/changing flag bits doesn't really
>come into play often enough to matter.
Also setting the swapcache/locked/uptodate and whatever SMP-locking flags,
etc. etc...
>problem. Besides, the VM isn't threaded, so it isn't going to be playing
>ping-pong with the cache lines anyway.
It's not threaded now, but I was just making shrink_mmap() threaded here.
With my new pagemap_lru it's trivial to make shrink_mmap() threaded (just
add some spinlocks to add_cache/del_cache/mkyoung() and release and
reacquire the kernel lock by hand).
> If you want to reduce the page struct size, it is possible to unionize
>the "inode" and "buffers" pointers into one member, indicating which is
>active via bits in the flag fields. (Note: If you do this change, the
Agreed!
>flag-bit doesn't need to be set for the Swap-Cache state - this is a
>special case for anonymous pages - looks like it drops out nicely.)
I think that PG_swap_cache is simply overhead, kept only to gain a bit of
performance. It's not really needed right now either, because you only need
to compare page->inode with &swapper_inode to know whether it's a swap cache
page or not. The ->buffers field has nothing to do with that.
> Adding an extra member to the page struct, to count I/O write operations on
>a page, is useful. When a page is undergoing I/O, it has the PG_locked
>bit set. Fine. If the I/O is a write, then in many cases there is no
>reason to block waiting for the I/O to finish. E.g. under a fault for
Agreed, it would make sense, even if I don't know how much difference this
will make, while I am sure that the complexity will increase a good bit...
>page-not-present, provided the returned mapping for the page is read-only,
>why wait for the write operation to finish? There is no need.
> Why use a count for I/O write ops? Take the case of a WP fault. This
>does a page allocation as its first operation, just so the following
>operations occur atomically. It may be possible (I haven't checked it
>through fully) to simply I/O-write-lock the faulting page; this (along
>with some other code re-arrangement in do_wp_page() and shrink_mmap())
>could allow the allocation to occur later - when we are more sure that it
>is required.
When you return from a fault, if you didn't map a new page you'll get a
second fault. If you fault you need the (maybe writable) page allocated
_now_. Maybe I am missing your point...
Many thanks for your interesting comments/thoughts.
Andrea Arcangeli
BTW, did you see my kmem_cache_destroy()? I got zero feedback about it
;). It has worked fine so far here and it's always included in my
arca-tree patches. Alexander Viro (or others? I don't remember
well) asked for it in order to avoid two-level modules.
* Re: [patch] arca-vm-2.2.5
1999-04-05 13:23 ` Mark Hemment
1999-04-05 15:56 ` Andrea Arcangeli
@ 1999-04-05 20:24 ` Horst von Brand
1999-04-05 23:25 ` Andrea Arcangeli
1999-04-17 11:12 ` Andrea Arcangeli
2 siblings, 1 reply; 51+ messages in thread
From: Horst von Brand @ 1999-04-05 20:24 UTC (permalink / raw)
To: Mark Hemment; +Cc: Andrea Arcangeli, linux-kernel, linux-mm
Mark Hemment <markhe@sco.COM> said:
[...]
> One worthwhile improvement to the page-hash is to reduce the need to
> re-check the hash after a blocking operation (ie. page allocation).
> Changing the page-hash from an array of page ptrs to an array of;
> typedef struct page_hash_s {
> struct page *ph_page;
> unsigned int ph_cookie;
> } page_hash_t;
>
> page_hash_t page_hash[PAGE_HASH_TABLE];
>
> Whenever a new page is linked into a hash line, the ph_cookie for that
> line is incremented.
If you link new pages in at the start (which would make sense, IMHO, since
they will probably be used soon), you can just use the pointer as the cookie.
--
Dr. Horst H. von Brand mailto:vonbrand@inf.utfsm.cl
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513
* Re: [patch] arca-vm-2.2.5
1999-04-05 0:22 ` Andrea Arcangeli
1999-04-05 13:23 ` Mark Hemment
@ 1999-04-05 21:31 ` Chuck Lever
1999-04-06 0:15 ` Andrea Arcangeli
1999-04-06 14:02 ` Stephen C. Tweedie
1999-04-06 14:00 ` Stephen C. Tweedie
2 siblings, 2 replies; 51+ messages in thread
From: Chuck Lever @ 1999-04-05 21:31 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel, linux-mm
On Mon, 5 Apr 1999, Andrea Arcangeli wrote:
> >first, i notice you've altered the page hash function and quadrupled the
>
> The page hash function change is from Stephen (I adopted it here too because
> I completely agreed with it). The point is that shm entries use the lower
> bits of the pagemap->offset field.
hmmm. wouldn't you think that hashing with the low order bits in the
offset would cause two different offsets against the same page to result
in the hash function generating different output? the performance
difference isn't too bad with Stephen's hash function -- i tried it out
last night, and it's on the order of 1% slower on a 233Mhz K6 for general
lookups. i didn't try shm (not sure how i would :)
> >kernel compares to one that has just the page hash changes without the
> >rest of your VM modifications? the reason i ask is because i've played
>
> The reason is that it seemed an obvious improvement. And since the table is
> statically allocated (not sized dynamically at boot as a function of the
> memory size), a somewhat larger default can be desirable; I can safely spend
> a bit more memory (a few tens of kbytes) without harming the
> available memory here. Well, as I have said many times, I think someday we'll
> need RB-trees instead of fuzzy hashes, but it's not a big issue right now
> due to the low number of pages available.
>
> Returning to your question: in my tree I enlarged the hash table to 13 bits.
> This means that in the best case I'll be able to address up to 8192 pages
> in O(1). Here I have 32752 pages, so at worst I'll have 4 pages chained on
> every hash entry. 13 bits of hash depth costs 32k of memory for the
> table (really not an issue ;).
>
> In the stock kernel, instead, the hash size is 2^11 = 2048, so in the worst
> case I would have 16 pages chained on the same hash entry.
unfortunately it doesn't work that way. i've measured the buffer hash
function in real-world tests, and found that the worst case is
significantly bad (try this: most buckets are unused, while several
buckets out of 32K have hundreds of buffers). i have a new hash
function that works very well, and even helps inter-run variance and
perceived interactive response. i'll post more on this soon.
but also the page hash function uses the hash table size as a shift value
when computing the index, so it may combine the interesting bits in a
different (worse) way when you change the hash table size. i'm planning
to instrument the page hash to see exactly what's going on.
> >that the buffer hash table is orders of magnitude larger, yet hashes about
> >the same number of objects. can someone provide history on the design of
> >the page hash function?
>
> I can't help you with this, but it looks OK to me ;). If somebody has done
> the math on it I'd like to try to understand it.
IMHO, i'd think if the buffer cache has a large hash table, you'd want the
page cache to have a table as large or larger. both track roughly the
same number of objects... and the page cache is probably used more often
than the buffer cache. i can experiment with this; i've already been
looking at this behavior for a couple of weeks.
a good reason to try this is to get immediate scalability gains in terms
of the number of objects these hash functions can handle before lookup
time becomes unacceptable.
> >also, can you tell what improvement you expect from the additional logic
> >in try_to_free_buffers() ?
>
> Eh, my shrink_mmap() is black magic, and it would take long to explain my
> reasoning ;). Well, one of the reasons is that ext2 keeps the superblock in
> use all the time, so when I reach a used buffer I put it back at the top
> of the LRU list; I don't want to fall into swapping just because there are
> some unfreeable superblocks that live forever at the end of the pagemap
> lru_list.
i agree that the whole point of the modification is to help the system
choose the best page to get rid of -- swapping a superblock is probably a
bad idea. :)
someone mentioned putting these on a separate LRU list, or something like
that. maybe you should try that instead? i suspect it might be cleaner
logic?
> Note also (you didn't ask about it, but I bet you noticed it ;) that
> in my tree I also made every pagemap entry L1 cacheline aligned. I asked the
> people who were complaining about page colouring (and I still don't know
> exactly what page colouring is; I only have a guess, but I would like to
> read something about the implementation details; pointers???) to try out my
> patch to see if it made a difference, but I had no feedback :(. I also
> made the irq_state entry cacheline aligned (once I understood the
> cacheline issue I agreed with it).
if i can, i'd like to separate out the individual modifications and try
them each compared to a stock kernel. that usually shows exactly which
changes are useful.
if you want to learn more about page coloring, here are some excellent
references:
Hennessy & Patterson, "Computer Architecture: A Quantitative Approach,"
2nd edition, Morgan Kaufman Publishers, 1998. look in the chapter on CPU
caches (i'd cite the page number here, but my copy is at home).
Lynch, William, "The Interaction of Virtual Memory and Cache Memory,"
Technical Report CSL-TR-93-587, Stanford University Department of
Electrical Engineering and Computer Science, October 1993.
- Chuck Lever
--
corporate: <chuckl@netscape.com>
personal: <chucklever@netscape.net> or <cel@monkey.org>
The Linux Scalability project:
http://www.citi.umich.edu/projects/citi-netscape/
* Re: [patch] arca-vm-2.2.5
1999-04-05 20:24 ` [patch] arca-vm-2.2.5 Horst von Brand
@ 1999-04-05 23:25 ` Andrea Arcangeli
1999-04-05 23:37 ` Horst von Brand
0 siblings, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-05 23:25 UTC (permalink / raw)
To: Horst von Brand; +Cc: Mark Hemment, linux-kernel, linux-mm
On Mon, 5 Apr 1999, Horst von Brand wrote:
>If you link new pages in at the start (which would make sense, IMHO, since
>they will probably be used soon), you can just use the pointer as the cookie.
You can have two places in the kernel that are sleeping, waiting to allocate
memory for a cache page, at the same time.
Andrea Arcangeli
* Re: [patch] arca-vm-2.2.5
1999-04-05 23:25 ` Andrea Arcangeli
@ 1999-04-05 23:37 ` Horst von Brand
1999-04-06 1:23 ` Andrea Arcangeli
0 siblings, 1 reply; 51+ messages in thread
From: Horst von Brand @ 1999-04-05 23:37 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Horst von Brand, Mark Hemment, linux-kernel, linux-mm
Andrea Arcangeli <andrea@e-mind.com> said:
> On Mon, 5 Apr 1999, Horst von Brand wrote:
> >If you link new pages in at the start (which would make sense, IMHO, since
> >they will probably be used soon), you can just use the pointer as the cookie.
> You can have two places in the kernel that are sleeping, waiting to allocate
> memory for a cache page, at the same time.
So what? One wakes up, finds the same pointer it stashed away ==> installs
the new page (changing the pointer) via the short way. The second wakes up,
finds the pointer changed ==> goes the long way to do its job.
Or am I overlooking something stupid?
--
Dr. Horst H. von Brand mailto:vonbrand@inf.utfsm.cl
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513
* Re: [patch] arca-vm-2.2.5
1999-04-05 21:31 ` Chuck Lever
@ 1999-04-06 0:15 ` Andrea Arcangeli
1999-04-06 2:14 ` Doug Ledford
1999-04-06 5:52 ` Chuck Lever
1999-04-06 14:02 ` Stephen C. Tweedie
1 sibling, 2 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-06 0:15 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-kernel, linux-mm
On Mon, 5 Apr 1999, Chuck Lever wrote:
>buckets out of 32K buckets have hundreds of buffers). i have a new hash
>function that works very well, and even helps inter-run variance and
>perceived interactive response. i'll post more on this soon.
Cool! ;)) But could you tell me _how_ you design a hash function? Do
you do the math or do you use instinct? I have never gone into the details
of benchmarking or designing a hash function, simply because I don't like
fuzzy hashes and instead want to replace them with RB-trees all over the
place (and I know the math behind RB-trees), but I'd like to learn how to
design a hash function anyway (even if I don't want to discover it
myself without reading docs, as usual ;).
>but also the page hash function uses the hash table size as a shift value
>when computing the index, so it may combine the interesting bits in a
>different (worse) way when you change the hash table size. i'm planning
>to instrument the page hash to see exactly what's going on.
Agreed. This is true. I thought about that, and I am resizing the hash
table back to the original 11 bits now (since you are confirming that I
broke the hash function).
>IMHO, i'd think if the buffer cache has a large hash table, you'd want the
>page cache to have a table as large or larger. both track roughly the
>same number of objects... and the page cache is probably used more often
Agreed.
>than the buffer cache. i can experiment with this; i've already been
>looking at this behavior for a couple of weeks.
Sure!! The page cache is far more used than the buffer cache!! Also consider
that most of the time with the page cache we have to do a _double_ query in
order to insert a new entry into the page cache.
>a good reason to try this is to get immediate scalability gains in terms
>of the number of objects these hash functions can handle before lookup
>time becomes unacceptable.
I am not going to spend time on fuzzy hashes because my instinct tells me
to drop them.
>i agree that the whole point of the modification is to help the system
>choose the best page to get rid of -- swapping a superblock is probably a
>bad idea. :)
The point is not to swap it out (it's not possible to swap out a buffer);
the point is that if the unfreeable buffer stays at the end of the LRU
list, it will increase the swapping pressure, because every time we probe
the LRU list we'll hit a fixed amount of unfreeable pages, and a fixed
unfreeable amount of lru_cache means we have to go into swap_out() to free
them or to swap out something else. As I just said, my new shrink_mmap()
has been completely redesigned and reimplemented and has nothing to do
with the old one. It made perfect sense to me, but I admit it's not as
obvious as the old shrink_mmap() was...
>someone mentioned putting these on a separate LRU list, or something like
>that. maybe you should try that instead? i suspect it might be cleaner
>logic?
Theoretically yes, but in my opinion it wouldn't make the code cleaner.
Always putting them at the top of the LRU list when we catch them while
browsing the end of the LRU list (as I am doing now) removes the issue
without inserting special code. And the superblock thing is not the only
reason I am handling the special "buffer-in-use" case in my shrink_mmap().
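The rotation described above can be sketched like this (a userspace toy list with hypothetical names, not the real shrink_mmap() code): busy pages found at the old end are moved to the young end instead of being counted as unfreeable again on every pass.

```c
#include <stddef.h>

struct lpage { int busy; struct lpage *prev, *next; };

/* circular list with a sentinel: lru.next is the young end,
 * lru.prev is the old end that reclaim probes */
static struct lpage lru = { 0, &lru, &lru };

static void lru_del(struct lpage *p)
{
	p->prev->next = p->next;
	p->next->prev = p->prev;
}

static void lru_add_front(struct lpage *p)
{
	p->next = lru.next;
	p->prev = &lru;
	lru.next->prev = p;
	lru.next = p;
}

/* Scan from the old end; free the first idle page found, and rotate
 * busy ones to the front so the next scan doesn't trip over them. */
static struct lpage *lru_reclaim(void)
{
	struct lpage *p = lru.prev;
	while (p != &lru) {
		struct lpage *prev = p->prev;
		if (!p->busy) {
			lru_del(p);
			return p;	/* freeable page found */
		}
		lru_del(p);
		lru_add_front(p);	/* busy: move to the young end */
		p = prev;
	}
	return NULL;
}
```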
>if i can, i'd like to separate out the individual modifications and try
>them each compared to a stock kernel. that usually shows exactly which
>changes are useful.
Fine. I can't do that myself simply because I don't have the time, but if
you want to do the fine-grained testing I'll be glad to send you my code
separated out and backported to 2.2.5. ;)) Thanks!!
The first thing I am interested in benchmarking is the cacheline-aligned
pagemap struct. If it doesn't improve performance I want to drop it,
because it causes a significant waste of memory.
Here is the pagemap L1-alignment patch against 2.2.5 (I hope I have
adapted it correctly to 2.2.5, even if it's a bit late and I drank some
beer before coming back home ;):
--- linux/include/linux/mm.h Tue Mar 9 01:55:28 1999
+++ mm.h Tue Apr 6 02:00:22 1999
@@ -131,0 +133,6 @@
+#ifdef __SMP__
+ /* cacheline alignment */
+ char dummy[(sizeof(void *) * 7 +
+ sizeof(unsigned long) * 2 +
+ sizeof(atomic_t)) % L1_CACHE_BYTES];
+#endif
Index: linux/mm/page_alloc.c
diff -u linux/mm/page_alloc.c:1.1.1.3 linux/mm/page_alloc.c:1.1.2.28
--- linux/mm/page_alloc.c:1.1.1.3 Tue Jan 26 19:32:27 1999
+++ linux/mm/page_alloc.c Fri Apr 2 01:12:37 1999
@@ -315,7 +318,7 @@
freepages.min = i;
freepages.low = i * 2;
freepages.high = i * 3;
- mem_map = (mem_map_t *) LONG_ALIGN(start_mem);
+ mem_map = (mem_map_t *) L1_CACHE_ALIGN(start_mem);
p = mem_map + MAP_NR(end_mem);
start_mem = LONG_ALIGN((unsigned long) p);
memset(mem_map, 0, start_mem - (unsigned long) mem_map);
And here, extracted from my CVS tree, is the other thing I am not sure
about and that you were asking for (this one wastes less memory, but I am
very curious to see what difference it makes on a 4-way SMP):
Index: arch/i386/kernel/irq.c
===================================================================
RCS file: /var/cvs/linux/arch/i386/kernel/irq.c,v
retrieving revision 1.1.1.3
retrieving revision 1.1.2.11
diff -u -r1.1.1.3 -r1.1.2.11
--- irq.c 1999/02/20 15:38:00 1.1.1.3
+++ linux/arch/i386/kernel/irq.c 1999/04/04 01:22:53 1.1.2.11
@@ -139,7 +139,7 @@
/*
* Controller mappings for all interrupt sources:
*/
-irq_desc_t irq_desc[NR_IRQS] = { [0 ... NR_IRQS-1] = { 0, &no_irq_type, }};
+irq_desc_t irq_desc[NR_IRQS] __cacheline_aligned = { [0 ... NR_IRQS-1] = { 0, &no_irq_type, }};
/*
Index: arch/i386/kernel/irq.h
===================================================================
RCS file: /var/cvs/linux/arch/i386/kernel/irq.h,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 irq.h
--- irq.h 1999/02/20 15:38:01 1.1.1.3
+++ linux/arch/i386/kernel/irq.h 1999/04/01 22:53:07
@@ -39,6 +39,9 @@
struct hw_interrupt_type *handler; /* handle/enable/disable functions */
struct irqaction *action; /* IRQ action list */
unsigned int depth; /* Disable depth for nested irq disables */
+#ifdef __SMP__
+ unsigned int unused[4];
+#endif
} irq_desc_t;
/*
>if you want to learn more about page coloring, here are some excellent
>references:
>
>Hennessy & Patterson, "Computer Architecture: A Quantitative Approach,"
>2nd edition, Morgan Kaufman Publishers, 1998. look in the chapter on CPU
>caches (i'd cite the page number here, but my copy is at home).
>
>Lynch, William, "The Interaction of Virtual Memory and Cache Memory,"
>Technical Report CSL-TR-93-587, Stanford University Department of
>Electrical Engineering and Computer Science, October 1993.
Thanks!! But sigh, I don't have much money to buy them right now... I'll
save up and buy them ASAP. In the meantime I'll keep hoping for GPL'd
docs...
Andrea Arcangeli
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-05 23:37 ` Horst von Brand
@ 1999-04-06 1:23 ` Andrea Arcangeli
0 siblings, 0 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-06 1:23 UTC (permalink / raw)
To: Horst von Brand; +Cc: Mark Hemment, linux-kernel, linux-mm
On Mon, 5 Apr 1999, Horst von Brand wrote:
>So what? One wakes up, finds the same pointer it stashed away ==> Installs
>new page (changing pointer) via short way. Second wakes up, finds pointer
>changed ==> goes long way to do its job.
>
>Or am I overlooking something stupid?
What if the page that was at the start of the chain gets removed, the page
that we are allocating gets inserted, and then the same page that was
released before gets inserted again?
page0 -> page1
remove page 0
page1
another piece of code needs ourpage and allocates it -> insert ourpage
ourpage -> page1
insert page0 again
page0 -> ourpage -> page1
Now we have successfully allocated memory for ourpage and we are going to
insert it. page0 is still here, and if we don't take the slow way we'll
add ourpage twice:
ourpage -> page0 -> ourpage -> page1
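In userspace the scenario looks like this (toy structures, hypothetical names): the pointer-equality check that the "short way" would rely on passes, even though ourpage has already entered the chain in the meantime. It is the classic ABA problem.

```c
/* Toy singly-linked hash chain standing in for one page-cache bucket. */
struct hpage { struct hpage *next; };

static struct hpage *head;

static void chain_add(struct hpage *p)
{
	p->next = head;
	head = p;
}

static void chain_del(struct hpage *p)
{
	struct hpage **pp = &head;
	while (*pp != p)
		pp = &(*pp)->next;
	*pp = p->next;
}

static int chain_contains(struct hpage *p)
{
	struct hpage *q;
	for (q = head; q; q = q->next)
		if (q == p)
			return 1;
	return 0;
}
```

Replaying the sequence above: remember the head (page0), remove page0, add ourpage, re-add page0. The head pointer is again equal to the remembered value, so a "did anything change?" pointer comparison would wrongly conclude it is safe to insert ourpage a second time.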
Andrea Arcangeli
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 0:15 ` Andrea Arcangeli
@ 1999-04-06 2:14 ` Doug Ledford
1999-04-06 13:04 ` Andrea Arcangeli
1999-04-06 5:52 ` Chuck Lever
1 sibling, 1 reply; 51+ messages in thread
From: Doug Ledford @ 1999-04-06 2:14 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Chuck Lever, linux-kernel, linux-mm
[-- Attachment #1: Type: text/plain, Size: 2895 bytes --]
Andrea Arcangeli wrote:
>
> On Mon, 5 Apr 1999, Chuck Lever wrote:
>
> >buckets out of 32K buckets have hundreds of buffers). i have a new hash
> >function that works very well, and even helps inter-run variance and
> >perceived interactive response. i'll post more on this soon.
>
> Cool! ;)) But could you tell me _how_ do you design an hash function? Are
> you doing math or do you use instinct? I never gone into the details of
> benchmarking or designing an hash function simply because I don't like
> fuzzy hash and instead I want to replace them with RB-trees all over the
> place (and I know the math about RB-trees), but I like to learn how to
> design an hash function anyway (even if I don't want to discover that
> myself without reading docs as usual ;).
>
> >but also the page hash function uses the hash table size as a shift value
> >when computing the index, so it may combine the interesting bits in a
> >different (worse) way when you change the hash table size. i'm planning
> >to instrument the page hash to see exactly what's going on.
>
> Agreed. This is true. I thought about that and I am resizing the hash
> table size to the original 11 bit now (since you are confirming that I
> broken the hash function).
>
> >IMHO, i'd think if the buffer cache has a large hash table, you'd want the
> >page cache to have a table as large or larger. both track roughly the
> >same number of objects... and the page cache is probably used more often
>
> Agreed.
Hmmm...I've talked about this a few times to Alan Cox and Stephen
Tweedie. I didn't bother to instrument the hash function because in
this case I knew it was tuned to the size of the inode structs. But, I
did implement a variable sized page cache hash table array. I did this
because at 512MB of RAM, having only 2048 hash buckets (11 bits) became
a real bottleneck in page cache hash entry lookups. Essentially, when the
number of physical pages per hash bucket gets too large, hash chains can
grow excessively long and take entirely too long to search when the page
is not in the page cache (or when it is, but it's at the end of a long
chain). Since these lookups occur as frequently as they do, the cost adds
up quickly. I'm attaching the patch I've been
using here for a while. Feel free to try it out. One of the best tests
for the effect of this patch is a bonnie run on a machine with more disk
speed than CPU power. In that case the page cache lookup becomes a
bottleneck. In other conditions, you see the same disk speed results
but with smaller % CPU utilization. On a PII400 dual machine with 512MB
RAM and a 4 Cheetah RAID0 array it made a 10MByte/s difference in
throughput. On another machine it made a roughly 6 to 8% difference in
CPU utilization.
--
Doug Ledford <dledford@redhat.com>
Opinions expressed are my own, but
they should be everybody's.
[-- Attachment #2: page-2.2.2.patch --]
[-- Type: application/octet-stream, Size: 3509 bytes --]
--- linux/include/linux/pagemap.h.page Thu Jan 28 17:43:17 1999
+++ linux/include/linux/pagemap.h Mon Mar 8 07:37:54 1999
@@ -17,13 +17,13 @@
return PAGE_OFFSET + PAGE_SIZE * (page - mem_map);
}
-#define PAGE_HASH_BITS 11
-#define PAGE_HASH_SIZE (1 << PAGE_HASH_BITS)
-
#define PAGE_AGE_VALUE 16
-
extern unsigned long page_cache_size; /* # of pages currently in the hash table */
-extern struct page * page_hash_table[PAGE_HASH_SIZE];
+extern struct page ** page_hash_table;
+extern unsigned long page_hash_size;
+extern unsigned long page_hash_mask;
+extern unsigned long page_hash_bits;
+extern long page_cache_init(long, long);
/*
* We use a power-of-two hash table to avoid a modulus,
@@ -35,8 +35,8 @@
{
#define i (((unsigned long) inode)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
#define o (offset >> PAGE_SHIFT)
-#define s(x) ((x)+((x)>>PAGE_HASH_BITS))
- return s(i+o) & (PAGE_HASH_SIZE-1);
+#define s(x) ((x)+((x)>>page_hash_bits))
+ return s(i+o) & page_hash_mask;
#undef i
#undef o
#undef s
--- linux/init/main.c.page Mon Feb 22 22:44:53 1999
+++ linux/init/main.c Mon Mar 8 07:37:54 1999
@@ -19,6 +19,7 @@
#include <linux/utsname.h>
#include <linux/ioport.h>
#include <linux/init.h>
+#include <linux/pagemap.h>
#include <linux/smp_lock.h>
#include <linux/blk.h>
#include <linux/hdreg.h>
@@ -1149,6 +1150,7 @@
initrd_start = 0;
}
#endif
+ memory_start = page_cache_init(memory_start, memory_end);
mem_init(memory_start,memory_end);
kmem_cache_sizes_init();
#ifdef CONFIG_PROC_FS
--- linux/mm/filemap.c.page Mon Feb 22 22:44:53 1999
+++ linux/mm/filemap.c Mon Mar 8 07:37:54 1999
@@ -13,6 +13,7 @@
#include <linux/shm.h>
#include <linux/mman.h>
#include <linux/locks.h>
+#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/smp_lock.h>
@@ -23,6 +24,7 @@
#include <asm/pgtable.h>
#include <asm/uaccess.h>
+#include <asm/io.h>
/*
* Shared mappings implemented 30.11.1994. It's not fully working yet,
@@ -32,7 +34,45 @@
*/
unsigned long page_cache_size = 0;
-struct page * page_hash_table[PAGE_HASH_SIZE];
+unsigned long page_hash_size = 0;
+unsigned long page_hash_bits = 0;
+unsigned long page_hash_mask = 0;
+struct page ** page_hash_table = NULL;
+
+/*
+ * The init function that starts off our page cache
+ */
+long __init page_cache_init( long mem_begin, long mem_end )
+{
+ long cache_begin;
+ long i;
+
+ for(i=0; i < sizeof(long) * 8; i++)
+ if(virt_to_phys((void *)mem_end) & (1<<i))
+ page_hash_bits = i;
+ page_hash_bits -= (PAGE_SHIFT + 2);
+ if(page_hash_bits < 10)
+ page_hash_bits = 10;
+ cache_begin = (mem_begin + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
+ page_hash_size = 1 << page_hash_bits;
+ page_hash_mask = page_hash_size - 1;
+ if ( ((virt_to_phys((void *)cache_begin) <= 0xfffff) &&
+ (virt_to_phys((void *)cache_begin) >= (0x9f000 - page_hash_size))) )
+ {
+ /*
+ * Our structure would fall into the middle of the upper
+ * BIOS area of the computer, go above it to the 1MB mark
+ * and start from there
+ */
+ cache_begin = (long)phys_to_virt(0x100000);
+ }
+ page_hash_table = (struct page **)cache_begin;
+ memset(page_hash_table, 0, page_hash_size * sizeof(void *));
+ printk(KERN_INFO "page_cache_init: allocated %ldk of RAM with %ld hash "
+ "buckets.\n", (page_hash_size * sizeof(void *)) / 1024,
+ page_hash_size);
+ return(cache_begin + (page_hash_size * sizeof(void *)));
+}
/*
* Simple routines for both non-shared and shared mappings.
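For what it's worth, the sizing rule in page_cache_init() above can be replayed in userspace. This is my reading of the patch, with PAGE_SHIFT assumed to be 12 (i386): take the highest set bit of the memory size, subtract PAGE_SHIFT + 2 (roughly one bucket per four pages), and clamp at 10 bits. The clamp here uses a signed compare to sidestep the unsigned underflow the patch's version would hit on very small memory sizes. On Doug's 512MB machine this yields 15 bits, i.e. the 32K buckets Chuck mentioned.

```c
#define SIM_PAGE_SHIFT 12	/* assumed i386 value */

static unsigned long page_hash_bits_for(unsigned long mem_bytes)
{
	unsigned long i, bits = 0;

	for (i = 0; i < sizeof(long) * 8; i++)	/* find highest set bit */
		if (mem_bytes & (1UL << i))
			bits = i;
	/* one bucket per four pages, never fewer than 2^10 buckets */
	if ((long)(bits - (SIM_PAGE_SHIFT + 2)) < 10)
		return 10;
	return bits - (SIM_PAGE_SHIFT + 2);
}
```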
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 0:15 ` Andrea Arcangeli
1999-04-06 2:14 ` Doug Ledford
@ 1999-04-06 5:52 ` Chuck Lever
1999-04-06 13:09 ` Andrea Arcangeli
1999-04-06 16:19 ` Eric W. Biederman
1 sibling, 2 replies; 51+ messages in thread
From: Chuck Lever @ 1999-04-06 5:52 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel, linux-mm
On Tue, 6 Apr 1999, Andrea Arcangeli wrote:
> Cool! ;)) But could you tell me _how_ do you design an hash function? Are
> you doing math or do you use instinct?
math. i'll post something about this soon.
> >but also the page hash function uses the hash table size as a shift value
> >when computing the index, so it may combine the interesting bits in a
> >different (worse) way when you change the hash table size. i'm planning
> >to instrument the page hash to see exactly what's going on.
>
> Agreed. This is true. I thought about that and I am resizing the hash
> table size to the original 11 bit now (since you are confirming that I
> broken the hash function).
i looked at doug's patch too, and it changes the "page_shift" value
depending on the size of the hash table. again, this *may* cause unwanted
interactions making the hash function degenerate for certain table sizes.
but i'd like to instrument the hash to watch what really happens.
i ran some simple benchmarks on our 4-way Xeon PowerEdge to see what the
effects of your patches are. here are the original patches against
2.2.5.
the page struct alignment patch:
> --- linux/include/linux/mm.h Tue Mar 9 01:55:28 1999
> +++ mm.h Tue Apr 6 02:00:22 1999
> @@ -131,0 +133,6 @@
> +#ifdef __SMP__
> + /* cacheline alignment */
> + char dummy[(sizeof(void *) * 7 +
> + sizeof(unsigned long) * 2 +
> + sizeof(atomic_t)) % L1_CACHE_BYTES];
> +#endif
> Index: linux/mm/page_alloc.c
> diff -u linux/mm/page_alloc.c:1.1.1.3 linux/mm/page_alloc.c:1.1.2.28
> --- linux/mm/page_alloc.c:1.1.1.3 Tue Jan 26 19:32:27 1999
> +++ linux/mm/page_alloc.c Fri Apr 2 01:12:37 1999
> @@ -315,7 +318,7 @@
> freepages.min = i;
> freepages.low = i * 2;
> freepages.high = i * 3;
> - mem_map = (mem_map_t *) LONG_ALIGN(start_mem);
> + mem_map = (mem_map_t *) L1_CACHE_ALIGN(start_mem);
> p = mem_map + MAP_NR(end_mem);
> start_mem = LONG_ALIGN((unsigned long) p);
> memset(mem_map, 0, start_mem - (unsigned long) mem_map);
and the irq alignment patch:
> Index: arch/i386/kernel/irq.c
> ===================================================================
> RCS file: /var/cvs/linux/arch/i386/kernel/irq.c,v
> retrieving revision 1.1.1.3
> retrieving revision 1.1.2.11
> diff -u -r1.1.1.3 -r1.1.2.11
> --- irq.c 1999/02/20 15:38:00 1.1.1.3
> +++ linux/arch/i386/kernel/irq.c 1999/04/04 01:22:53 1.1.2.11
> @@ -139,7 +139,7 @@
> /*
> * Controller mappings for all interrupt sources:
> */
> -irq_desc_t irq_desc[NR_IRQS] = { [0 ... NR_IRQS-1] = { 0, &no_irq_type, }};
> +irq_desc_t irq_desc[NR_IRQS] __cacheline_aligned = { [0 ... NR_IRQS-1] = { 0, &no_irq_type, }};
>
> /*
> Index: arch/i386/kernel/irq.h
> ===================================================================
> RCS file: /var/cvs/linux/arch/i386/kernel/irq.h,v
> retrieving revision 1.1.1.3
> diff -u -r1.1.1.3 irq.h
> --- irq.h 1999/02/20 15:38:01 1.1.1.3
> +++ linux/arch/i386/kernel/irq.h 1999/04/01 22:53:07
> @@ -39,6 +39,9 @@
> struct hw_interrupt_type *handler; /* handle/enable/disable functions */
> struct irqaction *action; /* IRQ action list */
> unsigned int depth; /* Disable depth for nested irq disables */
> +#ifdef __SMP__
> + unsigned int unused[4];
> +#endif
> } irq_desc_t;
>
> /*
i ran 128 concurrent scripts on our 512M PowerEdge. each instantiation of
the script runs the same programs in the same sequence. the script is
designed to emulate a software development workload, so it contains
commands like cpio, cc, nroff, and ed. the VM+file working set was
contained in less than 150M, so this benchmark was CPU and memory bound.
i tested 4 different kernels:
ref: a stock 2.2.5 kernel
p-al: a stock 2.2.5 kernel with your page struct alignment patch applied
irq: a stock 2.2.5 kernel with your irq alignment patch applied
both: a stock 2.2.5 kernel with both patches applied
all kernels were compiled using egcs-1.1.1 with the same .config and
compiler optimizations.
the benchmark numbers are average throughput in "scripts per hour" for 4
consecutive runs. this value is computed by dividing the number of
concurrent scripts by the elapsed time (in hours) for all scripts to
complete. (s= is standard deviation; it indicates roughly the inter-run
variance)
ref: 4176.4 (s=27.45)
p-al: 4207.9 (s=8.1)
irq: 4228.8 (s=11.70)
both: 4207.9 (s=13.34)
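as arithmetic, the metric works out like this (my reading of the methodology, not the actual harness): all N scripts start together, so throughput is N divided by the elapsed time in hours. for the reference kernel, 4176 scripts/hour with 128 scripts implies each run took roughly 110 seconds.

```c
/* scripts/hour = N * 3600 / elapsed_seconds, since all N scripts run
 * concurrently and finish within elapsed_sec. */
static double scripts_per_hour(int nscripts, double elapsed_sec)
{
	return nscripts * 3600.0 / elapsed_sec;
}
```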
the irq patch is a clear win over the reference kernel: it shows a
consistent 1.25% improvement in overall throughput, and the performance
difference is more than a standard deviation. also, the variance appears
to be less with the irq kernel. i would bet that on a more I/O-bound load
the improvement would be even more stark.
i'm not certain why the combination kernel performance was worse than the
irq-only kernel.
> >Lynch, William, "The Interaction of Virtual Memory and Cache Memory,"
> >Technical Report CSL-TR-93-587, Stanford University Department of
> >Electrical Engineering and Computer Science, October 1993.
>
> Thanks!! but sigh, I don't have too much money to buy them right now...
> But I'll save them and I'll buy them ASAP. In the meantime I'll be in the
> hope of GPL'd docs...
"Lynch" is a PhD thesis available in postscript at Stanford's web site for
free. it's a study of different coloring methodologies, so it's fairly
broad.
- Chuck Lever
--
corporate: <chuckl@netscape.com>
personal: <chucklever@netscape.net> or <cel@monkey.org>
The Linux Scalability project:
http://www.citi.umich.edu/projects/citi-netscape/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 2:14 ` Doug Ledford
@ 1999-04-06 13:04 ` Andrea Arcangeli
1999-04-06 21:31 ` Stephen C. Tweedie
0 siblings, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-06 13:04 UTC (permalink / raw)
To: Doug Ledford; +Cc: Chuck Lever, linux-kernel, linux-mm
On Mon, 5 Apr 1999, Doug Ledford wrote:
>Hmmm...I've talked about this a few times to Alan Cox and Stephen
>Tweedie. I didn't bother to instrument the hash function because in
>this case I knew it was tuned to the size of the inode structs. But, I
>did implement a variable sized page cache hash table array. I did this
Well, it's straightforward. I did the same some time ago for the buffer
hash table. But I agree with Chuck that enlarging the hash size could harm
the hash function's distribution (I should think about it some more
though). I also think I'll implement the cookie thing suggested by Mark,
since I am too curious to see how much it will help (even if my mind is
drawn to RB-trees ;).
Andrea Arcangeli
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 5:52 ` Chuck Lever
@ 1999-04-06 13:09 ` Andrea Arcangeli
1999-04-06 16:19 ` Eric W. Biederman
1 sibling, 0 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-06 13:09 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-kernel, linux-mm
On Tue, 6 Apr 1999, Chuck Lever wrote:
>math. i'll post something about this soon.
Cool! thanks.
>i looked at doug's patch too, and it changes the "page_shift" value
>depending on the size of the hash table. again, this *may* cause unwanted
>interactions making the hash function degenerate for certain table sizes.
Agreed.
>ref: a stock 2.2.5 kernel
>
>p-al: a stock 2.2.5 kernel with your page struct alignment patch applied
>
>irq: a stock 2.2.5 kernel with your irq alignment patch applied
>
>both: a stock 2.2.5 kernel with both patches applied
*snip*
>ref: 4176.4 (s=27.45)
>
>p-al: 4207.9 (s=8.1)
^^^ it made a _difference_
>
>irq: 4228.8 (s=11.70)
>
>both: 4207.9 (s=13.34)
^^^^^ strange...
>the irq patch is a clear win over the reference kernel: it shows a
Good ;)
>consistent 1.25% improvement in overall throughput, and the performance
>difference is more than a standard deviation. also, the variance appears
>to be less with the irq kernel. i would bet on a more I/O bound load the
>improvement would be even more stark.
>
>i'm not certain why the combination kernel performance was worse than the
>irq-only kernel.
Hmm I'll think about that...
>"Lynch" is a PhD thesis available in postscript at Stanford's web site for
>free. it's a study of different coloring methodologies, so it's fairly
>broad.
Thanks!! I'll search for it soon.
Andrea Arcangeli
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-05 0:22 ` Andrea Arcangeli
1999-04-05 13:23 ` Mark Hemment
1999-04-05 21:31 ` Chuck Lever
@ 1999-04-06 14:00 ` Stephen C. Tweedie
1999-04-06 16:29 ` Andrea Arcangeli
2 siblings, 1 reply; 51+ messages in thread
From: Stephen C. Tweedie @ 1999-04-06 14:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Chuck Lever, linux-kernel, linux-mm, Stephen C. Tweedie, Linus Torvalds
Hi,
On Mon, 5 Apr 1999 02:22:35 +0200 (CEST), Andrea Arcangeli
<andrea@e-mind.com> said:
> The page hash function change is from Stephen (I did it here too because I
> completly agreed with it). The point is that shm entries uses the lower
> bits of the pagemap->offset field.
_All_ swap entries do. shm entries never enter the page cache so that's
not a problem, but the swap cache _is_ a problem.
> Eh, my shrink_mmap() is is a black magic and it's long to explain what I
> thought ;).
It is hard to have a meaningful discussion about it otherwise!
--Stephen
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-05 21:31 ` Chuck Lever
1999-04-06 0:15 ` Andrea Arcangeli
@ 1999-04-06 14:02 ` Stephen C. Tweedie
1999-04-06 15:38 ` Chuck Lever
1 sibling, 1 reply; 51+ messages in thread
From: Stephen C. Tweedie @ 1999-04-06 14:02 UTC (permalink / raw)
To: Chuck Lever; +Cc: Andrea Arcangeli, linux-kernel, linux-mm, Stephen Tweedie
Hi,
On Mon, 5 Apr 1999 17:31:43 -0400 (EDT), Chuck Lever <cel@monkey.org>
said:
> hmmm. wouldn't you think that hashing with the low order bits in the
> offset would cause two different offsets against the same page to result
> in the hash function generating different output?
We always, always use page-aligned lookups for the page cache.
(Actually there is one exception: certain obsolete a.out binaries, which
are demand paged with the pages beginning at offset 1K into the binary.
We don't support cache coherency for those and we don't support them at
all on filesystems with a >1k block size. It doesn't impact on the hash
issue.)
--Stephen
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 14:02 ` Stephen C. Tweedie
@ 1999-04-06 15:38 ` Chuck Lever
1999-04-06 17:16 ` Andrea Arcangeli
0 siblings, 1 reply; 51+ messages in thread
From: Chuck Lever @ 1999-04-06 15:38 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Andrea Arcangeli, linux-kernel, linux-mm
On Tue, 6 Apr 1999, Stephen C. Tweedie wrote:
> On Mon, 5 Apr 1999 17:31:43 -0400 (EDT), Chuck Lever <cel@monkey.org>
> said:
>
> > hmmm. wouldn't you think that hashing with the low order bits in the
> > offset would cause two different offsets against the same page to result
> > in the hash function generating different output?
>
> We always, always use page-aligned lookups for the page cache.
> (Actually there is one exception: certain obsolete a.out binaries, which
> are demand paged with the pages beginning at offset 1K into the binary.
> We don't support cache coherency for those and we don't support them at
> all on filesystems with a >1k block size. It doesn't impact on the hash
> issue.)
i guess i'm confused then. what good does this change do:
2.2.5 pagemap.h:
#define i (((unsigned long) inode)/(sizeof(struct inode) &
~ (sizeof(struct inode) - 1)))
#define o (offset >> PAGE_SHIFT)
#define s(x) ((x)+((x)>>PAGE_HASH_BITS))
return s(i+o) & (PAGE_HASH_SIZE-1);
2.2.5-arca pagemap.h:
return s(i+o+offset) & (PAGE_HASH_SIZE-1);
^^^^^^^
btw, do you know if there is a special reason why PAGE_HASH_BITS is 11?
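for the record, a quick userspace replay of the two variants (with the i386 values PAGE_SHIFT = 12 and PAGE_HASH_BITS = 11 assumed, and the scaled inode term reduced to a plain integer) shows the extra "+offset" term *does* change the index even for page-aligned offsets, because the fold s(x) pulls the high offset bits back into the low bits of the index. whether it changes the distribution for the better is another question.

```c
#define SIM_PAGE_SHIFT     12	/* assumed i386 value */
#define SIM_PAGE_HASH_BITS 11
#define SIM_HASH_MASK      ((1UL << SIM_PAGE_HASH_BITS) - 1)

/* the fold s(x) shared by both variants */
static unsigned long fold(unsigned long x)
{
	return x + (x >> SIM_PAGE_HASH_BITS);
}

/* stock 2.2.5: only the page index of the offset goes in */
static unsigned long hash_225(unsigned long inode, unsigned long offset)
{
	return fold(inode + (offset >> SIM_PAGE_SHIFT)) & SIM_HASH_MASK;
}

/* arca variant: the raw offset is added as well */
static unsigned long hash_arca(unsigned long inode, unsigned long offset)
{
	return fold(inode + (offset >> SIM_PAGE_SHIFT) + offset) & SIM_HASH_MASK;
}
```

e.g. with inode = 0 and the page-aligned offset 3 << 12, the stock function yields bucket 3 while the arca variant yields bucket 9.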
- Chuck Lever
--
corporate: <chuckl@netscape.com>
personal: <chucklever@netscape.net> or <cel@monkey.org>
The Linux Scalability project:
http://www.citi.umich.edu/projects/citi-netscape/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 5:52 ` Chuck Lever
1999-04-06 13:09 ` Andrea Arcangeli
@ 1999-04-06 16:19 ` Eric W. Biederman
1999-04-06 20:26 ` Andrea Arcangeli
1 sibling, 1 reply; 51+ messages in thread
From: Eric W. Biederman @ 1999-04-06 16:19 UTC (permalink / raw)
To: Chuck Lever; +Cc: Andrea Arcangeli, linux-kernel, linux-mm
>>>>> "CL" == Chuck Lever <cel@monkey.org> writes:
CL> On Tue, 6 Apr 1999, Andrea Arcangeli wrote:
>> Cool! ;)) But could you tell me _how_ do you design an hash function? Are
>> you doing math or do you use instinct?
CL> math. i'll post something about this soon.
>> >but also the page hash function uses the hash table size as a shift value
>> >when computing the index, so it may combine the interesting bits in a
>> >different (worse) way when you change the hash table size. i'm planning
>> >to instrument the page hash to see exactly what's going on.
>>
>> Agreed. This is true. I thought about that and I am resizing the hash
>> table size to the original 11 bit now (since you are confirming that I
>> broken the hash function).
CL> i looked at doug's patch too, and it changes the "page_shift" value
CL> depending on the size of the hash table. again, this *may* cause unwanted
CL> interactions making the hash function degenerate for certain table sizes.
CL> but i'd like to instrument the hash to watch what really happens.
CL> i ran some simple benchmarks on our 4-way Xeon PowerEdge to see what are
CL> the effects of your patches. here were the original patches against
CL> 2.2.5.
CL> the page struct alignment patch:
>> --- linux/include/linux/mm.h Tue Mar 9 01:55:28 1999
>> +++ mm.h Tue Apr 6 02:00:22 1999
>> @@ -131,0 +133,6 @@
>> +#ifdef __SMP__
>> + /* cacheline alignment */
>> + char dummy[(sizeof(void *) * 7 +
>> + sizeof(unsigned long) * 2 +
>> + sizeof(atomic_t)) % L1_CACHE_BYTES];
>> +#endif
Am I the only one who noticed that this little bit of code is totally
wrong? It happens to come out right for cache line sizes of 16 and 32 with
the current struct page, but the formula is 100% backwards.
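Spelled out (hypothetical helper, not from the patch): to pad a struct of `size` bytes out to a cache line boundary, the filler must be the *complement* of the remainder, not the remainder itself. The patch's formula only agrees with the correct one when the remainder happens to be exactly half a line, which is presumably why it comes out right for 16 and 32 with the current struct page.

```c
/* Correct padding: bytes needed to round `size` up to a multiple of
 * `line`.  The patch instead used plain `size % line`. */
static unsigned int pad_to_line(unsigned int size, unsigned int line)
{
	return (line - size % line) % line;
}
```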
Eric
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 14:00 ` Stephen C. Tweedie
@ 1999-04-06 16:29 ` Andrea Arcangeli
0 siblings, 0 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-06 16:29 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Chuck Lever, linux-kernel, linux-mm
On Tue, 6 Apr 1999, Stephen C. Tweedie wrote:
>_All_ swap entries do. shm entries never enter the page cache so that's
My mistake, excuse me.
Andrea Arcangeli
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 15:38 ` Chuck Lever
@ 1999-04-06 17:16 ` Andrea Arcangeli
1999-04-06 18:07 ` Andrea Arcangeli
1999-04-06 20:47 ` Chuck Lever
0 siblings, 2 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-06 17:16 UTC (permalink / raw)
To: Chuck Lever; +Cc: Stephen C. Tweedie, linux-kernel, linux-mm
On Tue, 6 Apr 1999, Chuck Lever wrote:
>> We always, always use page-aligned lookups for the page cache.
>> (Actually there is one exception: certain obsolete a.out binaries, which
>> are demand paged with the pages beginning at offset 1K into the binary.
>> We don't support cache coherency for those and we don't support them at
>> all on filesystems with a >1k block size. It doesn't impact on the hash
>> issue.)
>
>i guess i'm confused then. what good does this change do:
Hmm, I think I misunderstood your point, Chuck. I thought you were
complaining that some hash buckets could be unused while others get
overloaded, but I was just assuming that the offset is always
page-aligned. I could write a simulation to check the hash function...
Andrea Arcangeli
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 17:16 ` Andrea Arcangeli
@ 1999-04-06 18:07 ` Andrea Arcangeli
1999-04-06 21:22 ` Stephen C. Tweedie
1999-04-06 20:47 ` Chuck Lever
1 sibling, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-06 18:07 UTC (permalink / raw)
To: Chuck Lever; +Cc: Stephen C. Tweedie, linux-kernel, linux-mm
On Tue, 6 Apr 1999, Andrea Arcangeli wrote:
>I could write a simulation to check the hash function...
I was looking at the inode pointer part of the hash function and I think
something like this should be better.
Index: include/linux/pagemap.h
===================================================================
RCS file: /var/cvs/linux/include/linux/pagemap.h,v
retrieving revision 1.1.2.14
diff -u -r1.1.2.14 pagemap.h
--- pagemap.h 1999/04/05 23:33:20 1.1.2.14
+++ linux/include/linux/pagemap.h 1999/04/06 18:08:32
@@ -32,7 +39,7 @@
*/
static inline unsigned long _page_hashfn(struct inode * inode, unsigned long offset)
{
-#define i (((unsigned long) inode)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
+#define i (((unsigned long) inode-PAGE_OFFSET)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
#define o (offset >> PAGE_SHIFT)
#define s(x) ((x)+((x)>>PAGE_HASH_BITS))
return s(i+o) & (PAGE_HASH_SIZE-1);
I am not sure whether it will make a difference, but at least it looks
smarter to me because it removes an uninteresting constant from the
input of the hash.
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 16:19 ` Eric W. Biederman
@ 1999-04-06 20:26 ` Andrea Arcangeli
1999-04-07 5:00 ` Eric W. Biederman
0 siblings, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-06 20:26 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Chuck Lever, linux-kernel, linux-mm
On 6 Apr 1999, Eric W. Biederman wrote:
>>> --- linux/include/linux/mm.h Tue Mar 9 01:55:28 1999
>>> +++ mm.h Tue Apr 6 02:00:22 1999
>>> @@ -131,0 +133,6 @@
>>> +#ifdef __SMP__
>>> + /* cacheline alignment */
>>> + char dummy[(sizeof(void *) * 7 +
>>> + sizeof(unsigned long) * 2 +
>>> + sizeof(atomic_t)) % L1_CACHE_BYTES];
>>> +#endif
>
>Am I the only one to notice that this little bit of code is totally wrong?
>It happens to get it right for cache sizes of 16 & 32 with the current struct
>page, but the code is 100% backwards.
Does something like this look better?
Index: mm/page_alloc.c
===================================================================
RCS file: /var/cvs/linux/mm/page_alloc.c,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 page_alloc.c
--- page_alloc.c 1999/01/26 18:32:27 1.1.1.3
+++ linux/mm/page_alloc.c 1999/04/06 20:04:59
@@ -315,7 +315,7 @@
freepages.min = i;
freepages.low = i * 2;
freepages.high = i * 3;
- mem_map = (mem_map_t *) LONG_ALIGN(start_mem);
+ mem_map = (mem_map_t *) L1_CACHE_ALIGN(start_mem);
p = mem_map + MAP_NR(end_mem);
start_mem = LONG_ALIGN((unsigned long) p);
memset(mem_map, 0, start_mem - (unsigned long) mem_map);
Index: include/linux/mm.h
===================================================================
RCS file: /var/cvs/linux/include/linux/mm.h,v
retrieving revision 1.1.1.4
diff -u -r1.1.1.4 mm.h
--- mm.h 1999/03/09 00:55:28 1.1.1.4
+++ linux/include/linux/mm.h 1999/04/06 20:24:43
@@ -129,6 +129,14 @@
struct wait_queue *wait;
struct page **pprev_hash;
struct buffer_head * buffers;
+#ifdef __SMP__
+#define MEM_MAP_L1_WRAP ((sizeof(void *) * 7 + \
+ sizeof(unsigned long) * 2 + \
+ sizeof(atomic_t)) % L1_CACHE_BYTES)
+ /* cacheline alignment */
+ char dummy[MEM_MAP_L1_WRAP ? L1_CACHE_BYTES - MEM_MAP_L1_WRAP : 0];
+#undef MEM_MAP_L1_WRAP
+#endif
} mem_map_t;
/* Page flag bit values */
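(The padding arithmetic in the mm.h hunk can be sanity-checked in user space; L1_CACHE_BYTES and the stand-in field types below are assumptions, not the real kernel layout:)

```c
#include <assert.h>

#define L1_CACHE_BYTES 32		/* assumed; arch-dependent */

/* User-space stand-in for the 2.2 struct page: 7 pointers, 2 longs and
 * an atomic_t (modeled as a plain int), followed by the pad computed
 * exactly as in the patch. */
#define MEM_MAP_L1_WRAP ((sizeof(void *) * 7 + \
			  sizeof(unsigned long) * 2 + \
			  sizeof(int)) % L1_CACHE_BYTES)
struct fake_page {
	void *p[7];
	unsigned long l[2];
	int count;			/* atomic_t */
	/* cacheline alignment */
	char dummy[MEM_MAP_L1_WRAP ? L1_CACHE_BYTES - MEM_MAP_L1_WRAP : 0];
};
```

If the pad is right, sizeof(struct fake_page) is an exact multiple of the cache line, so an aligned mem_map array keeps every entry on its own line(s). (The zero-length array case relies on the GNU C extension, as the kernel does.)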
Thanks.
Andrea Arcangeli
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 17:16 ` Andrea Arcangeli
1999-04-06 18:07 ` Andrea Arcangeli
@ 1999-04-06 20:47 ` Chuck Lever
1999-04-06 21:04 ` Andrea Arcangeli
1999-04-06 21:11 ` Stephen C. Tweedie
1 sibling, 2 replies; 51+ messages in thread
From: Chuck Lever @ 1999-04-06 20:47 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Stephen C. Tweedie, linux-kernel, linux-mm
On Tue, 6 Apr 1999, Andrea Arcangeli wrote:
> >i guess i'm confused then. what good does this change do:
>
> Hmm, I think I misunderstood your point, Chuck. I thought you were
> complaining about the fact that some hash entries could be unused and others
> overloaded
yes, you understood correctly.
> but I was just assuming that the offset is always page-aligned.
> I could write a simulation to check the hash function...
i didn't realize that the "offset" argument would always be page-aligned.
but still, why does it help to add the unshifted "offset"? doesn't seem
like there's any new information in that.
- Chuck Lever
--
corporate: <chuckl@netscape.com>
personal: <chucklever@netscape.net> or <cel@monkey.org>
The Linux Scalability project:
http://www.citi.umich.edu/projects/citi-netscape/
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 20:47 ` Chuck Lever
@ 1999-04-06 21:04 ` Andrea Arcangeli
1999-04-06 21:11 ` Stephen C. Tweedie
1 sibling, 0 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-06 21:04 UTC (permalink / raw)
To: Chuck Lever; +Cc: Stephen C. Tweedie, linux-kernel, linux-mm
On Tue, 6 Apr 1999, Chuck Lever wrote:
>but still, why does it help to add the unshifted "offset"? doesn't seem
>like there's any new information in that.
The unshifted offset is useful for swap cache pages. When the page is a
swap cache page, page->offset really is the swap entry that tells us
where the page has been swapped out, and in that case page->offset is not
page-aligned.
Andrea Arcangeli
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 20:47 ` Chuck Lever
1999-04-06 21:04 ` Andrea Arcangeli
@ 1999-04-06 21:11 ` Stephen C. Tweedie
1 sibling, 0 replies; 51+ messages in thread
From: Stephen C. Tweedie @ 1999-04-06 21:11 UTC (permalink / raw)
To: Chuck Lever; +Cc: Andrea Arcangeli, Stephen C. Tweedie, linux-kernel, linux-mm
Hi,
On Tue, 6 Apr 1999 16:47:23 -0400 (EDT), Chuck Lever <cel@monkey.org>
said:
>> but I was just assuming that the offset is always page-aligned.
>> I could write a simulation to check the hash function...
> i didn't realize that the "offset" argument would always be page-aligned.
> but still, why does it help to add the unshifted "offset"? doesn't seem
> like there's any new information in that.
It is always aligned for the page cache (hence mixing in the lower bits
to the hash function shouldn't change anything), but the swap cache uses
the lower bits extensively, concentrating the swap cache into just a few
hash buckets unless we make this change.
--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 18:07 ` Andrea Arcangeli
@ 1999-04-06 21:22 ` Stephen C. Tweedie
1999-04-06 22:19 ` Ingo Molnar
1999-04-06 22:31 ` Andrea Arcangeli
0 siblings, 2 replies; 51+ messages in thread
From: Stephen C. Tweedie @ 1999-04-06 21:22 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Chuck Lever, Stephen C. Tweedie, linux-kernel, linux-mm
Hi,
On Tue, 6 Apr 1999 20:07:57 +0200 (CEST), Andrea Arcangeli
<andrea@e-mind.com> said:
> I was looking at the inode pointer part of the hash function and I think
> something like this should be better.
> -#define i (((unsigned long) inode)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
> +#define i (((unsigned long) inode-PAGE_OFFSET)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
This just ends up adding or subtracting a constant to the hash function,
so won't have any effect at all on the occupancy distribution of the
hash buckets.
--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 13:04 ` Andrea Arcangeli
@ 1999-04-06 21:31 ` Stephen C. Tweedie
1999-04-06 22:27 ` Andrea Arcangeli
0 siblings, 1 reply; 51+ messages in thread
From: Stephen C. Tweedie @ 1999-04-06 21:31 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Doug Ledford, Chuck Lever, linux-kernel, linux-mm, Stephen Tweedie
Hi,
On Tue, 6 Apr 1999 15:04:36 +0200 (CEST), Andrea Arcangeli
<andrea@e-mind.com> said:
> I also think that I'll implement the cookie thing suggested by Mark, since
> I am too curious to see how much it will help (even if my mind is
> driven by RB-trees ;).
Trees are bad for this sort of thing in general: they can be as fast as
hashing for lookup, but are much slower for insert and delete
operations. Insert and delete for the page cache _must_ be fast.
Expanding a hash table does not cost enough to offset the performance
problems.
--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 21:22 ` Stephen C. Tweedie
@ 1999-04-06 22:19 ` Ingo Molnar
1999-04-06 22:40 ` David Miller
1999-04-08 8:09 ` Carlo Daffara
1999-04-06 22:31 ` Andrea Arcangeli
1 sibling, 2 replies; 51+ messages in thread
From: Ingo Molnar @ 1999-04-06 22:19 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Andrea Arcangeli, Chuck Lever, linux-kernel, linux-mm
On Tue, 6 Apr 1999, Stephen C. Tweedie wrote:
-#define i (((unsigned long) inode)/(sizeof(struct inode) \
& ~ (sizeof(struct inode) - 1)))
+#define i (((unsigned long) inode-PAGE_OFFSET)/(sizeof(struct inode) \
& ~ (sizeof(struct inode) - 1)))
> This just ends up adding or subtracting a constant to the hash function,
> so won't have any effect at all on the occupancy distribution of the
> hash buckets.
btw, shouldn't it rather be something like:
#define log2(x) \
({ \
int __res; \
\
switch (x) { \
case 0x040 ... 0x07f: __res = 0x040; break; \
case 0x080 ... 0x0ff: __res = 0x080; break; \
case 0x100 ... 0x1ff: __res = 0x100; break; \
case 0x200 ... 0x3ff: __res = 0x200; break; \
case 0x400 ... 0x7ff: __res = 0x400; break; \
case 0x800 ... 0xfff: __res = 0x800; break; \
} \
__res; \
})
#define i (((unsigned long) inode)/log2(sizeof(struct inode)))
because otherwise we 'over-estimate' and include systematic bits in the
supposedly random index. E.g. in 2.2.5 the size of struct inode is 0x110,
which means 'i' will have the values 0x11, 0x22, 0x33 for inodes within the
same page. (we'd preferably have i == 0x01, 0x02, 0x03...)
the hash was designed when struct inode had a size of 512 i think, for
which the original #define was correct. But for 0x110 it isn't, IMHO. (my
above macro produces i == 0, 1, 2 ... 0xf iterated over inode addresses
within the same physical page.)
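(The rounding the switch-based macro performs — down to the nearest power of two — can be written generically without enumerating the cases; the function name here is hypothetical:)

```c
#include <assert.h>

/* Round x down to the nearest power of two, for x >= 1. */
static unsigned long floor_pow2(unsigned long x)
{
	unsigned long p = 1;

	while (p * 2 <= x)
		p *= 2;
	return p;
}
```

With sizeof(struct inode) == 0x110 this yields 0x100, so dividing inode addresses by it maps consecutive inodes in a page to roughly consecutive 'i' values, as described above.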
-- mingo
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 21:31 ` Stephen C. Tweedie
@ 1999-04-06 22:27 ` Andrea Arcangeli
1999-04-07 12:27 ` Stephen C. Tweedie
0 siblings, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-06 22:27 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Doug Ledford, Chuck Lever, linux-kernel, linux-mm,
Linus Torvalds, Mark Hemment
On Tue, 6 Apr 1999, Stephen C. Tweedie wrote:
>Trees are bad for this sort of thing in general: they can be as fast as
>hashing for lookup, but are much slower for insert and delete
>operations. Insert and delete for the page cache _must_ be fast.
It's not so obvious to me. I sure agree that an O(n) insertion/deletion is
far too slow, but O(log(n)) for everything could be reasonable to me. And
trees don't suffer from unlucky hash behavior.
And for the record, my plan would be to put the top of the tree directly in
the inode struct, and then to queue all cached pages that belong to such an
inode in the per-inode cache-tree. So there would be no need to always
check the inode field in find_inode(), and reading small files would be
_dramatically_ fast and immediate even if there were 5 GB of cache
allocated. I think this behavior will become interesting.
Maybe you are 100% right in telling me that RB-trees will lose, but I
would like to try it out someday...
Andrea Arcangeli
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 21:22 ` Stephen C. Tweedie
1999-04-06 22:19 ` Ingo Molnar
@ 1999-04-06 22:31 ` Andrea Arcangeli
1 sibling, 0 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-06 22:31 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Chuck Lever, linux-kernel, linux-mm
On Tue, 6 Apr 1999, Stephen C. Tweedie wrote:
>> -#define i (((unsigned long) inode)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
>> +#define i (((unsigned long) inode-PAGE_OFFSET)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
>
>This just ends up adding or subtracting a constant to the hash function,
>so won't have any effect at all on the occupancy distribution of the
>hash buckets.
My point is that PAGE_HASH_BITS is less than 32/2.
Andrea Arcangeli
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 22:19 ` Ingo Molnar
@ 1999-04-06 22:40 ` David Miller
1999-04-06 22:49 ` Ingo Molnar
1999-04-08 8:09 ` Carlo Daffara
1 sibling, 1 reply; 51+ messages in thread
From: David Miller @ 1999-04-06 22:40 UTC (permalink / raw)
To: mingo; +Cc: sct, andrea, cel, linux-kernel, linux-mm
On Tue, 6 Apr 1999, Stephen C. Tweedie wrote:
-#define i (((unsigned long) inode)/(sizeof(struct inode) \
& ~ (sizeof(struct inode) - 1)))
+#define i (((unsigned long) inode-PAGE_OFFSET)/(sizeof(struct inode) \
& ~ (sizeof(struct inode) - 1)))
...
btw. shouldnt it rather be something like:
#define log2(x) \
Look at the code that just the 'i' in question will output :-)
mov inode, %o0
srlx %o0, 3, %o4
So on sparc64 at least, it amounts to "inode >> 3". So:
(sizeof(struct inode) & ~ (sizeof(struct inode) - 1))
is 8 on sparc64. The 'i' construct is just meant to get rid of the
"non significant" lower bits of the inode pointer and it does so very
nicely. :-)
Later,
David S. Miller
davem@redhat.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 22:40 ` David Miller
@ 1999-04-06 22:49 ` Ingo Molnar
1999-04-06 22:53 ` David Miller
0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 1999-04-06 22:49 UTC (permalink / raw)
To: David Miller; +Cc: sct, andrea, cel, linux-kernel, linux-mm
On Tue, 6 Apr 1999, David Miller wrote:
> #define log2(x) \
>
> Look at the code just the 'i' in question will output :-)
>
> mov inode, %o0
> srlx %o0, 3, %o4
>
> So on sparc64 atleast, it amounts to "inode >> 3". So:
(yep it's the same on x86 too, given sizeof(struct inode) == 0x110)
> (sizeof(struct inode) & ~ (sizeof(struct inode) - 1))
>
> is 8 on sparc64. The 'i' construct is just meant to get rid of the
> "non significant" lower bits of the inode pointer and it does so very
> nicely. :-)
but it's not just the lower 3 bits that are insignificant. It's 8
insignificant bits. (8.1 actually :) It should be 'inode >> 8' (which is
done by the log2 solution). Unless i'm misunderstanding something. The
thing we are AFAIU trying to avoid is 'x/sizeof(inode)', which can be a
costly division. So i've substituted it with
'x/nearest_power_of_2(sizeof(inode))', which is just as good at getting rid
of insignificant bits.
-- mingo
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 22:49 ` Ingo Molnar
@ 1999-04-06 22:53 ` David Miller
1999-04-07 15:59 ` Gabriel Paubert
0 siblings, 1 reply; 51+ messages in thread
From: David Miller @ 1999-04-06 22:53 UTC (permalink / raw)
To: mingo; +Cc: sct, andrea, cel, linux-kernel, linux-mm
It should be 'inode >> 8' (which is done by the log2
solution). Unless i'm misunderstanding something.
Consider that:
(((unsigned long) inode) >> (sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
sort of approximates this and avoids the funny looking log2 macro. :-)
Later,
David S. Miller
davem@redhat.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 20:26 ` Andrea Arcangeli
@ 1999-04-07 5:00 ` Eric W. Biederman
1999-04-07 11:36 ` Andrea Arcangeli
0 siblings, 1 reply; 51+ messages in thread
From: Eric W. Biederman @ 1999-04-07 5:00 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Chuck Lever, linux-kernel, linux-mm
>>>>> "AA" == Andrea Arcangeli <andrea@e-mind.com> writes:
AA> Does something like this look better?
yes. But:
#define __l1_cache_aligned __attribute__((aligned (L1_CACHE_BYTES)))
...
}
#ifdef SMP
__l1_cache_aligned
#endif
mem_map_t;
Looks even better, and is even maintainable.
Note: I haven't actually tried to compile this fragment ...
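(A quick user-space check of the attribute approach; L1_CACHE_BYTES is an assumed value and the struct is a shrunken stand-in, not the real mem_map_t:)

```c
#include <assert.h>

#define L1_CACHE_BYTES 32		/* assumed; arch-dependent */
#define __l1_cache_aligned __attribute__((aligned(L1_CACHE_BYTES)))

/* The attribute both aligns the type and pads sizeof up to a multiple
 * of the cache line, so the per-field padding arithmetic disappears. */
typedef struct {
	void *next, *prev;
	unsigned long offset;
} __l1_cache_aligned mem_map_like_t;
```

Because sizeof of an aligned type is rounded up to its alignment, an array of these never straddles field boundaries across cache lines.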
Eric
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5]
1999-04-05 15:56 ` Andrea Arcangeli
@ 1999-04-07 11:28 ` Andrea Arcangeli
1999-04-07 13:06 ` Stephen C. Tweedie
` (2 more replies)
0 siblings, 3 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-07 11:28 UTC (permalink / raw)
To: Mark Hemment, Chuck Lever
Cc: linux-kernel, linux-mm, Linus Torvalds, Stephen C. Tweedie
On Mon, 5 Apr 1999, Andrea Arcangeli wrote:
>> This does double the size of the page-hash, and would require profiling
>>to determine if it is worthwhile.
>
>I don't think it's worthwhile, simply because most of the time you'll have
>to walk at _worst_ one or two chains of the hash entry. And walking
>one or two chains will be far lighter than the new mechanism.
Last night I had a new idea on how to cleanly and trivially avoid the
double cache query. We only need to know whether or not we slept in GFP; we
never need to check whether pages have been added to our hash bucket. It's
far simpler, and the real-world case will perform better (yes, while
swapping out heavily we'll lose a bit of performance, but when we are I/O
bound an additional query on the cache won't hurt at all, and thanks to the
simpler design we'll get more performance and memory for all the good,
non-swapping time). And last but not least: it will keep working without
changes even if the page cache changes its internal structure ;)).
Changing GFP is not an option (it would add overhead to the normal case,
when we are not interested in such info), so I had the idea of adding a
gfp_sleeping_cookie that is increased every time GFP _may_ sleep. It adds
zero overhead but lets us know whether we _may_ have slept or not.
I am running the code while writing this and I'll release very soon a
2.2.5_arca8.bz2 with this new improvement included.
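(The protocol, reduced to a user-space sketch — gfp_sleeping_cookie is the patch's name, the helpers around it are hypothetical:)

```c
#include <assert.h>

/* Any path that may sleep bumps the global counter; an allocating caller
 * snapshots it before __get_free_page() and repeats the page-cache
 * lookup only when the snapshot has gone stale. */
static int gfp_sleeping_cookie;

static void possibly_sleeping_path(void)
{
	/* stands in for prune_dcache(), swap-out, etc. */
	gfp_sleeping_cookie++;
}

/* Returns nonzero when the earlier cache lookup must be repeated. */
static int must_requery(int snapshot)
{
	return snapshot != gfp_sleeping_cookie;
}
```

In the common case where the allocation succeeds without entering a sleeping path, the snapshot still matches and the second __find_page() is skipped entirely.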
Here is the diff against clean 2.2.5:
Index: fs/dcache.c
===================================================================
RCS file: /var/cvs/linux/fs/dcache.c,v
retrieving revision 1.1.1.6
diff -u -r1.1.1.6 dcache.c
--- dcache.c 1999/03/24 00:47:09 1.1.1.6
+++ linux/fs/dcache.c 1999/04/07 01:12:46
@@ -285,6 +285,7 @@
*/
void prune_dcache(int count)
{
+ gfp_sleeping_cookie++;
for (;;) {
struct dentry *dentry;
struct list_head *tmp = dentry_unused.prev;
Index: include/linux/mm.h
===================================================================
RCS file: /var/cvs/linux/include/linux/mm.h,v
retrieving revision 1.1.1.4
diff -u -r1.1.1.4 mm.h
--- mm.h 1999/03/09 00:55:28 1.1.1.4
+++ linux/include/linux/mm.h 1999/04/07 01:11:43
@@ -259,6 +259,9 @@
#define __get_dma_pages(gfp_mask, order) __get_free_pages((gfp_mask) | GFP_DMA,(order))
extern unsigned long FASTCALL(__get_free_pages(int gfp_mask, unsigned long gfp_order));
+/* automagically increase the gfp-cookie if GFP slept -arca */
+extern int gfp_sleeping_cookie;
+
extern inline unsigned long get_free_page(int gfp_mask)
{
unsigned long page;
Index: ipc/shm.c
===================================================================
RCS file: /var/cvs/linux/ipc/shm.c,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 shm.c
--- shm.c 1999/01/18 01:27:58 1.1.1.1
+++ linux/ipc/shm.c 1999/04/07 01:11:43
@@ -800,6 +800,7 @@
if (atomic_read(&mem_map[MAP_NR(pte_page(page))].count) != 1)
goto check_table;
shp->shm_pages[idx] = swap_nr;
+ gfp_sleeping_cookie++;
rw_swap_page_nocache (WRITE, swap_nr, (char *) pte_page(page));
free_page(pte_page(page));
swap_successes++;
Index: mm/filemap.c
===================================================================
RCS file: /var/cvs/linux/mm/filemap.c,v
retrieving revision 1.1.1.8
diff -u -r1.1.1.8 filemap.c
--- filemap.c 1999/03/24 00:53:22 1.1.1.8
+++ linux/mm/filemap.c 1999/04/07 01:11:43
@@ -272,16 +272,29 @@
struct page ** hash;
offset &= PAGE_MASK;
+ if (offset >= inode->i_size)
+ goto out;
+
+ hash = page_hash(inode, offset);
+ page = __find_page(inode, offset, *hash);
+
+ if (page)
+ {
+ /* fast path -arca */
+ release_page(page);
+ return page_cache;
+ }
+
switch (page_cache) {
+ int gfp_cookie;
case 0:
+ gfp_cookie = gfp_sleeping_cookie;
page_cache = __get_free_page(GFP_USER);
if (!page_cache)
break;
+ if (gfp_cookie != gfp_sleeping_cookie)
+ page = __find_page(inode, offset, *hash);
default:
- if (offset >= inode->i_size)
- break;
- hash = page_hash(inode, offset);
- page = __find_page(inode, offset, *hash);
if (!page) {
/*
* Ok, add the new page to the hash-queues...
@@ -293,6 +306,7 @@
}
release_page(page);
}
+ out:
return page_cache;
}
@@ -586,6 +600,7 @@
size_t pos, pgpos, page_cache;
int reada_ok;
int max_readahead = get_max_readahead(inode);
+ struct page **hash;
page_cache = 0;
@@ -630,8 +645,9 @@
filp->f_ramax = max_readahead;
}
+ hash = page_hash(inode, pos & PAGE_MASK);
for (;;) {
- struct page *page, **hash;
+ struct page *page;
if (pos >= inode->i_size)
break;
@@ -639,7 +655,6 @@
/*
* Try to find the data in the page cache..
*/
- hash = page_hash(inode, pos & PAGE_MASK);
page = __find_page(inode, pos & PAGE_MASK, *hash);
if (!page)
goto no_cached_page;
@@ -696,15 +711,19 @@
* page..
*/
if (!page_cache) {
+ int gfp_cookie = gfp_sleeping_cookie;
page_cache = __get_free_page(GFP_USER);
- /*
- * That could have slept, so go around to the
- * very beginning..
- */
- if (page_cache)
+ if (!page_cache)
+ {
+ desc->error = -ENOMEM;
+ break;
+ }
+ if (gfp_cookie != gfp_sleeping_cookie)
+ /*
+ * We slept, so go around to the
+ * very beginning..
+ */
continue;
- desc->error = -ENOMEM;
- break;
}
/*
@@ -941,6 +960,7 @@
unsigned long offset, reada, i;
struct page * page, **hash;
unsigned long old_page, new_page;
+ int gfp_cookie;
new_page = 0;
offset = (address & PAGE_MASK) - area->vm_start + area->vm_offset;
@@ -999,18 +1019,8 @@
return new_page;
no_cached_page:
- /*
- * Try to read in an entire cluster at once.
- */
- reada = offset;
- reada >>= PAGE_SHIFT + page_cluster;
- reada <<= PAGE_SHIFT + page_cluster;
-
- for (i = 1 << page_cluster; i > 0; --i, reada += PAGE_SIZE)
- new_page = try_to_read_ahead(file, reada, new_page);
-
- if (!new_page)
- new_page = __get_free_page(GFP_USER);
+ gfp_cookie = gfp_sleeping_cookie;
+ new_page = __get_free_page(GFP_USER);
if (!new_page)
goto no_page;
@@ -1020,19 +1030,29 @@
* cache.. The page we just got may be useful if we
* can't share, so don't get rid of it here.
*/
- page = find_page(inode, offset);
- if (page)
- goto found_page;
+ if (gfp_cookie == gfp_sleeping_cookie ||
+ !(page = __find_page(inode, offset, *hash)))
+ {
+ /*
+ * Now, create a new page-cache page from the page we got
+ */
+ page = mem_map + MAP_NR(new_page);
+ new_page = 0;
+ add_to_page_cache(page, inode, offset, hash);
+ if (inode->i_op->readpage(file, page) != 0)
+ goto failure;
+ }
+
/*
- * Now, create a new page-cache page from the page we got
+ * Try to read in an entire cluster at once.
*/
- page = mem_map + MAP_NR(new_page);
- new_page = 0;
- add_to_page_cache(page, inode, offset, hash);
+ reada = offset;
+ reada >>= PAGE_SHIFT + page_cluster;
+ reada <<= PAGE_SHIFT + page_cluster;
- if (inode->i_op->readpage(file, page) != 0)
- goto failure;
+ for (i = 1 << page_cluster; i > 0; --i, reada += PAGE_SIZE)
+ new_page = try_to_read_ahead(file, reada, new_page);
goto found_page;
Index: mm/swap_state.c
===================================================================
RCS file: /var/cvs/linux/mm/swap_state.c,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 swap_state.c
--- swap_state.c 1999/01/18 01:27:02 1.1.1.1
+++ linux/mm/swap_state.c 1999/04/07 01:11:43
@@ -285,6 +285,7 @@
{
struct page *found_page = 0, *new_page;
unsigned long new_page_addr;
+ int gfp_cookie;
#ifdef DEBUG_SWAP
printk("DebugVM: read_swap_cache_async entry %08lx%s\n",
@@ -302,6 +303,7 @@
if (found_page)
goto out_free_swap;
+ gfp_cookie = gfp_sleeping_cookie;
new_page_addr = __get_free_page(GFP_USER);
if (!new_page_addr)
goto out_free_swap; /* Out of memory */
@@ -310,8 +312,8 @@
/*
* Check the swap cache again, in case we stalled above.
*/
- found_page = lookup_swap_cache(entry);
- if (found_page)
+ if (gfp_cookie != gfp_sleeping_cookie &&
+ (found_page = lookup_swap_cache(entry)))
goto out_free_page;
/*
* Add it to the swap cache and read its contents.
Index: mm/vmscan.c
===================================================================
RCS file: /var/cvs/linux/mm/vmscan.c,v
retrieving revision 1.1.1.5
diff -u -r1.1.1.5 vmscan.c
--- vmscan.c 1999/02/06 13:22:09 1.1.1.5
+++ linux/mm/vmscan.c 1999/04/07 01:11:43
@@ -20,6 +20,8 @@
#include <asm/pgtable.h>
+int gfp_sleeping_cookie;
+
/*
* The swap-out functions return 1 if they successfully
* threw something out, and we got a free page. It returns
@@ -106,6 +108,8 @@
*/
if (!(gfp_mask & __GFP_IO))
return 0;
+
+ gfp_sleeping_cookie++;
/*
* Ok, it's really dirty. That means that
I would ask Chuck to benchmark this new code (it should be an obvious
improvement).
Andrea Arcangeli
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-07 5:00 ` Eric W. Biederman
@ 1999-04-07 11:36 ` Andrea Arcangeli
0 siblings, 0 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-07 11:36 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Chuck Lever, linux-kernel, linux-mm
On 7 Apr 1999, Eric W. Biederman wrote:
>#define __l1_cache_aligned __attribute__((aligned (L1_CACHE_BYTES)))
>...
>
>}
>#ifdef SMP
>__l1_cache_aligned
>#endif
>mem_map_t;
>
>Looks even better, and is even maintainable.
Agreed, it should work fine, but I think gcc-2.7.2.3 is not able to do the
32-byte alignment (why?? could somebody confirm or refute that?). Here
egcs-1.1.1 aligns the fragment to 32 bytes fine (tried in user space just
now).
Thanks.
Andrea Arcangeli
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] arca-vm-2.2.5
1999-04-06 22:27 ` Andrea Arcangeli
@ 1999-04-07 12:27 ` Stephen C. Tweedie
1999-04-25 3:22 ` Chuck Lever
0 siblings, 1 reply; 51+ messages in thread
From: Stephen C. Tweedie @ 1999-04-07 12:27 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Stephen C. Tweedie, Doug Ledford, Chuck Lever, linux-kernel,
linux-mm, Linus Torvalds, Mark Hemment
Hi,
On Wed, 7 Apr 1999 00:27:21 +0200 (CEST), Andrea Arcangeli
<andrea@e-mind.com> said:
> It's not so obvious to me. I sure agree that an O(n) insertion/deletion is
> far too slow, but O(log(n)) for everything could be reasonable to me. And
> trees don't suffer from unlucky hash behavior.
Trees are O(log n) for insert/delete, with a high constant of
proportionality (ie. there can be quite a lot of work to be done even
for small log n). Trees also occupy more memory per node. Hashes are
O(1) for insert and delete, and are a _fast_ O(1). The page cache needs
fast insert and delete.
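(The O(1) claim is concrete in the 2.2 page hash itself: each page carries a pprev_hash back pointer, so unlink never scans the bucket. A simplified user-space sketch of that idiom:)

```c
#include <assert.h>
#include <stddef.h>

/* Chain node in the style of page->next_hash / page->pprev_hash: each
 * node records the address of the pointer that points to it, so both
 * insert and delete are a handful of pointer writes. */
struct node {
	struct node *next_hash;
	struct node **pprev_hash;
};

static void hash_add(struct node *n, struct node **bucket)
{
	if ((n->next_hash = *bucket) != NULL)
		n->next_hash->pprev_hash = &n->next_hash;
	n->pprev_hash = bucket;
	*bucket = n;
}

static void hash_del(struct node *n)
{
	if (n->next_hash)
		n->next_hash->pprev_hash = n->pprev_hash;
	*n->pprev_hash = n->next_hash;
}
```

A balanced tree must instead rebalance on every insert and delete, which is where the higher constant factor comes from even at the same O(log n) bound.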
--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5]
1999-04-07 11:28 ` [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5] Andrea Arcangeli
@ 1999-04-07 13:06 ` Stephen C. Tweedie
1999-04-07 13:49 ` Andrea Arcangeli
1999-04-07 13:42 ` Andrea Arcangeli
1999-04-07 13:47 ` Ingo Molnar
2 siblings, 1 reply; 51+ messages in thread
From: Stephen C. Tweedie @ 1999-04-07 13:06 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mark Hemment, Chuck Lever, linux-kernel, linux-mm,
Linus Torvalds, Stephen C. Tweedie
Hi,
On Wed, 7 Apr 1999 13:28:33 +0200 (CEST), Andrea Arcangeli
<andrea@e-mind.com> said:
> Last night I had a new idea on how to cleanly and trivially avoid the
> double cache query. We only need to know whether or not we slept in GFP.
No, that's not a viable solution in the long term. We have been
steadily fine-graining the locking in the kernel by replacing the global
kernel lock with per-facility or per-resource locks, and in the future
we can expect to see the page cache also running without the global
lock.
At that point, we will need to drop any spinlocks we hold before calling
get_free_page(), because the scheduler will only drop the global lock
automatically if we sleep and we can't sleep with any other locks held.
Now, even if we _don't_ sleep, another CPU can get in to mess with the
page cache while we are doing allocation stuff.
> I am running the code while writing this and I'll release very soon a
> 2.2.5_arca8.bz2 with this new improvement included.
Fine, but it's a short-term fix. The hash-cookie mechanism has the
advantage of being suitable for a fine-grain-locked page cache in the
future.
--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org. For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5]
1999-04-07 11:28 ` [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5] Andrea Arcangeli
1999-04-07 13:06 ` Stephen C. Tweedie
@ 1999-04-07 13:42 ` Andrea Arcangeli
1999-04-07 13:47 ` Ingo Molnar
2 siblings, 0 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-07 13:42 UTC (permalink / raw)
To: Mark Hemment, Chuck Lever
Cc: linux-kernel, linux-mm, Linus Torvalds, Stephen C. Tweedie
On Wed, 7 Apr 1999, Andrea Arcangeli wrote:
>I would ask Chuck to benchmark this new code (it should be an obvious
>improvement).
Don't try it!! I have a bug in the code that is corrupting the cache
(just noticed).
I am working on a fix now.
I just noticed that kmem_cache_reap() can theoretically sleep in down(), so
here I moved the cookie++ into do_try_to_free_pages(); it will be less
fine-grained but safer... But I still have corruption, so I am looking at
filemap.c now...
BTW, why does kmem_cache_reap() need a serializing semaphore?
Andrea Arcangeli
* Re: [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5]
1999-04-07 11:28 ` [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5] Andrea Arcangeli
1999-04-07 13:06 ` Stephen C. Tweedie
1999-04-07 13:42 ` Andrea Arcangeli
@ 1999-04-07 13:47 ` Ingo Molnar
1999-04-07 14:08 ` Andrea Arcangeli
2 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 1999-04-07 13:47 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mark Hemment, Chuck Lever, linux-kernel, linux-mm,
Linus Torvalds, Stephen C. Tweedie
On Wed, 7 Apr 1999, Andrea Arcangeli wrote:
> void prune_dcache(int count)
> {
> + gfp_sleeping_cookie++;
this can be done via an existing variable, kstat.ctxsw; no need to add yet
another 'have we scheduled' flag. But the whole approach is quite flawed
and fragile: it simply relies on us holding the big kernel lock. With more
fine-grained SMP locking in that area we will have big problems preserving
that solution.
-- mingo
* Re: [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5]
1999-04-07 13:06 ` Stephen C. Tweedie
@ 1999-04-07 13:49 ` Andrea Arcangeli
0 siblings, 0 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-07 13:49 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Mark Hemment, Chuck Lever, linux-kernel, linux-mm, Linus Torvalds
On Wed, 7 Apr 1999, Stephen C. Tweedie wrote:
>At that point, we will need to drop any spinlocks we hold before calling
>get_free_page(), because the scheduler will only drop the global lock
>automatically if we sleep and we can't sleep with any other locks held.
>Now, even if we _don't_ sleep, another CPU can get in to mess with the
>page cache while we are doing allocation stuff.
Yes, agreed.
Andrea Arcangeli
* Re: [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5]
1999-04-07 13:47 ` Ingo Molnar
@ 1999-04-07 14:08 ` Andrea Arcangeli
0 siblings, 0 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-07 14:08 UTC (permalink / raw)
To: Ingo Molnar
Cc: Mark Hemment, Chuck Lever, linux-kernel, linux-mm,
Linus Torvalds, Stephen C. Tweedie
On Wed, 7 Apr 1999, Ingo Molnar wrote:
>this can be done via an existing variable, kstat.ctxsw, no need to add yet
>another 'have we scheduled' flag. But the whole approach is quite flawed
I thought about ctxsw too, but kstat.context_swtch might not be incremented
(if next == prev), and if we released the lock we must go back and find the
page again.
Andrea Arcangeli
* Re: [patch] arca-vm-2.2.5
1999-04-06 22:53 ` David Miller
@ 1999-04-07 15:59 ` Gabriel Paubert
1999-04-07 21:07 ` Arvind Sankar
0 siblings, 1 reply; 51+ messages in thread
From: Gabriel Paubert @ 1999-04-07 15:59 UTC (permalink / raw)
To: davem; +Cc: mingo, sct, andrea, cel, linux-kernel, linux-mm
On Tue, 6 Apr 1999, David Miller wrote:
> Date: Wed, 7 Apr 1999 00:49:18 +0200 (CEST)
> From: Ingo Molnar <mingo@chiara.csoma.elte.hu>
>
> It should be 'inode >> 8' (which is done by the log2
> solution). Unless i'm misunderstanding something.
>
> Consider that:
>
> (((unsigned long) inode) >> (sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
>
> sort of approximates this and avoids the funny looking log2 macro. :-)
May I disagree? Compute this expression in the case sizeof(struct inode)
is a large power of 2, say 0x100: the shift count becomes (0x100 & ~0xff),
or 0x100. Shifts by amounts larger than or equal to the word size are
undefined in C AFAIR (and in practice most architectures take the shift
count modulo some power of 2).
I have quite often needed a log2 estimate of integer values, but I don't
know of any trick or expression to make it fast on machines which don't
have an instruction counting the number of leading zero bits.
It is trivial to count the number of trailing zero bits if you have an
instruction which counts the leading zero bits, but not the other way
around.
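That reduction (a trailing-zero count built from a leading-zero count) looks like this in C, using GCC's __builtin_clz as a stand-in for the hardware instruction:

```c
#include <assert.h>

/* Count trailing zero bits of a nonzero 32-bit value using only a
 * count-leading-zeros primitive: x & -x isolates the lowest set bit,
 * and clz then gives that bit's position from the top. */
static int ctz32(unsigned int x)
{
    return 31 - __builtin_clz(x & -x);
}
```

Note that, like the underlying instruction on many machines, this is undefined for x == 0.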
Regards,
Gabriel.
* Re: [patch] arca-vm-2.2.5
1999-04-07 15:59 ` Gabriel Paubert
@ 1999-04-07 21:07 ` Arvind Sankar
1999-04-09 6:58 ` Eric W. Biederman
0 siblings, 1 reply; 51+ messages in thread
From: Arvind Sankar @ 1999-04-07 21:07 UTC (permalink / raw)
To: Gabriel Paubert, davem; +Cc: mingo, sct, andrea, cel, linux-kernel, linux-mm
On Wed, Apr 07, 1999 at 05:59:04PM +0200, Gabriel Paubert wrote:
>
>
> On Tue, 6 Apr 1999, David Miller wrote:
>
> > Date: Wed, 7 Apr 1999 00:49:18 +0200 (CEST)
> > From: Ingo Molnar <mingo@chiara.csoma.elte.hu>
> >
> > It should be 'inode >> 8' (which is done by the log2
> > solution). Unless i'm misunderstanding something.
> >
> > Consider that:
> >
> > (((unsigned long) inode) >> (sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
> >
> > sort of approximates this and avoids the funny looking log2 macro. :-)
>
> May I disagree ? Compute this expression in the case sizeof(struct inode)
> is a large power of 2. Say 0x100, the shift count becomes (0x100 & ~0xff),
> or 0x100. Shifts by amounts larger than or equal to the word size are
> undefined in C AFAIR (and in practice on most architectures which take
> the shift count modulo some power of 2).
>
A typo there, I guess: the >> should be an integer division. Since the
divisor is a constant power of 2, the compiler will optimize it into a
shift.
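For illustration (with 256 standing in for a sizeof(struct inode) that happens to be a power of two), the corrected expression reduces to a shift:

```c
#include <assert.h>

/* With the '>>' corrected to '/', a constant power-of-two divisor is
 * compiled down to a right shift: addr / 256 == addr >> 8. */
static unsigned long inode_bucket(unsigned long addr)
{
    return addr / 256;
}
```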
-- arvind
* Re: [patch] arca-vm-2.2.5
1999-04-06 22:19 ` Ingo Molnar
1999-04-06 22:40 ` David Miller
@ 1999-04-08 8:09 ` Carlo Daffara
1 sibling, 0 replies; 51+ messages in thread
From: Carlo Daffara @ 1999-04-08 8:09 UTC (permalink / raw)
To: andreae; +Cc: linux-kernel, linux-mm
I have tried your arca-vm patches, and must say that this is by far the
best for interactive response. We are making a custom low-cost workstation
based on GNOME (with only 32M RAM), and 2.2.5 with your patch makes it
perfectly usable even with multiple applets and Netscape running.
I have seen no strange side effects, even after VERY heavy usage.
Any chance that your VM will be part of a future 2.2 or (eventually) 2.3?
Maybe as a "pluggable" VM?
best regards, and keep up the great work!!
Carlo Daffara
Conecta Telematica
PS: by the way, 2.2.5 compiled with egcs 2.91.66 is perfectly fine...
* Re: [patch] arca-vm-2.2.5
1999-04-07 21:07 ` Arvind Sankar
@ 1999-04-09 6:58 ` Eric W. Biederman
1999-04-09 9:27 ` Gabriel Paubert
0 siblings, 1 reply; 51+ messages in thread
From: Eric W. Biederman @ 1999-04-09 6:58 UTC (permalink / raw)
To: Arvind Sankar
Cc: Gabriel Paubert, davem, mingo, sct, andrea, cel, linux-kernel, linux-mm
>>>>> "AS" == Arvind Sankar <arvinds@MIT.EDU> writes:
AS> On Wed, Apr 07, 1999 at 05:59:04PM +0200, Gabriel Paubert wrote:
>>
>>
>> On Tue, 6 Apr 1999, David Miller wrote:
>>
>> > Date: Wed, 7 Apr 1999 00:49:18 +0200 (CEST)
>> > From: Ingo Molnar <mingo@chiara.csoma.elte.hu>
>> >
>> > It should be 'inode >> 8' (which is done by the log2
>> > solution). Unless i'm misunderstanding something.
>> >
>> > Consider that:
>> >
>> > (((unsigned long) inode) >> (sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
>> >
>> > sort of approximates this and avoids the funny looking log2 macro. :-)
>>
>> May I disagree ? Compute this expression in the case sizeof(struct inode)
>> is a large power of 2. Say 0x100, the shift count becomes (0x100 & ~0xff),
>> or 0x100. Shifts by amounts larger than or equal to the word size are
>> undefined in C AFAIR (and in practice on most architectures which take
>> the shift count modulo some power of 2).
>>
AS> typo there, I guess. the >> should be an integer division. Since the divisor is
AS> a constant power of 2, the compiler will optimize it into a shift.
Actually I believe:
#define DIVISOR(x) (x & ~((x >> 1) | ~(x >> 1)))
(((unsigned long) inode) / DIVISOR(sizeof(struct inode)))
Is the magic formula.
A smart compiler can figure out the shift, and the DIVISOR macro
makes x into a power of two.
Eric
* Re: [patch] arca-vm-2.2.5
1999-04-09 6:58 ` Eric W. Biederman
@ 1999-04-09 9:27 ` Gabriel Paubert
1999-04-09 15:40 ` Eric W. Biederman
0 siblings, 1 reply; 51+ messages in thread
From: Gabriel Paubert @ 1999-04-09 9:27 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Arvind Sankar, davem, mingo, sct, andrea, cel, linux-kernel, linux-mm
On 9 Apr 1999, Eric W. Biederman wrote:
> AS> typo there, I guess. the >> should be an integer division. Since the divisor is
> AS> a constant power of 2, the compiler will optimize it into a shift.
>
> Actually I believe:
> #define DIVISOR(x) (x & ~((x >> 1) | ~(x >> 1)))
^^^^^^^^^^^^^^^^^^^^^^^
Interesting formula. Unless I'm wrong, set y = x>>1 and evaluate it again:
~(y | ~y)
which should give zero on any binary machine. So I think you've come up
with a sophisticated way of generating an OOPS :-) at least on
architectures which don't silently and happily divide by zero.
I've needed it quite often but I don't know of any short formula which
computes the log2 of an integer (whether rounded up or down does not
matter) with standard C operators.
I've been looking for it quite carefully, and I don't even think that it
is possible: there is an example of how to do this in the HP-UX assembler
documentation; it takes 18 machine instructions (no loops, performing a
binary search), and it was written by people who, as you would expect,
know the tricks of the architecture very well (cascaded conditional
nullification to perform a branchless if-then-else: the 'then' branch is
a shift, the 'else' branch is an add).
Code size and time are not the problem here, since it is a compile-time
constant, but trying to find a simple C expression to approximate log2 is
probably futile (and not worth the effort in this case).
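For comparison, the binary-search idea renders in a few lines of C (a sketch of the same divide-and-conquer approach, not the HP-UX instruction sequence; any branchlessness is left to the compiler):

```c
#include <assert.h>

/* floor(log2(x)) for nonzero 32-bit x by binary search: at each step,
 * if the upper half is nonempty, shift it down and record the step. */
static int ilog2(unsigned int x)
{
    int r = 0, s;
    for (s = 16; s > 0; s >>= 1)
        if (x >> s) { x >>= s; r += s; }
    return r;
}
```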
Gabriel.
* Re: [patch] arca-vm-2.2.5
1999-04-09 9:27 ` Gabriel Paubert
@ 1999-04-09 15:40 ` Eric W. Biederman
0 siblings, 0 replies; 51+ messages in thread
From: Eric W. Biederman @ 1999-04-09 15:40 UTC (permalink / raw)
To: Gabriel Paubert
Cc: Eric W. Biederman, Arvind Sankar, davem, mingo, sct, andrea, cel,
linux-kernel, linux-mm
>>>>> "GP" == Gabriel Paubert <paubert@iram.es> writes:
GP> On 9 Apr 1999, Eric W. Biederman wrote:
AS> typo there, I guess. the >> should be an integer division. Since the divisor is
AS> a constant power of 2, the compiler will optimize it into a shift.
>>
>> Actually I believe:
>> #define DIVISOR(x) (x & ~((x >> 1) | ~(x >> 1)))
GP> ^^^^^^^^^^^^^^^^^^^^^^^
GP> interesting formula. Unless I'm wrong, set y=x>>1 and evaluate it again:
GP> ~(y | ~y)
GP> which should give zero on any binary machine.
Duh. I forgot that the high bits got set.
GP> So I think you've come up
GP> with a sophisticated way of generating an OOPS :-) at least on
GP> architectures which don't silently and happily divide by zero.
GP> I've needed it quite often but I don't know of any short formula which
GP> computes the log2 of an integer (whether rounded up or down does not
GP> matter) with standard C operators.
Well, I wasn't trying for log2, but rather for truncating to the nearest
power of two, in a form that the compiler could compute at compile time.
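A known-correct way to do that truncation without a loop (and so computable at compile time for a constant) is to smear the top set bit downward and then keep only it. This is a standard bit trick, not the formula from the mail above:

```c
#include <assert.h>

/* Round a nonzero 32-bit value down to the nearest power of two:
 * propagate the highest set bit into every lower position, then
 * subtract away everything below it. */
static unsigned int pow2_floor(unsigned int x)
{
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x - (x >> 1);
}
```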
GP> I've been looking for it quite carefully, and I don't even think that it
GP> is possible: there is an example on how to do this in the HP-UX assembler
GP> documentation, it takes 18 machine instructions (no loops, performing a
GP> binary search) and it has been written by people who, as you would
GP> expect, know very well the tricks of the architecture (cascaded
GP> conditional nullification to perform branchless if then else: the then
GP> branch is a shift, the else branch is an add).
GP> Code size or time are not the problem here since it is a compile time
GP> constant, but trying to find a simple C expression to approximate log2 is
GP> probably futile (and not worth the effort in this case).
True but it is fun :)
Eric
* Re: [patch] arca-vm-2.2.5
1999-04-05 13:23 ` Mark Hemment
1999-04-05 15:56 ` Andrea Arcangeli
1999-04-05 20:24 ` [patch] arca-vm-2.2.5 Horst von Brand
@ 1999-04-17 11:12 ` Andrea Arcangeli
2 siblings, 0 replies; 51+ messages in thread
From: Andrea Arcangeli @ 1999-04-17 11:12 UTC (permalink / raw)
To: Mark Hemment
Cc: linux-kernel, linux-mm, David S. Miller, Stephen C. Tweedie, Chuck Lever
On Mon, 5 Apr 1999, Mark Hemment wrote:
> The page structure needs to be as small as possible. If its size
>happens to L1 align, then that is great, but otherwise it isn't worth the
>effort - the extra memory used to store the "padding" is much better used
>elsewhere.
> Most accesses to the page struct are reads; this means it can live in
>the Shared state across multiple L1 caches. The "slightly" common
>operation of incrementing the ref-count/changing flag bits doesn't really
>come into play often enough to matter.
> Keeping the struct small can result in the part of the page struct of
>interest being in the L1 cache, along with part of the next one. As it
>isn't a heavily modified structure, with no spin locks, "false sharing"
>isn't a problem. Besides, the VM isn't threaded, so it isn't going to be
>playing ping-pong with the cache lines anyway.
I think the same thing applies to many places where we are using the slab
and are right now requesting L1 cache alignment, but where the code is not
going to be SMP threaded for real (no spinlocks in the struct and/or
everything protected by the big kernel lock).
Andrea Arcangeli
* Re: [patch] arca-vm-2.2.5
1999-04-07 12:27 ` Stephen C. Tweedie
@ 1999-04-25 3:22 ` Chuck Lever
0 siblings, 0 replies; 51+ messages in thread
From: Chuck Lever @ 1999-04-25 3:22 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Andrea Arcangeli, linux-kernel, linux-mm
On Wed, 7 Apr 1999, Stephen C. Tweedie wrote:
> On Wed, 7 Apr 1999 00:27:21 +0200 (CEST), Andrea Arcangeli
> <andrea@e-mind.com> said:
>
> > It's not so obvious to me. I sure agree that an O(n) insertion/deletion
> > is far too slow, but an O(log(n)) for everything could be reasonable to
> > me. And trees don't worry about unlucky hash behavior.
>
> Trees are O(log n) for insert/delete, with a high constant of
> proportionality (ie. there can be quite a lot of work to be done even
> for small log n). Trees also occupy more memory per node. Hashes are
> O(1) for insert and delete, and are a _fast_ O(1). The page cache needs
> fast insert and delete.
i had a few minutes the other day, so i extracted just the rb-tree part of
2.2.5-arca10, and benchmarked it against 2.2.5 and against the hash tuning
patch i'm working on. i ran this on our 512M 4-way Dell PowerEdge 6300.
ref - 2.2.5 kernel with 4000 process slots and b_state fix
rbt - 2.2.5 kernel with 4000 process slots and rbtree patch applied
(b_state fix *not* applied)
hash - 2.2.5 kernel with an older version of my hash tuning patch applied
(b_state fix applied)
160 concurrent scripts. all of the benchmark fits in memory.
ref: 3725.8 s=15.23
rbt: 3893.3 s=9.82
hash: 4007.3 s=15.95
"hash" tunes the page cache, the buffer cache, the dentry cache, and the
inode cache. "rbt" just replaces the page cache with per-inode rbtrees.
i think the rbtree patch compares pretty favorably for very large memory
machines.
- Chuck Lever
--
corporate: <chuckl@netscape.com>
personal: <chucklever@netscape.net> or <cel@monkey.org>
The Linux Scalability project:
http://www.citi.umich.edu/projects/linux-scalability/
end of thread, other threads:[~1999-04-25 3:22 UTC | newest]
Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1999-04-01 23:32 [patch] arca-vm-2.2.5 Andrea Arcangeli
1999-04-04 21:07 ` Chuck Lever
1999-04-05 0:22 ` Andrea Arcangeli
1999-04-05 13:23 ` Mark Hemment
1999-04-05 15:56 ` Andrea Arcangeli
1999-04-07 11:28 ` [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5] Andrea Arcangeli
1999-04-07 13:06 ` Stephen C. Tweedie
1999-04-07 13:49 ` Andrea Arcangeli
1999-04-07 13:42 ` Andrea Arcangeli
1999-04-07 13:47 ` Ingo Molnar
1999-04-07 14:08 ` Andrea Arcangeli
1999-04-05 20:24 ` [patch] arca-vm-2.2.5 Horst von Brand
1999-04-05 23:25 ` Andrea Arcangeli
1999-04-05 23:37 ` Horst von Brand
1999-04-06 1:23 ` Andrea Arcangeli
1999-04-17 11:12 ` Andrea Arcangeli
1999-04-05 21:31 ` Chuck Lever
1999-04-06 0:15 ` Andrea Arcangeli
1999-04-06 2:14 ` Doug Ledford
1999-04-06 13:04 ` Andrea Arcangeli
1999-04-06 21:31 ` Stephen C. Tweedie
1999-04-06 22:27 ` Andrea Arcangeli
1999-04-07 12:27 ` Stephen C. Tweedie
1999-04-25 3:22 ` Chuck Lever
1999-04-06 5:52 ` Chuck Lever
1999-04-06 13:09 ` Andrea Arcangeli
1999-04-06 16:19 ` Eric W. Biederman
1999-04-06 20:26 ` Andrea Arcangeli
1999-04-07 5:00 ` Eric W. Biederman
1999-04-07 11:36 ` Andrea Arcangeli
1999-04-06 14:02 ` Stephen C. Tweedie
1999-04-06 15:38 ` Chuck Lever
1999-04-06 17:16 ` Andrea Arcangeli
1999-04-06 18:07 ` Andrea Arcangeli
1999-04-06 21:22 ` Stephen C. Tweedie
1999-04-06 22:19 ` Ingo Molnar
1999-04-06 22:40 ` David Miller
1999-04-06 22:49 ` Ingo Molnar
1999-04-06 22:53 ` David Miller
1999-04-07 15:59 ` Gabriel Paubert
1999-04-07 21:07 ` Arvind Sankar
1999-04-09 6:58 ` Eric W. Biederman
1999-04-09 9:27 ` Gabriel Paubert
1999-04-09 15:40 ` Eric W. Biederman
1999-04-08 8:09 ` Carlo Daffara
1999-04-06 22:31 ` Andrea Arcangeli
1999-04-06 20:47 ` Chuck Lever
1999-04-06 21:04 ` Andrea Arcangeli
1999-04-06 21:11 ` Stephen C. Tweedie
1999-04-06 14:00 ` Stephen C. Tweedie
1999-04-06 16:29 ` Andrea Arcangeli
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox