* [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-01 23:32 UTC
To: linux-kernel, linux-mm; +Cc: Chuck Lever, Stephen C. Tweedie

Well, over the last few days I had some new design ideas for the VM (I
mean shrink_mmap() and friends). I have finished implementing them and
the result looks impressive under heavy VM load.

I would like people who run Linux under high VM load to try out my new
VM code:

	ftp://e-mind.com/pub/linux/arca-tree/2.2.5_arca2.gz

If you try it out, please send feedback... Thanks!

Andrea Arcangeli
* Re: [patch] arca-vm-2.2.5
From: Chuck Lever @ 1999-04-04 21:07 UTC
To: Andrea Arcangeli; +Cc: linux-kernel, linux-mm, Stephen C. Tweedie

On Fri, 2 Apr 1999, Andrea Arcangeli wrote:
> Well, over the last few days I had some new design ideas for the VM (I
> mean shrink_mmap() and friends). I have finished implementing them and
> the result looks impressive under heavy VM load.
>
> I would like people who run Linux under high VM load to try out my new
> VM code:
>
> ftp://e-mind.com/pub/linux/arca-tree/2.2.5_arca2.gz

I noticed a couple of differences from your original modifications right
off the bat.

First, I notice you've altered the page hash function and quadrupled the
size of the hash table. Do you have measurements/benchmarks that show
that the page hash was not working well? Can you say how a plain 2.2.5
kernel compares to one that has just the page hash changes, without the
rest of your VM modifications? The reason I ask is that I've played with
that hash table, and found that most changes to it cause undesirable
increases in system CPU utilization. Although, it *is* highly interesting
that the buffer hash table is orders of magnitude larger, yet hashes
about the same number of objects. Can someone provide history on the
design of the page hash function?

Also, can you tell what improvement you expect from the additional logic
in try_to_free_buffers()?

- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/citi-netscape/
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-05 0:22 UTC
To: Chuck Lever; +Cc: linux-kernel, linux-mm, Stephen C. Tweedie, Linus Torvalds

On Sun, 4 Apr 1999, Chuck Lever wrote:

>> ftp://e-mind.com/pub/linux/arca-tree/2.2.5_arca2.gz

*snip*

>First, I notice you've altered the page hash function and quadrupled the

The page hash function change is from Stephen (I did it here too because
I completely agreed with it). The point is that shm entries use the lower
bits of the pagemap->offset field.

>size of the hash table. Do you have measurements/benchmarks that show
>that the page hash was not working well? Can you say how a plain 2.2.5

The page_hash change looked to me like a quite obvious improvement while
swapping shm entries in/out (it improves the swap cache queries), but see
my comment below...

>kernel compares to one that has just the page hash changes, without the
>rest of your VM modifications? The reason I ask is that I've played

The reason for that is that it seemed an obvious improvement. And since
the table is statically allocated (not dynamically allocated at boot as a
function of the memory size), a somewhat larger default can be desirable:
I can safely spend a few more bits' worth of memory (some tens of kbytes)
without harming the available memory here. Well, as I have said many
times, I think someday we'll need RB-trees instead of fuzzy hashes, but
it's not a big issue right now due to the low number of pages available.

Returning to your question: in my tree I enlarged the hash table to 13
bits. This means that in the best case I'll be able to address up to 8192
pages in O(1). Here I have 32752 pages, so at worst I'd have about 4
pages chained on every hash entry. 13 bits of hash depth allocates 32k of
memory for the hash (really not an issue ;).

In the stock kernel instead the hash size is 2^11 = 2048, so in the same
situation I would have 16 pages chained on the same hash entry.

>with that hash table, and found that most changes to it cause undesirable
>increases in system CPU utilization. Although, it *is* highly interesting

Swapping shm entries in/out is not as frequent a task as doing normal
queries on the page cache. So I am removing the patch here. Thanks for
the info, I really hadn't thought about this...

For the record, this is the hash-function change we are talking about
(a reconstructed sketch of the whole function, with the change applied,
follows at the end of this message):

Index: pagemap.h
===================================================================
RCS file: /var/cvs/linux/include/linux/pagemap.h,v
retrieving revision 1.1.1.2
retrieving revision 1.1.2.12
diff -u -r1.1.1.2 -r1.1.2.12
--- pagemap.h	1999/01/23 16:29:55	1.1.1.2
+++ pagemap.h	1999/04/01 23:12:37	1.1.2.12
@@ -36,7 +35,7 @@
 #define i (((unsigned long) inode)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
 #define o (offset >> PAGE_SHIFT)
 #define s(x) ((x)+((x)>>PAGE_HASH_BITS))
-	return s(i+o) & (PAGE_HASH_SIZE-1);
+	return s(i+o+offset) & (PAGE_HASH_SIZE-1);
 #undef i
 #undef o
 #undef s

>that the buffer hash table is orders of magnitude larger, yet hashes
>about the same number of objects. Can someone provide history on the
>design of the page hash function?

I can't help you with this, but it looks OK to me ;). If somebody has
done the math on it, I'd like to try to understand it.

>Also, can you tell what improvement you expect from the additional logic
>in try_to_free_buffers()?

Eh, my shrink_mmap() is black magic and it would take a while to explain
what I was thinking ;).
Well, one of the reasons is that ext2 keeps the superblock in use all the
time, so when I reach a used buffer I put it back at the top of the LRU
list: I don't want to go to swap just because some unfreeable superblocks
live forever at the end of the pagemap lru_list.

Note also (you didn't ask about that, but I bet you noticed it ;) that in
my tree I also made every pagemap entry L1 cacheline aligned. I asked the
people who were complaining about page colouring (I still don't know
exactly what page colouring is; I only have a guess, but I would like to
read something about implementation details, pointers???) to try out my
patch to see if it made a difference, but I got no feedback :(. I also
made the irq_state entry cacheline aligned (once I understood the
cacheline issue I agreed with it).

Many thanks for commenting on and reading my new experimental code (rock
solid here). I'll now release:

	ftp://e-mind.com/pub/linux/arca-tree/2.2.5_arca4.bz2

It will have the latest stuff I did in the last days (flushtime bugfix
and sane sysctl values included too) plus the old hash function, for the
reasons you have just pointed out to me.

If you find some spare time to try out the new patch, let me know the
numbers! ;))

Andrea Arcangeli
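For reference, the full hash function with the change above applied would
read roughly as follows. This is a sketch reconstructed from the hunk and
the surrounding 2.2-era pagemap.h, not verbatim kernel source:

	/*
	 * Sketch of the 2.2-era page-cache hash with the s(i+o+offset)
	 * change applied.  Reconstructed for illustration; treat it as
	 * approximate.
	 */
	#define PAGE_HASH_BITS 11
	#define PAGE_HASH_SIZE (1 << PAGE_HASH_BITS)

	static inline unsigned long _page_hashfn(struct inode *inode,
						 unsigned long offset)
	{
	#define i (((unsigned long) inode)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
	#define o (offset >> PAGE_SHIFT)
	#define s(x) ((x)+((x)>>PAGE_HASH_BITS))
		/* adding the raw offset folds in the low bits shm stores there */
		return s(i+o+offset) & (PAGE_HASH_SIZE-1);
	#undef i
	#undef o
	#undef s
	}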
* Re: [patch] arca-vm-2.2.5
From: Mark Hemment @ 1999-04-05 13:23 UTC
To: Andrea Arcangeli; +Cc: linux-kernel, linux-mm

Hi Andrea/All,

I'm just throwing some ideas around.....

> The page_hash change looked to me like a quite obvious improvement while
> swapping shm entries in/out (it improves the swap cache queries), but see
> my comment below...

One worthwhile improvement to the page-hash is to reduce the need to
re-check the hash after a blocking operation (i.e. page allocation).

Changing the page-hash from an array of page ptrs to an array of:

	typedef struct page_hash_s {
		struct page	*ph_page;
		unsigned int	ph_cookie;
	} page_hash_t;

	page_hash_t	page_hash[PAGE_HASH_TABLE];

Whenever a new page is linked into a hash line, the ph_cookie for that
line is incremented.

Before a page allocation, take a copy of the cookie for the hash line
where the page will be inserted. If, after the allocation, the cookie
hasn't changed, then there is no reason to re-search the hash (see the
sketch after this message).

This does double the size of the page-hash, and would require profiling
to determine if it is worthwhile.

(Note: If the VM sub-system was threaded, there would be a different
solution than the above.)

> Note also (you didn't ask about that, but I bet you noticed it ;) that in
> my tree I also made every pagemap entry L1 cacheline aligned. I asked the
> people who were complaining about page colouring (I still don't know
> exactly what page colouring is; I only have a guess, but I would like to
> read something about implementation details, pointers???) to try out my
> patch to see if it made a difference, but I got no feedback :(. I also
> made the irq_state entry cacheline aligned (once I understood the
> cacheline issue I agreed with it).

Yikes!!!!

The page structure needs to be as small as possible. If its size happens
to L1 align, then that is great, but otherwise it isn't worth the effort:
the extra memory used to store the "padding" is much better used
elsewhere.

Most accesses to the page struct are reads, which means it can live in
the Shared state across multiple L1 caches. The "slightly" common
operation of incrementing the ref-count/changing flag bits doesn't really
come into play often enough to matter.

Keeping the struct small can result in the part of the page struct of
interest being in the L1 cache, along with part of the next one. As it
isn't a heavily modified structure, with no spin locks, "false sharing"
isn't a problem. Besides, the VM isn't threaded, so it isn't going to be
playing ping-pong with the cache lines anyway.

OK, with a structure smaller than the h/w cache line, whose size isn't a
power of 2 (assuming cache-line sizes only come in powers of 2), there
may need to be 2 cache-line fetches to load the required members of a
struct. The smaller the page struct, the less likely this is to happen.

If you want to reduce the page struct size, it is possible to unionise
the "inode" and "buffers" pointers into one member, indicating which is
active via bits in the flag fields. (Note: If you do this change, the
flag-bit doesn't need to be set for the Swap-Cache state; this is a
special case for anonymous pages. Looks like it drops out nicely.)

Adding an extra member to page-struct, to count I/O write operations on a
page, is useful.
When a page is undergoing I/O, it has the PG_locked bit set. Fine. If the
I/O is a write, then in many cases there is no reason to block waiting
for the I/O to finish. e.g. Under a fault for page-not-present, provided
the returned mapping for the page is read-only, why wait for the write
operation to finish? There is no need.

Why use a count for I/O write ops? Take the case of a WP fault. This does
a page allocation as its first operation, just so the following
operations occur atomically. It may be possible (I haven't checked it
through fully) to simply I/O-write-lock the faulting page; this (along
with some other code re-arrangement in do_wp_page() and shrink_mmap())
could allow the allocation to occur later, when we are more sure that it
is required.

Page-colouring is a bast*rd. Modifying the page allocator to take a
'suggested' colour into account isn't too difficult. I messed around with
this about 16 months ago (for added performance, I changed the coalescing
from eager to lazy). The problem end is page fragmentation. Some of the
fragmentation can be overcome with a more intelligent page reaper, but
going as far as adding pte-chaining is way, way, over the top. Going the
BSD 4.4 route, with linked lists of shadow objects, is possible, and gets
very interesting. :)

Some of the fragmentation problems can be handled better by dropping the
static virt-to-phys mapping and making it dynamic. Allowing the page
allocator to give out virtually contiguous pages (as well as the current
physically contiguous pages) gives it more flexibility. This has a _lot_
of issues, not least with device drivers: either they have strict
requirements on the type of memory they perform I/O on (i.e. they accept
physically contiguous only), or they need to be changed to perform
scatter/gather.

Regards,

Mark
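A minimal sketch of the cookie re-check described above;
find_page_in_line() and add_page_to_line() are hypothetical placeholders
for the real page-cache lookup/insert code:

	/*
	 * Sketch of the cookie idea: skip the second hash search after a
	 * blocking allocation when nothing was inserted into our hash line.
	 */
	struct page *lookup_or_alloc(page_hash_t *line, struct inode *inode,
				     unsigned long offset)
	{
		struct page *page = find_page_in_line(line, inode, offset);

		if (!page) {
			unsigned int cookie = line->ph_cookie;	/* snapshot */
			unsigned long addr = __get_free_page(GFP_USER); /* may sleep */

			if (!addr)
				return NULL;
			if (cookie != line->ph_cookie)	/* line changed under us? */
				page = find_page_in_line(line, inode, offset);
			if (!page) {
				page = mem_map + MAP_NR(addr);
				add_page_to_line(line, page); /* bumps ph_cookie */
			} else
				free_page(addr);	/* lost the race */
		}
		return page;
	}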
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-05 15:56 UTC
To: Mark Hemment; +Cc: linux-kernel, linux-mm

On Mon, 5 Apr 1999, Mark Hemment wrote:

>Hi Andrea/All,
>
> I'm just throwing some ideas around.....
>
>> The page_hash change looked to me like a quite obvious improvement while
>> swapping shm entries in/out (it improves the swap cache queries), but see
>> my comment below...
>
> One worthwhile improvement to the page-hash is to reduce the need to
>re-check the hash after a blocking operation (i.e. page allocation).

Yes, I thought about that some time ago.

> Changing the page-hash from an array of page ptrs to an array of:
>	typedef struct page_hash_s {
>		struct page	*ph_page;
>		unsigned int	ph_cookie;
>	} page_hash_t;
>
>	page_hash_t	page_hash[PAGE_HASH_TABLE];
>
> Whenever a new page is linked into a hash line, the ph_cookie for that
>line is incremented.
> Before a page allocation, take a copy of the cookie for the hash line
>where the page will be inserted. If, after the allocation, the cookie
>hasn't changed, then there is no reason to re-search the hash.
>
> This does double the size of the page-hash, and would require profiling
>to determine if it is worthwhile.

I don't think it's worthwhile, simply because most of the time you'll at
_worst_ have to walk one or two entries of the hash chain. And walking
one or two entries will be far lighter than the new mechanism. If you
have gigabytes of cache you will be hurt performance-wise anyway, and
you'll enlarge the hash size anyway, and so you'll again have no need for
the mechanism. But maybe I am plain wrong. If you try it, let me know the
profiling results (I am very curious). At least the benchmark will be
trivial ;). Just do a 'time cat /usr/bin/* >/dev/null' with 100 Mbyte of
cache and then with 10 Mbyte of memory usable for the cache, with the two
kernels, and you'll see if it's worthwhile.

> (Note: If the VM sub-system was threaded, there would be a different
>solution than the above.)

What is the other solution? And what is the point about threading? We are
playing these games exactly because we can sleep to allocate memory.

>> Note also (you didn't ask about that, but I bet you noticed it ;) that in
>> my tree I also made every pagemap entry L1 cacheline aligned. I asked the
>> people who were complaining about page colouring (I still don't know
>> exactly what page colouring is; I only have a guess, but I would like to
>> read something about implementation details, pointers???) to try out my
>> patch to see if it made a difference, but I got no feedback :(. I also
>> made the irq_state entry cacheline aligned (once I understood the
>> cacheline issue I agreed with it).
>
> Yikes!!!!
> The page structure needs to be as small as possible. If its size
>happens to L1 align, then that is great, but otherwise it isn't worth the
>effort: the extra memory used to store the "padding" is much better used
>elsewhere.

Note that on UP I am not enlarging the pagemap struct size. I will agree
that the padding is much better used elsewhere when somebody who
complains about page colouring shows me no difference in the numbers
between L1 cache alignment and zero padding.
> Most accesses to the page struct are reads, which means it can live in
>the Shared state across multiple L1 caches. The "slightly" common
>operation of incrementing the ref-count/changing flag bits doesn't really
>come into play often enough to matter.

Also setting the swapcache/locked/update and whatever SMP-locking flags,
etc. etc...

>problem. Besides, the VM isn't threaded, so it isn't going to be playing
>ping-pong with the cache lines anyway.

It's not threaded now, but I was just making shrink_mmap() threaded here.
With my new pagemap_lru it's trivial to make shrink_mmap() threaded (just
add some spinlocks to add_cache/del_cache/mkyoung() and release/reacquire
the kernel lock by hand).

> If you want to reduce the page struct size, it is possible to unionise
>the "inode" and "buffers" pointers into one member, indicating which is
>active via bits in the flag fields. (Note: If you do this change, the

Agreed! (A sketch of what this could look like follows at the end of this
message.)

>flag-bit doesn't need to be set for the Swap-Cache state; this is a
>special case for anonymous pages. Looks like it drops out nicely.)

I think that PG_swap_cache is simply overhead to get performance. It's
not needed right now either, because you only need to compare page->inode
with &swapper_inode to know whether or not it's a swap cache page. The
->buffers field has nothing to do with that.

> Adding an extra member to page-struct, to count I/O write operations on
>a page, is useful. When a page is undergoing I/O, it has the PG_locked
>bit set. Fine. If the I/O is a write, then in many cases there is no
>reason to block waiting for the I/O to finish. e.g. Under a fault for

Agreed, it would make sense, even if I don't know how much difference it
will make, while I am sure that the complexity will increase a good
bit...

>page not present, provided the returned mapping for the page is read-only,
>why wait for the write operation to finish? There is no need.
> Why use a count for I/O write ops? Take the case of a WP fault. This
>does a page allocation as its first operation, just so the following
>operations occur atomically. It may be possible (I haven't checked it
>through fully) to simply I/O-write-lock the faulting page; this (along
>with some other code re-arrangement in do_wp_page() and shrink_mmap())
>could allow the allocation to occur later, when we are more sure that it
>is required.

When you return from a fault, if you don't map a new page you'll get a
second fault. If you fault, you need the (maybe writable) page allocated
_now_. Maybe I am missing your point...

Many thanks for your interesting comments/thoughts.

Andrea Arcangeli

BTW, did you see my kmem_cache_destroy()? I got zero feedback about it
;). It has worked fine so far here and it's always included in my
arca-tree patches. Alexander Viro (or others? I don't remember well)
asked for it in order to avoid two-level modules.
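A sketch of the union Mark suggests; PG_has_buffers and the accessor are
hypothetical names, and the real 2.2 struct page keeps the two pointers
as separate members:

	/*
	 * Sketch of unionising page->inode and page->buffers, with a flag
	 * bit saying which member is live.
	 */
	typedef struct page {
		/* ... list pointers, offset, next_hash, count ... */
		unsigned long flags;	/* includes which union member is live */
		union {
			struct inode *inode;		/* page-cache page */
			struct buffer_head *buffers;	/* buffer-only page */
		} u;
	} mem_map_t;

	#define PG_has_buffers		15	/* hypothetical flag bit */
	#define PageHasBuffers(page)	(test_bit(PG_has_buffers, &(page)->flags))

	static inline struct buffer_head *page_buffers(struct page *page)
	{
		return PageHasBuffers(page) ? page->u.buffers : NULL;
	}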
* [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5]
From: Andrea Arcangeli @ 1999-04-07 11:28 UTC
To: Mark Hemment, Chuck Lever
Cc: linux-kernel, linux-mm, Linus Torvalds, Stephen C. Tweedie

On Mon, 5 Apr 1999, Andrea Arcangeli wrote:

>> This does double the size of the page-hash, and would require profiling
>>to determine if it is worthwhile.
>
>I don't think it's worthwhile, simply because most of the time you'll at
>_worst_ have to walk one or two entries of the hash chain. And walking
>one or two entries will be far lighter than the new mechanism.

Last night I had a new idea on how to cleanly and trivially avoid the
double cache query. We only need to know whether or not we slept in GFP.
We never need to care whether some pages have been added to our hash
bucket. It's far simpler, and the real-world case will perform better
(yes, while swapping out heavily we'll lose a bit of performance, but
when we are I/O bound an additional query on the cache won't hurt at all,
and thanks to the simpler design we'll get more performance and memory
for all the good non-swapping time). And last but not least: it will keep
working without changes even if the page cache changes its internal
structure ;)).

Changing GFP is not an option (it would add overhead to the normal case,
when we are not interested in such info), so I had the idea of adding a
gfp_sleeping_cookie, incremented every time GFP _may_ sleep. It adds
essentially zero overhead, but it lets us know whether or not we _may_
have slept.

I am running the code while writing this, and I'll very soon release a
2.2.5_arca8.bz2 with this new improvement included.
Here is the diff against clean 2.2.5:

Index: fs/dcache.c
===================================================================
RCS file: /var/cvs/linux/fs/dcache.c,v
retrieving revision 1.1.1.6
diff -u -r1.1.1.6 dcache.c
--- dcache.c	1999/03/24 00:47:09	1.1.1.6
+++ linux/fs/dcache.c	1999/04/07 01:12:46
@@ -285,6 +285,7 @@
  */
 void prune_dcache(int count)
 {
+	gfp_sleeping_cookie++;
 	for (;;) {
 		struct dentry *dentry;
 		struct list_head *tmp = dentry_unused.prev;
Index: include/linux/mm.h
===================================================================
RCS file: /var/cvs/linux/include/linux/mm.h,v
retrieving revision 1.1.1.4
diff -u -r1.1.1.4 mm.h
--- mm.h	1999/03/09 00:55:28	1.1.1.4
+++ linux/include/linux/mm.h	1999/04/07 01:11:43
@@ -259,6 +259,9 @@
 #define __get_dma_pages(gfp_mask, order) __get_free_pages((gfp_mask) | GFP_DMA,(order))
 extern unsigned long FASTCALL(__get_free_pages(int gfp_mask, unsigned long gfp_order));
 
+/* automagically increase the gfp-cookie if GFP slept -arca */
+extern int gfp_sleeping_cookie;
+
 extern inline unsigned long get_free_page(int gfp_mask)
 {
 	unsigned long page;
Index: ipc/shm.c
===================================================================
RCS file: /var/cvs/linux/ipc/shm.c,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 shm.c
--- shm.c	1999/01/18 01:27:58	1.1.1.1
+++ linux/ipc/shm.c	1999/04/07 01:11:43
@@ -800,6 +800,7 @@
 	if (atomic_read(&mem_map[MAP_NR(pte_page(page))].count) != 1)
 		goto check_table;
 	shp->shm_pages[idx] = swap_nr;
+	gfp_sleeping_cookie++;
 	rw_swap_page_nocache (WRITE, swap_nr, (char *) pte_page(page));
 	free_page(pte_page(page));
 	swap_successes++;
Index: mm/filemap.c
===================================================================
RCS file: /var/cvs/linux/mm/filemap.c,v
retrieving revision 1.1.1.8
diff -u -r1.1.1.8 filemap.c
--- filemap.c	1999/03/24 00:53:22	1.1.1.8
+++ linux/mm/filemap.c	1999/04/07 01:11:43
@@ -272,16 +272,29 @@
 	struct page ** hash;
 
 	offset &= PAGE_MASK;
+	if (offset >= inode->i_size)
+		goto out;
+
+	hash = page_hash(inode, offset);
+	page = __find_page(inode, offset, *hash);
+
+	if (page)
+	{
+		/* fast path -arca */
+		release_page(page);
+		return page_cache;
+	}
+
 	switch (page_cache) {
+		int gfp_cookie;
 	case 0:
+		gfp_cookie = gfp_sleeping_cookie;
 		page_cache = __get_free_page(GFP_USER);
 		if (!page_cache)
 			break;
+		if (gfp_cookie != gfp_sleeping_cookie)
+			page = __find_page(inode, offset, *hash);
 	default:
-		if (offset >= inode->i_size)
-			break;
-		hash = page_hash(inode, offset);
-		page = __find_page(inode, offset, *hash);
 		if (!page) {
 			/*
 			 * Ok, add the new page to the hash-queues...
@@ -293,6 +306,7 @@
 		}
 		release_page(page);
 	}
+ out:
 	return page_cache;
 }
@@ -586,6 +600,7 @@
 	size_t pos, pgpos, page_cache;
 	int reada_ok;
 	int max_readahead = get_max_readahead(inode);
+	struct page **hash;
 
 	page_cache = 0;
@@ -630,8 +645,9 @@
 			filp->f_ramax = max_readahead;
 	}
 
+	hash = page_hash(inode, pos & PAGE_MASK);
 	for (;;) {
-		struct page *page, **hash;
+		struct page *page;
 
 		if (pos >= inode->i_size)
 			break;
@@ -639,7 +655,6 @@
 		/*
 		 * Try to find the data in the page cache..
 		 */
-		hash = page_hash(inode, pos & PAGE_MASK);
 		page = __find_page(inode, pos & PAGE_MASK, *hash);
 		if (!page)
 			goto no_cached_page;
@@ -696,15 +711,19 @@
 		 * page..
 		 */
 		if (!page_cache) {
+			int gfp_cookie = gfp_sleeping_cookie;
 			page_cache = __get_free_page(GFP_USER);
-			/*
-			 * That could have slept, so go around to the
-			 * very beginning..
-			 */
-			if (page_cache)
+			if (!page_cache)
+			{
+				desc->error = -ENOMEM;
+				break;
+			}
+			if (gfp_cookie != gfp_sleeping_cookie)
+				/*
+				 * We slept, so go around to the
+				 * very beginning..
+				 */
 				continue;
-			desc->error = -ENOMEM;
-			break;
 		}
@@ -941,6 +960,7 @@
 	unsigned long offset, reada, i;
 	struct page * page, **hash;
 	unsigned long old_page, new_page;
+	int gfp_cookie;
 
 	new_page = 0;
 	offset = (address & PAGE_MASK) - area->vm_start + area->vm_offset;
@@ -999,18 +1019,8 @@
 	return new_page;
 
 no_cached_page:
-	/*
-	 * Try to read in an entire cluster at once.
-	 */
-	reada = offset;
-	reada >>= PAGE_SHIFT + page_cluster;
-	reada <<= PAGE_SHIFT + page_cluster;
-
-	for (i = 1 << page_cluster; i > 0; --i, reada += PAGE_SIZE)
-		new_page = try_to_read_ahead(file, reada, new_page);
-
-	if (!new_page)
-		new_page = __get_free_page(GFP_USER);
+	gfp_cookie = gfp_sleeping_cookie;
+	new_page = __get_free_page(GFP_USER);
 	if (!new_page)
 		goto no_page;
@@ -1020,19 +1030,29 @@
 	 * cache.. The page we just got may be useful if we
 	 * can't share, so don't get rid of it here.
 	 */
-	page = find_page(inode, offset);
-	if (page)
-		goto found_page;
+	if (gfp_cookie == gfp_sleeping_cookie ||
+	    !(page = __find_page(inode, offset, *hash)))
+	{
+		/*
+		 * Now, create a new page-cache page from the page we got
+		 */
+		page = mem_map + MAP_NR(new_page);
+		new_page = 0;
+		add_to_page_cache(page, inode, offset, hash);
+		if (inode->i_op->readpage(file, page) != 0)
+			goto failure;
+	}
 
 	/*
-	 * Now, create a new page-cache page from the page we got
+	 * Try to read in an entire cluster at once.
 	 */
-	page = mem_map + MAP_NR(new_page);
-	new_page = 0;
-	add_to_page_cache(page, inode, offset, hash);
+	reada = offset;
+	reada >>= PAGE_SHIFT + page_cluster;
+	reada <<= PAGE_SHIFT + page_cluster;
 
-	if (inode->i_op->readpage(file, page) != 0)
-		goto failure;
+	for (i = 1 << page_cluster; i > 0; --i, reada += PAGE_SIZE)
+		new_page = try_to_read_ahead(file, reada, new_page);
 
 	goto found_page;
Index: mm/swap_state.c
===================================================================
RCS file: /var/cvs/linux/mm/swap_state.c,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 swap_state.c
--- swap_state.c	1999/01/18 01:27:02	1.1.1.1
+++ linux/mm/swap_state.c	1999/04/07 01:11:43
@@ -285,6 +285,7 @@
 {
 	struct page *found_page = 0, *new_page;
 	unsigned long new_page_addr;
+	int gfp_cookie;
 
 #ifdef DEBUG_SWAP
 	printk("DebugVM: read_swap_cache_async entry %08lx%s\n",
@@ -302,6 +303,7 @@
 	if (found_page)
 		goto out_free_swap;
 
+	gfp_cookie = gfp_sleeping_cookie;
 	new_page_addr = __get_free_page(GFP_USER);
 	if (!new_page_addr)
 		goto out_free_swap;	/* Out of memory */
@@ -310,8 +312,8 @@
 	/*
 	 * Check the swap cache again, in case we stalled above.
 	 */
-	found_page = lookup_swap_cache(entry);
-	if (found_page)
+	if (gfp_cookie != gfp_sleeping_cookie &&
+	    (found_page = lookup_swap_cache(entry)))
 		goto out_free_page;
 	/*
 	 * Add it to the swap cache and read its contents.
Index: mm/vmscan.c
===================================================================
RCS file: /var/cvs/linux/mm/vmscan.c,v
retrieving revision 1.1.1.5
diff -u -r1.1.1.5 vmscan.c
--- vmscan.c	1999/02/06 13:22:09	1.1.1.5
+++ linux/mm/vmscan.c	1999/04/07 01:11:43
@@ -20,6 +20,8 @@
 
 #include <asm/pgtable.h>
 
+int gfp_sleeping_cookie;
+
 /*
  * The swap-out functions return 1 if they successfully
  * threw something out, and we got a free page. It returns
@@ -106,6 +108,8 @@
 	 */
 	if (!(gfp_mask & __GFP_IO))
 		return 0;
+
+	gfp_sleeping_cookie++;
 	/*
 	 * Ok, it's really dirty.
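The pattern each allocation site applies above can be distilled into a
few lines; this is a sketch only, reusing the names from the diff:

	/*
	 * Distilled form of the cookie check: re-query the cache only if
	 * GFP may have slept (and thus the cache may have changed).
	 */
	extern int gfp_sleeping_cookie;	/* bumped wherever GFP may sleep */

	struct page *cached_alloc(struct inode *inode, unsigned long offset,
				  struct page **hash)
	{
		struct page *page = __find_page(inode, offset, *hash);

		if (!page) {
			int cookie = gfp_sleeping_cookie;
			unsigned long addr = __get_free_page(GFP_USER);

			if (!addr)
				return NULL;
			/* second lookup only on the (rare) may-have-slept path */
			if (cookie != gfp_sleeping_cookie)
				page = __find_page(inode, offset, *hash);
			if (!page) {
				page = mem_map + MAP_NR(addr);
				add_to_page_cache(page, inode, offset, hash);
			} else
				free_page(addr);
		}
		return page;
	}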
That said, I would ask Chuck to bench this new code of mine (it should be
an obvious improvement).

Andrea Arcangeli
* Re: [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5]
From: Stephen C. Tweedie @ 1999-04-07 13:06 UTC
To: Andrea Arcangeli
Cc: Mark Hemment, Chuck Lever, linux-kernel, linux-mm, Linus Torvalds,
    Stephen C. Tweedie

Hi,

On Wed, 7 Apr 1999 13:28:33 +0200 (CEST), Andrea Arcangeli
<andrea@e-mind.com> said:

> Last night I had a new idea on how to cleanly and trivially avoid the
> double cache query. We only need to know whether or not we slept in GFP.

No, that's not a viable solution in the long term. We have been steadily
fine-graining the locking in the kernel by replacing the global kernel
lock with per-facility or per-resource locks, and in the future we can
expect to see the page cache also running without the global lock.

At that point, we will need to drop any spinlocks we hold before calling
get_free_page(), because the scheduler will only drop the global lock
automatically if we sleep, and we can't sleep with any other locks held.
Now, even if we _don't_ sleep, another CPU can get in to mess with the
page cache while we are doing the allocation.

> I am running the code while writing this, and I'll very soon release a
> 2.2.5_arca8.bz2 with this new improvement included.

Fine, but it's a short-term fix. The hash-cookie mechanism has the
advantage of being suitable for a fine-grain-locked page cache in the
future.

--Stephen
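A sketch of a page-cache lookup under the fine-grained locking Stephen
anticipates; pagecache_lock, find_page_locked() and insert_page_locked()
are hypothetical names. The point is that once the global kernel lock is
gone, "did we sleep?" no longer implies "did the cache change?", while a
per-line cookie still works:

	extern spinlock_t pagecache_lock;

	struct page *lookup_or_add(page_hash_t *line, struct inode *inode,
				   unsigned long offset)
	{
		struct page *page;
		unsigned long addr;
		unsigned int cookie;

		spin_lock(&pagecache_lock);
		page = find_page_locked(line, inode, offset);
		cookie = line->ph_cookie;
		spin_unlock(&pagecache_lock); /* may not sleep holding the lock */
		if (page)
			return page;

		addr = __get_free_page(GFP_USER);	/* may sleep */
		if (!addr)
			return NULL;

		spin_lock(&pagecache_lock);
		/* another CPU may have raced with us even if we never slept */
		if (cookie != line->ph_cookie)
			page = find_page_locked(line, inode, offset);
		if (!page) {
			page = mem_map + MAP_NR(addr);
			insert_page_locked(line, page);	/* bumps ph_cookie */
			addr = 0;
		}
		spin_unlock(&pagecache_lock);
		if (addr)
			free_page(addr);
		return page;
	}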
* Re: [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5]
From: Andrea Arcangeli @ 1999-04-07 13:49 UTC
To: Stephen C. Tweedie
Cc: Mark Hemment, Chuck Lever, linux-kernel, linux-mm, Linus Torvalds

On Wed, 7 Apr 1999, Stephen C. Tweedie wrote:

>At that point, we will need to drop any spinlocks we hold before calling
>get_free_page(), because the scheduler will only drop the global lock
>automatically if we sleep, and we can't sleep with any other locks held.
>Now, even if we _don't_ sleep, another CPU can get in to mess with the
>page cache while we are doing the allocation.

Yes, agreed.

Andrea Arcangeli
* Re: [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5]
From: Andrea Arcangeli @ 1999-04-07 13:42 UTC
To: Mark Hemment, Chuck Lever
Cc: linux-kernel, linux-mm, Linus Torvalds, Stephen C. Tweedie

On Wed, 7 Apr 1999, Andrea Arcangeli wrote:

>That said, I would ask Chuck to bench this new code of mine (it should
>be an obvious improvement).

Don't try it!! I have a bug in the code that is corrupting the cache
(noticed just now). I am working on a fix.

I just noticed that kmem_cache_reap() can theoretically sleep in down(),
so here I moved the cookie++ into do_try_to_free_pages(); it will be less
fine-grained but safer... But I still have corruption, so I am looking at
filemap.c now...

BTW, why does kmem_cache_reap() need a serializing semaphore?

Andrea Arcangeli
* Re: [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5]
From: Ingo Molnar @ 1999-04-07 13:47 UTC
To: Andrea Arcangeli
Cc: Mark Hemment, Chuck Lever, linux-kernel, linux-mm, Linus Torvalds,
    Stephen C. Tweedie

On Wed, 7 Apr 1999, Andrea Arcangeli wrote:

> void prune_dcache(int count)
> {
> +	gfp_sleeping_cookie++;

this can be done via an existing variable, kstat.ctxsw; no need to add
yet another 'have we scheduled' flag. But the whole approach is quite
flawed and volatile: it simply relies on us having the big kernel lock.
With more fine-grained SMP locking in that area we will have big problems
preserving that solution.

-- mingo
* Re: [patch] only-one-cache-query [was Re: [patch] arca-vm-2.2.5]
From: Andrea Arcangeli @ 1999-04-07 14:08 UTC
To: Ingo Molnar
Cc: Mark Hemment, Chuck Lever, linux-kernel, linux-mm, Linus Torvalds,
    Stephen C. Tweedie

On Wed, 7 Apr 1999, Ingo Molnar wrote:

>this can be done via an existing variable, kstat.ctxsw; no need to add
>yet another 'have we scheduled' flag. But the whole approach is quite
>flawed

I thought about ctxsw, but kstat.context_swtch might not be incremented
(if next == prev), and if we released the lock we must go back and look
up the page again.

Andrea Arcangeli
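The case being pointed at, paraphrased from the general shape of a
2.2-era schedule() (not verbatim kernel code; pick_next_task() is a
hypothetical helper): the counter is only bumped on a real task switch,
so a trip through the scheduler that picks the same task again is
invisible to kstat.context_swtch.

	asmlinkage void schedule(void)
	{
		struct task_struct *prev = current;
		struct task_struct *next = pick_next_task();

		if (next != prev) {
			kstat.context_swtch++;	/* only bumped on a real switch */
			switch_to(prev, next, prev);
		}
		/* next == prev: the counter does not change */
	}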
* Re: [patch] arca-vm-2.2.5
From: Horst von Brand @ 1999-04-05 20:24 UTC
To: Mark Hemment; +Cc: Andrea Arcangeli, linux-kernel, linux-mm

Mark Hemment <markhe@sco.COM> said:

[...]

> One worthwhile improvement to the page-hash is to reduce the need to
> re-check the hash after a blocking operation (i.e. page allocation).
> Changing the page-hash from an array of page ptrs to an array of:
>	typedef struct page_hash_s {
>		struct page	*ph_page;
>		unsigned int	ph_cookie;
>	} page_hash_t;
>
>	page_hash_t	page_hash[PAGE_HASH_TABLE];
>
> Whenever a new page is linked into a hash line, the ph_cookie for that
> line is incremented.

If you link new pages in at the start (which would make sense, IMHO,
since they will probably be used soon), you can just use the pointer as
the cookie.

--
Dr. Horst H. von Brand                       mailto:vonbrand@inf.utfsm.cl
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-05 23:25 UTC
To: Horst von Brand; +Cc: Mark Hemment, linux-kernel, linux-mm

On Mon, 5 Apr 1999, Horst von Brand wrote:

>If you link new pages in at the start (which would make sense, IMHO,
>since they will probably be used soon), you can just use the pointer as
>the cookie.

You can have two places in the kernel sleeping at the same time, each
waiting to allocate memory for a cache page.

Andrea Arcangeli
* Re: [patch] arca-vm-2.2.5
From: Horst von Brand @ 1999-04-05 23:37 UTC
To: Andrea Arcangeli; +Cc: Horst von Brand, Mark Hemment, linux-kernel, linux-mm

Andrea Arcangeli <andrea@e-mind.com> said:
> On Mon, 5 Apr 1999, Horst von Brand wrote:
> >If you link new pages in at the start (which would make sense, IMHO,
> >since they will probably be used soon), you can just use the pointer as
> >the cookie.

> You can have two places in the kernel sleeping at the same time, each
> waiting to allocate memory for a cache page.

So what? One wakes up, finds the same pointer it stashed away ==>
installs the new page (changing the pointer) via the short way. The
second wakes up, finds the pointer changed ==> goes the long way to do
its job.

Or am I overlooking something stupid?

--
Dr. Horst H. von Brand                       mailto:vonbrand@inf.utfsm.cl
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-06 1:23 UTC
To: Horst von Brand; +Cc: Mark Hemment, linux-kernel, linux-mm

On Mon, 5 Apr 1999, Horst von Brand wrote:

>So what? One wakes up, finds the same pointer it stashed away ==>
>installs the new page (changing the pointer) via the short way. The
>second wakes up, finds the pointer changed ==> goes the long way to do
>its job.
>
>Or am I overlooking something stupid?

What if the page that was at the start of the chain gets removed, the
page we are allocating gets inserted, and then the same page that was
released before gets inserted again?

	page0 -> page1

remove page0:

	page1

another piece of code needs ourpage, allocates it and inserts it:

	ourpage -> page1

insert page0 again:

	page0 -> ourpage -> page1

Now we have successfully allocated memory for our copy of ourpage and we
are going to insert it. page0 is back at the head, so the pointer check
passes, and if we don't take the slow way we'll add the page twice:

	ourpage -> page0 -> ourpage -> page1

Andrea Arcangeli
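This is the classic ABA hazard. In sketch form (hypothetical helpers,
following the page_hash_t idea from earlier in the thread): the head
pointer can return to its old value while the chain changes underneath,
whereas a monotonically increasing counter cannot.

	/* pointer as cookie: unsafe */
	struct page *stash = line->ph_page;	/* head is page0 */
	addr = __get_free_page(GFP_USER);	/* sleeps: page0 removed, ourpage
						   inserted, page0 re-inserted */
	if (line->ph_page == stash)		/* true again!  (the ABA problem) */
		insert_at_head(line, my_page);	/* duplicates the cached page */

	/* counter as cookie: safe */
	unsigned int cookie = line->ph_cookie;	/* snapshot */
	addr = __get_free_page(GFP_USER);	/* every insert bumps ph_cookie */
	if (cookie != line->ph_cookie)		/* cannot wrap back unnoticed */
		page = find_page_in_line(line, inode, offset);	/* slow way */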
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-17 11:12 UTC
To: Mark Hemment
Cc: linux-kernel, linux-mm, David S. Miller, Stephen C. Tweedie, Chuck Lever

On Mon, 5 Apr 1999, Mark Hemment wrote:

> The page structure needs to be as small as possible. If its size
>happens to L1 align, then that is great, but otherwise it isn't worth the
>effort: the extra memory used to store the "padding" is much better used
>elsewhere.
> Most accesses to the page struct are reads, which means it can live in
>the Shared state across multiple L1 caches. The "slightly" common
>operation of incrementing the ref-count/changing flag bits doesn't really
>come into play often enough to matter.
> Keeping the struct small can result in the part of the page struct of
>interest being in the L1 cache, along with part of the next one. As it
>isn't a heavily modified structure, with no spin locks, "false sharing"
>isn't a problem. Besides, the VM isn't threaded, so it isn't going to be
>playing ping-pong with the cache lines anyway.

I think the same thing applies to many places where we use the slab and
currently request L1 cache alignment, but where the code is not going to
be SMP-threaded for real (no spinlocks in the struct and/or everything
protected by the big kernel lock).

Andrea Arcangeli
* Re: [patch] arca-vm-2.2.5
From: Chuck Lever @ 1999-04-05 21:31 UTC
To: Andrea Arcangeli; +Cc: linux-kernel, linux-mm

On Mon, 5 Apr 1999, Andrea Arcangeli wrote:
> >First, I notice you've altered the page hash function and quadrupled the
>
> The page hash function change is from Stephen (I did it here too because
> I completely agreed with it). The point is that shm entries use the lower
> bits of the pagemap->offset field.

hmmm. wouldn't you think that hashing with the low-order bits of the
offset would cause two different offsets against the same page to make
the hash function generate different output?

the performance difference isn't too bad with Stephen's hash function --
I tried it out last night, and it's on the order of 1% slower on a 233MHz
K6 for general lookups. I didn't try shm (not sure how I would :)

> >kernel compares to one that has just the page hash changes, without the
> >rest of your VM modifications? The reason I ask is that I've played
>
> The reason for that is that it seemed an obvious improvement. And since
> the table is statically allocated (not dynamically allocated at boot as a
> function of the memory size), a somewhat larger default can be desirable:
> I can safely spend a few more bits' worth of memory (some tens of kbytes)
> without harming the available memory here. Well, as I have said many
> times, I think someday we'll need RB-trees instead of fuzzy hashes, but
> it's not a big issue right now due to the low number of pages available.
>
> Returning to your question: in my tree I enlarged the hash table to 13
> bits. This means that in the best case I'll be able to address up to 8192
> pages in O(1). Here I have 32752 pages, so at worst I'd have about 4
> pages chained on every hash entry. 13 bits of hash depth allocates 32k of
> memory for the hash (really not an issue ;).
>
> In the stock kernel instead the hash size is 2^11 = 2048, so in the same
> situation I would have 16 pages chained on the same hash entry.

unfortunately it doesn't work that way. I've measured the buffer hash
function in real-world tests, and found that the worst case is
significantly bad (try this: most buckets are unused, while several
buckets out of 32K have hundreds of buffers). I have a new hash function
that works very well, and even helps inter-run variance and perceived
interactive response. I'll post more on this soon.

but also, the page hash function uses the hash table size as a shift
value when computing the index, so it may combine the interesting bits in
a different (worse) way when you change the hash table size. I'm planning
to instrument the page hash to see exactly what's going on.

> >that the buffer hash table is orders of magnitude larger, yet hashes
> >about the same number of objects. Can someone provide history on the
> >design of the page hash function?
>
> I can't help you with this, but it looks OK to me ;). If somebody has
> done the math on it, I'd like to try to understand it.

IMHO, I'd think if the buffer cache has a large hash table, you'd want
the page cache to have a table as large or larger. both track roughly the
same number of objects... and the page cache is probably used more often
than the buffer cache. I can experiment with this; I've already been
looking at this behavior for a couple of weeks.
a good reason to try this is to get immediate scalability gains in terms
of the number of objects these hash functions can handle before lookup
time becomes unacceptable.

> >Also, can you tell what improvement you expect from the additional logic
> >in try_to_free_buffers()?
>
> Eh, my shrink_mmap() is black magic and it would take a while to explain
> what I was thinking ;). Well, one of the reasons is that ext2 keeps the
> superblock in use all the time, so when I reach a used buffer I put it
> back at the top of the LRU list: I don't want to go to swap just because
> some unfreeable superblocks live forever at the end of the pagemap
> lru_list.

I agree that the whole point of the modification is to help the system
choose the best page to get rid of -- swapping a superblock is probably a
bad idea. :)

someone mentioned putting these on a separate LRU list, or something like
that. maybe you should try that instead? I suspect it might be cleaner
logic.

> Note also (you didn't ask about that, but I bet you noticed it ;) that in
> my tree I also made every pagemap entry L1 cacheline aligned. I asked the
> people who were complaining about page colouring (I still don't know
> exactly what page colouring is; I only have a guess, but I would like to
> read something about implementation details, pointers???) to try out my
> patch to see if it made a difference, but I got no feedback :(. I also
> made the irq_state entry cacheline aligned (once I understood the
> cacheline issue I agreed with it).

if I can, I'd like to separate out the individual modifications and try
each of them against a stock kernel. that usually shows exactly which
changes are useful.

if you want to learn more about page coloring, here are some excellent
references:

Hennessy & Patterson, "Computer Architecture: A Quantitative Approach,"
2nd edition, Morgan Kaufmann Publishers. look in the chapter on CPU
caches (I'd cite the page number here, but my copy is at home).

Lynch, William, "The Interaction of Virtual Memory and Cache Memory,"
Technical Report CSL-TR-93-587, Stanford University Department of
Electrical Engineering and Computer Science, October 1993.

- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/citi-netscape/
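A minimal sketch of the sort of instrumentation described above:
histogram the hash chain lengths so that "most buckets unused, a few
buckets with hundreds of entries" becomes visible. Kernel-side, the
table/size would be page_hash_table/PAGE_HASH_SIZE and the chain link
page->next_hash; here they are parameters for illustration:

	#define MAX_DEPTH 32

	static void hash_histogram(struct page **table, unsigned long size)
	{
		unsigned long histo[MAX_DEPTH + 1] = { 0 };
		unsigned long i;

		for (i = 0; i < size; i++) {
			unsigned long depth = 0;
			struct page *p;

			for (p = table[i]; p; p = p->next_hash)
				depth++;
			histo[depth > MAX_DEPTH ? MAX_DEPTH : depth]++;
		}
		for (i = 0; i <= MAX_DEPTH; i++)
			if (histo[i])
				printk("depth %2lu: %lu buckets\n", i, histo[i]);
	}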
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-06 0:15 UTC
To: Chuck Lever; +Cc: linux-kernel, linux-mm

On Mon, 5 Apr 1999, Chuck Lever wrote:

>buckets out of 32K have hundreds of buffers). I have a new hash function
>that works very well, and even helps inter-run variance and perceived
>interactive response. I'll post more on this soon.

Cool! ;)) But could you tell me _how_ you design a hash function? Do you
do the math or do you use instinct? I have never gone into the details of
benchmarking or designing a hash function, simply because I don't like
fuzzy hashes and instead want to replace them with RB-trees all over the
place (and I know the math behind RB-trees), but I'd like to learn how to
design a hash function anyway (even if I don't want to discover it myself
without reading docs, as usual ;).

>but also, the page hash function uses the hash table size as a shift
>value when computing the index, so it may combine the interesting bits in
>a different (worse) way when you change the hash table size. I'm planning
>to instrument the page hash to see exactly what's going on.

Agreed. This is true. I thought about that, and I am resizing the hash
table to the original 11 bits now (since you are confirming that I broke
the hash function).

>IMHO, I'd think if the buffer cache has a large hash table, you'd want
>the page cache to have a table as large or larger. both track roughly the
>same number of objects... and the page cache is probably used more often

Agreed.

>than the buffer cache. I can experiment with this; I've already been
>looking at this behavior for a couple of weeks.

Sure!! The page cache is used far more than the buffer cache!! Also
consider that most of the time we have to do a _double_ query on the page
cache in order to insert a new entry.

>a good reason to try this is to get immediate scalability gains in terms
>of the number of objects these hash functions can handle before lookup
>time becomes unacceptable.

I am not going to spend time on fuzzy hashes, because my instinct tells
me to drop them.

>I agree that the whole point of the modification is to help the system
>choose the best page to get rid of -- swapping a superblock is probably a
>bad idea. :)

The point is not to swap it out (it's not possible to swap out a buffer);
the point is that if the unfreeable buffers stay at the end of the LRU
list, they increase the swapping pressure, because when we probe the LRU
list we'll always find a fixed amount of unfreeable pages (and a fixed
unfreeable amount of lru_cache means we have to go into swap_out() in
order to free them or to swap out something else).

As I just said, my new shrink_mmap() has been completely redesigned and
reimplemented by me and has nothing to do with the old one. It makes
perfect sense to me, but I think it's not as obvious as the old
shrink_mmap()...

>someone mentioned putting these on a separate LRU list, or something like
>that. maybe you should try that instead? I suspect it might be cleaner
>logic.

Theoretically yes, but it won't be cleaner code according to me. Always
putting them back at the top of the LRU list when we catch them while
browsing the end of the LRU list (as I am doing now) removes the issue
without adding special-case code.
And the superblock thing is not the only reason I am handling the special
"buffer in use" case in my shrink_mmap().

>if I can, I'd like to separate out the individual modifications and try
>each of them against a stock kernel. that usually shows exactly which
>changes are useful.

Fine. I can't do that myself, simply because I don't have the time, but
if you want to do the fine-grained testing I'll be glad to send you my
code separated out and backported to 2.2.5. ;)) Thanks!!

The first thing I am interested in benchmarking is the cacheline-aligned
pagemap struct. If it doesn't improve performance I want to drop it,
because it causes a significant waste of memory.

Here is the pagemap L1-aligned patch against 2.2.5 (I hope I adapted it
correctly to 2.2.5, even if it's a bit late and I drank some beer before
coming back home ;):

--- linux/include/linux/mm.h	Tue Mar  9 01:55:28 1999
+++ mm.h	Tue Apr  6 02:00:22 1999
@@ -131,0 +133,6 @@
+#ifdef __SMP__
+	/* cacheline alignment */
+	char dummy[(sizeof(void *) * 7 +
+		    sizeof(unsigned long) * 2 +
+		    sizeof(atomic_t)) % L1_CACHE_BYTES];
+#endif

Index: linux/mm/page_alloc.c
diff -u linux/mm/page_alloc.c:1.1.1.3 linux/mm/page_alloc.c:1.1.2.28
--- linux/mm/page_alloc.c:1.1.1.3	Tue Jan 26 19:32:27 1999
+++ linux/mm/page_alloc.c	Fri Apr  2 01:12:37 1999
@@ -315,7 +318,7 @@
 	freepages.min = i;
 	freepages.low = i * 2;
 	freepages.high = i * 3;
-	mem_map = (mem_map_t *) LONG_ALIGN(start_mem);
+	mem_map = (mem_map_t *) L1_CACHE_ALIGN(start_mem);
 	p = mem_map + MAP_NR(end_mem);
 	start_mem = LONG_ALIGN((unsigned long) p);
 	memset(mem_map, 0, start_mem - (unsigned long) mem_map);

And here, extracted from my CVS tree, is the other thing I am not sure
about and that you were asking for (this causes less waste of memory, but
I am very curious to know what difference it makes on a 4-way SMP):

Index: arch/i386/kernel/irq.c
===================================================================
RCS file: /var/cvs/linux/arch/i386/kernel/irq.c,v
retrieving revision 1.1.1.3
retrieving revision 1.1.2.11
diff -u -r1.1.1.3 -r1.1.2.11
--- irq.c	1999/02/20 15:38:00	1.1.1.3
+++ linux/arch/i386/kernel/irq.c	1999/04/04 01:22:53	1.1.2.11
@@ -139,7 +139,7 @@
 /*
  * Controller mappings for all interrupt sources:
  */
-irq_desc_t irq_desc[NR_IRQS] = { [0 ... NR_IRQS-1] = { 0, &no_irq_type, }};
+irq_desc_t irq_desc[NR_IRQS] __cacheline_aligned = { [0 ... NR_IRQS-1] = { 0, &no_irq_type, }};
 
 /*
Index: arch/i386/kernel/irq.h
===================================================================
RCS file: /var/cvs/linux/arch/i386/kernel/irq.h,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 irq.h
--- irq.h	1999/02/20 15:38:01	1.1.1.3
+++ linux/arch/i386/kernel/irq.h	1999/04/01 22:53:07
@@ -39,6 +39,9 @@
 	struct hw_interrupt_type *handler;	/* handle/enable/disable functions */
 	struct irqaction *action;	/* IRQ action list */
 	unsigned int depth;		/* Disable depth for nested irq disables */
+#ifdef __SMP__
+	unsigned int unused[4];
+#endif
 } irq_desc_t;

>if you want to learn more about page coloring, here are some excellent
>references:
>
>Hennessy & Patterson, "Computer Architecture: A Quantitative Approach,"
>2nd edition, Morgan Kaufmann Publishers. look in the chapter on CPU
>caches (I'd cite the page number here, but my copy is at home).
>
>Lynch, William, "The Interaction of Virtual Memory and Cache Memory,"
>Technical Report CSL-TR-93-587, Stanford University Department of
>Electrical Engineering and Computer Science, October 1993.

Thanks!!
But sigh, I don't have much money to buy them right now... I'll save up
and buy them ASAP. In the meantime I'll hope for GPL'd docs...

Andrea Arcangeli
* Re: [patch] arca-vm-2.2.5
From: Doug Ledford @ 1999-04-06 2:14 UTC
To: Andrea Arcangeli; +Cc: Chuck Lever, linux-kernel, linux-mm

Andrea Arcangeli wrote:
>
> On Mon, 5 Apr 1999, Chuck Lever wrote:
>
> >buckets out of 32K have hundreds of buffers). I have a new hash function
> >that works very well, and even helps inter-run variance and perceived
> >interactive response. I'll post more on this soon.
>
> Cool! ;)) But could you tell me _how_ you design a hash function? Do you
> do the math or do you use instinct? I have never gone into the details of
> benchmarking or designing a hash function, simply because I don't like
> fuzzy hashes and instead want to replace them with RB-trees all over the
> place (and I know the math behind RB-trees), but I'd like to learn how to
> design a hash function anyway (even if I don't want to discover it myself
> without reading docs, as usual ;).
>
> >but also, the page hash function uses the hash table size as a shift
> >value when computing the index, so it may combine the interesting bits in
> >a different (worse) way when you change the hash table size. I'm planning
> >to instrument the page hash to see exactly what's going on.
>
> Agreed. This is true. I thought about that, and I am resizing the hash
> table to the original 11 bits now (since you are confirming that I broke
> the hash function).
>
> >IMHO, I'd think if the buffer cache has a large hash table, you'd want
> >the page cache to have a table as large or larger. both track roughly the
> >same number of objects... and the page cache is probably used more often
>
> Agreed.

Hmmm... I've talked about this a few times with Alan Cox and Stephen
Tweedie. I didn't bother to instrument the hash function, because in this
case I knew it was tuned to the size of the inode structs. But I did
implement a variable-sized page cache hash table array. I did this
because at 512MB of RAM, having only 2048 hash buckets (11 bits) became a
real bottleneck in page cache hash entry lookups. Essentially, when the
number of physical pages per hash bucket gets too large, hash chains can
grow excessively long and take entirely too long to look up when the page
is not in the page cache (or when it is, but it's at the end of a long
chain). Since these lookups occur as frequently as they do, they become
excessive quickly.

I'm attaching the patch I've been using here for a while. Feel free to
try it out. One of the best tests for the effect of this patch is a
bonnie run on a machine with more disk speed than CPU power. In that case
the page cache lookup becomes a bottleneck. In other conditions, you see
the same disk speed results but with smaller % CPU utilization. On a dual
PII-400 machine with 512MB RAM and a 4-Cheetah RAID0 array it made a
10 MByte/s difference in throughput. On another machine it made roughly a
6 to 8% difference in CPU utilization.

--
Doug Ledford <dledford@redhat.com>
Opinions expressed are my own, but they should be everybody's.
[-- Attachment #2: page-2.2.2.patch --]
[-- Type: application/octet-stream, Size: 3509 bytes --]

--- linux/include/linux/pagemap.h.page	Thu Jan 28 17:43:17 1999
+++ linux/include/linux/pagemap.h	Mon Mar  8 07:37:54 1999
@@ -17,13 +17,13 @@
 	return PAGE_OFFSET + PAGE_SIZE * (page - mem_map);
 }
 
-#define PAGE_HASH_BITS 11
-#define PAGE_HASH_SIZE (1 << PAGE_HASH_BITS)
-
 #define PAGE_AGE_VALUE 16
-
 extern unsigned long page_cache_size; /* # of pages currently in the hash table */
-extern struct page * page_hash_table[PAGE_HASH_SIZE];
+extern struct page ** page_hash_table;
+extern unsigned long page_hash_size;
+extern unsigned long page_hash_mask;
+extern unsigned long page_hash_bits;
+extern long page_cache_init(long, long);
 
 /*
  * We use a power-of-two hash table to avoid a modulus,
@@ -35,8 +35,8 @@
 {
 #define i (((unsigned long) inode)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
 #define o (offset >> PAGE_SHIFT)
-#define s(x) ((x)+((x)>>PAGE_HASH_BITS))
-	return s(i+o) & (PAGE_HASH_SIZE-1);
+#define s(x) ((x)+((x)>>page_hash_bits))
+	return s(i+o) & page_hash_mask;
 #undef i
 #undef o
 #undef s
--- linux/init/main.c.page	Mon Feb 22 22:44:53 1999
+++ linux/init/main.c	Mon Mar  8 07:37:54 1999
@@ -19,6 +19,7 @@
 #include <linux/utsname.h>
 #include <linux/ioport.h>
 #include <linux/init.h>
+#include <linux/pagemap.h>
 #include <linux/smp_lock.h>
 #include <linux/blk.h>
 #include <linux/hdreg.h>
@@ -1149,6 +1150,7 @@
 		initrd_start = 0;
 	}
 #endif
+	memory_start = page_cache_init(memory_start, memory_end);
 	mem_init(memory_start,memory_end);
 	kmem_cache_sizes_init();
 #ifdef CONFIG_PROC_FS
--- linux/mm/filemap.c.page	Mon Feb 22 22:44:53 1999
+++ linux/mm/filemap.c	Mon Mar  8 07:37:54 1999
@@ -13,6 +13,7 @@
 #include <linux/shm.h>
 #include <linux/mman.h>
 #include <linux/locks.h>
+#include <linux/init.h>
 #include <linux/pagemap.h>
 #include <linux/swap.h>
 #include <linux/smp_lock.h>
@@ -23,6 +24,7 @@
 
 #include <asm/pgtable.h>
 #include <asm/uaccess.h>
+#include <asm/io.h>
 
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
@@ -32,7 +34,45 @@
  */
 
 unsigned long page_cache_size = 0;
-struct page * page_hash_table[PAGE_HASH_SIZE];
+unsigned long page_hash_size = 0;
+unsigned long page_hash_bits = 0;
+unsigned long page_hash_mask = 0;
+struct page ** page_hash_table = NULL;
+
+/*
+ * The init function that starts off our page cache
+ */
+long __init page_cache_init( long mem_begin, long mem_end )
+{
+	long cache_begin;
+	long i;
+
+	/* find the index of the highest set bit of the memory size */
+	for(i=0; i < sizeof(long) * 8; i++)
+		if(virt_to_phys((void *)mem_end) & (1<<i))
+			page_hash_bits = i;
+	page_hash_bits -= (PAGE_SHIFT + 2);
+	if(page_hash_bits < 10)
+		page_hash_bits = 10;
+	cache_begin = (mem_begin + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
+	page_hash_size = 1 << page_hash_bits;
+	page_hash_mask = page_hash_size - 1;
+	if ( ((virt_to_phys((void *)cache_begin) <= 0xfffff) &&
+	      (virt_to_phys((void *)cache_begin) >= (0x9f000 - page_hash_size))) )
+	{
+		/*
+		 * Our structure would fall into the middle of the upper
+		 * BIOS area of the computer, go above it to the 1MB mark
+		 * and start from there
+		 */
+		cache_begin = (long)phys_to_virt(0x100000);
+	}
+	page_hash_table = (struct page **)cache_begin;
+	memset(page_hash_table, 0, page_hash_size * sizeof(void *));
+	printk(KERN_INFO "page_cache_init: allocated %ldk of RAM with %ld hash "
+		"buckets.\n", (page_hash_size * sizeof(void *)) / 1024,
+		page_hash_size);
+	return(cache_begin + (page_hash_size * sizeof(void *)));
+}
 
 /*
  * Simple routines for both non-shared and shared mappings.
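For a feel for the numbers, here is a small user-space rerun of the sizing loop above (a sketch only: it ignores the virt_to_phys() adjustment and assumes PAGE_SHIFT == 12 as on i386):

#include <stdio.h>

#define PAGE_SHIFT 12

static unsigned long hash_bits_for(unsigned long mem_end)
{
        unsigned long bits = 0;
        int i;

        /* index of the highest set bit of the memory size */
        for (i = 0; i < (int)(sizeof(long) * 8); i++)
                if (mem_end & (1UL << i))
                        bits = i;
        bits -= PAGE_SHIFT + 2;
        if (bits < 10)
                bits = 10;
        return bits;
}

int main(void)
{
        unsigned long mb;

        /* e.g. 512MB -> 15 hash bits -> 32768 buckets */
        for (mb = 32; mb <= 512; mb *= 2)
                printf("%4luMB -> %lu hash bits (%lu buckets)\n",
                       mb, hash_bits_for(mb << 20),
                       1UL << hash_bits_for(mb << 20));
        return 0;
}

So on the 512MB machine Doug describes, the patch grows the table from 2048 to 32768 buckets, cutting the average chain length by a factor of 16.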
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-06 13:04 UTC
To: Doug Ledford
Cc: Chuck Lever, linux-kernel, linux-mm

On Mon, 5 Apr 1999, Doug Ledford wrote:

> Hmmm... I've talked about this a few times with Alan Cox and Stephen
> Tweedie. I didn't bother to instrument the hash function, because in
> this case I knew it was tuned to the size of the inode structs. But I
> did implement a variable-sized page cache hash table array. I did this

Well, it's straightforward; I did the same some time ago for the buffer hash table. But I agree with Chuck that enlarging the hash size could harm the hash function's distribution (I should think about it some more, though).

I also think I'll implement the cookie thing suggested by Mark, since I am too curious to see how much it will help (even if my mind is driven by RB-trees ;).

Andrea Arcangeli
* Re: [patch] arca-vm-2.2.5
From: Stephen C. Tweedie @ 1999-04-06 21:31 UTC
To: Andrea Arcangeli
Cc: Doug Ledford, Chuck Lever, linux-kernel, linux-mm, Stephen Tweedie

Hi,

On Tue, 6 Apr 1999 15:04:36 +0200 (CEST), Andrea Arcangeli <andrea@e-mind.com> said:

> I also think I'll implement the cookie thing suggested by Mark, since I
> am too curious to see how much it will help (even if my mind is driven
> by RB-trees ;).

Trees are bad for this sort of thing in general: they can be as fast as hashing for lookup, but they are much slower for insert and delete operations, and insert and delete for the page cache _must_ be fast. The memory cost of expanding a hash table is small compared to those performance problems.

--Stephen
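For context, the O(1) insert/delete that Stephen refers to comes from hash chains that are doubly linked through a back-pointer (visible as pprev_hash on struct page in the patches elsewhere in this thread): removal never walks the chain. A minimal user-space sketch with stand-in types, not the kernel code itself:

#include <stddef.h>

struct hpage {
        struct hpage *next_hash;
        struct hpage **pprev_hash;  /* points at whatever points at us */
};

static void hash_insert(struct hpage **bucket, struct hpage *p)
{
        if ((p->next_hash = *bucket) != NULL)
                (*bucket)->pprev_hash = &p->next_hash;
        *bucket = p;
        p->pprev_hash = bucket;     /* constant work, no walk */
}

static void hash_remove(struct hpage *p)
{
        if (p->next_hash)
                p->next_hash->pprev_hash = p->pprev_hash;
        *p->pprev_hash = p->next_hash;  /* unlink without finding the head */
        p->pprev_hash = NULL;
}

int main(void)
{
        struct hpage *bucket = NULL;
        struct hpage a, b;

        hash_insert(&bucket, &a);
        hash_insert(&bucket, &b);   /* chain is now b -> a */
        hash_remove(&a);            /* O(1) unlink of a middle element */
        return (bucket == &b && b.next_hash == NULL) ? 0 : 1;
}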
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-06 22:27 UTC
To: Stephen C. Tweedie
Cc: Doug Ledford, Chuck Lever, linux-kernel, linux-mm, Linus Torvalds, Mark Hemment

On Tue, 6 Apr 1999, Stephen C. Tweedie wrote:

> Trees are bad for this sort of thing in general: they can be as fast as
> hashing for lookup, but are much slower for insert and delete
> operations. Insert and delete for the page cache _must_ be fast.

It's not so obvious to me. I agree that an O(n) insertion/deletion would be far too slow, but O(log(n)) for everything seems reasonable to me, and trees don't have to worry about unlucky hash behavior.

For the record, my plan would be to put the top of the tree directly in the inode struct, and then to queue all cached pages belonging to that inode in the per-inode cache tree. There would then be no need to always check the inode field in find_inode(), and reading small files would be _dramatically_ fast and immediate even with 5 GB of cache allocated. I think this behavior will become interesting.

Maybe you are 100% right in telling me that RB-trees will lose, but I would like to try it out someday...

Andrea Arcangeli
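For illustration, a minimal user-space sketch of the lookup half of that idea. This is not Andrea's code: the types are stand-ins, and it uses a plain (unbalanced) binary tree where he intends an RB-tree. The point is only that the tree root lives in the inode and the key is the page offset alone, so no inode comparison is needed per step:

#include <stddef.h>

struct page { int dummy; };

struct ptree_node {
        unsigned long offset;            /* page-aligned offset: the key */
        struct page *page;
        struct ptree_node *left, *right; /* RB colour and rebalancing elided */
};

struct inode_sketch {
        struct ptree_node *page_tree;    /* tree root embedded in the inode */
};

static struct page *find_page(struct inode_sketch *inode, unsigned long offset)
{
        struct ptree_node *n = inode->page_tree;

        while (n) {                      /* O(log n), n = pages of this inode */
                if (offset < n->offset)
                        n = n->left;
                else if (offset > n->offset)
                        n = n->right;
                else
                        return n->page;  /* no inode field comparison needed */
        }
        return NULL;
}

int main(void)
{
        struct page p0, p1;
        struct ptree_node n0 = { 0, &p0, NULL, NULL };
        struct ptree_node n1 = { 4096, &p1, NULL, NULL };
        struct inode_sketch inode = { &n1 };

        n1.left = &n0;
        return find_page(&inode, 0) == &p0 ? 0 : 1;
}

A small file with a handful of cached pages gets a tree of depth one or two, which is the "small files are immediate" behavior Andrea describes, independent of how much cache other inodes hold.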
* Re: [patch] arca-vm-2.2.5
From: Stephen C. Tweedie @ 1999-04-07 12:27 UTC
To: Andrea Arcangeli
Cc: Stephen C. Tweedie, Doug Ledford, Chuck Lever, linux-kernel, linux-mm, Linus Torvalds, Mark Hemment

Hi,

On Wed, 7 Apr 1999 00:27:21 +0200 (CEST), Andrea Arcangeli <andrea@e-mind.com> said:

> It's not so obvious to me. I agree that an O(n) insertion/deletion would
> be far too slow, but O(log(n)) for everything seems reasonable to me, and
> trees don't have to worry about unlucky hash behavior.

Trees are O(log n) for insert/delete, with a high constant of proportionality (i.e. there can be quite a lot of work to do even for small log n), and they occupy more memory per node. Hashes are O(1) for insert and delete, and a _fast_ O(1) at that. The page cache needs fast insert and delete.

--Stephen
* Re: [patch] arca-vm-2.2.5
From: Chuck Lever @ 1999-04-25  3:22 UTC
To: Stephen C. Tweedie
Cc: Andrea Arcangeli, linux-kernel, linux-mm

On Wed, 7 Apr 1999, Stephen C. Tweedie wrote:

> On Wed, 7 Apr 1999 00:27:21 +0200 (CEST), Andrea Arcangeli
> <andrea@e-mind.com> said:
>
> > It's not so obvious to me. I agree that an O(n) insertion/deletion
> > would be far too slow, but O(log(n)) for everything seems reasonable to
> > me, and trees don't have to worry about unlucky hash behavior.
>
> Trees are O(log n) for insert/delete, with a high constant of
> proportionality (i.e. there can be quite a lot of work to do even
> for small log n), and they occupy more memory per node. Hashes are
> O(1) for insert and delete, and a _fast_ O(1) at that. The page cache
> needs fast insert and delete.

I had a few minutes the other day, so I extracted just the rb-tree part of 2.2.5-arca10 and benchmarked it against 2.2.5 and against the hash tuning patch I'm working on. I ran this on our 512M 4-way Dell PowerEdge 6300.

ref  - 2.2.5 kernel with 4000 process slots and b_state fix
rbt  - 2.2.5 kernel with 4000 process slots and rbtree patch applied (b_state fix *not* applied)
hash - 2.2.5 kernel with an older version of my hash tuning patch applied (b_state fix applied)

160 concurrent scripts; all of the benchmark fits in memory.

ref:  3725.8  s=15.23
rbt:  3893.3  s=9.82
hash: 4007.3  s=15.95

"hash" tunes the page cache, the buffer cache, the dentry cache, and the inode cache; "rbt" just replaces the page cache with per-inode rbtrees. I think the rbtree patch compares pretty favorably for very large memory machines.

- Chuck Lever
* Re: [patch] arca-vm-2.2.5
From: Chuck Lever @ 1999-04-06  5:52 UTC
To: Andrea Arcangeli
Cc: linux-kernel, linux-mm

On Tue, 6 Apr 1999, Andrea Arcangeli wrote:

> Cool! ;)) But could you tell me _how_ you design a hash function? Are
> you doing math or do you use instinct?

Math. I'll post something about this soon.

> > but also the page hash function uses the hash table size as a shift value
> > when computing the index, so it may combine the interesting bits in a
> > different (worse) way when you change the hash table size. I'm planning
> > to instrument the page hash to see exactly what's going on.
>
> Agreed. This is true. I thought about that and I am resizing the hash
> table size to the original 11 bits now (since you are confirming that I
> broke the hash function).

I looked at Doug's patch too, and it changes the "page_shift" value depending on the size of the hash table. Again, this *may* cause unwanted interactions that make the hash function degenerate for certain table sizes, but I'd like to instrument the hash to watch what really happens.

I ran some simple benchmarks on our 4-way Xeon PowerEdge to see the effects of your patches. Here are the original patches against 2.2.5.

The page struct alignment patch:

> --- linux/include/linux/mm.h	Tue Mar  9 01:55:28 1999
> +++ mm.h	Tue Apr  6 02:00:22 1999
> @@ -131,0 +133,6 @@
> +#ifdef __SMP__
> +	/* cacheline alignment */
> +	char dummy[(sizeof(void *) * 7 +
> +		    sizeof(unsigned long) * 2 +
> +		    sizeof(atomic_t)) % L1_CACHE_BYTES];
> +#endif
> Index: linux/mm/page_alloc.c
> diff -u linux/mm/page_alloc.c:1.1.1.3 linux/mm/page_alloc.c:1.1.2.28
> --- linux/mm/page_alloc.c:1.1.1.3	Tue Jan 26 19:32:27 1999
> +++ linux/mm/page_alloc.c	Fri Apr  2 01:12:37 1999
> @@ -315,7 +318,7 @@
>  	freepages.min = i;
>  	freepages.low = i * 2;
>  	freepages.high = i * 3;
> -	mem_map = (mem_map_t *) LONG_ALIGN(start_mem);
> +	mem_map = (mem_map_t *) L1_CACHE_ALIGN(start_mem);
>  	p = mem_map + MAP_NR(end_mem);
>  	start_mem = LONG_ALIGN((unsigned long) p);
>  	memset(mem_map, 0, start_mem - (unsigned long) mem_map);

and the irq alignment patch:

> Index: arch/i386/kernel/irq.c
> ===================================================================
> RCS file: /var/cvs/linux/arch/i386/kernel/irq.c,v
> retrieving revision 1.1.1.3
> retrieving revision 1.1.2.11
> diff -u -r1.1.1.3 -r1.1.2.11
> --- irq.c	1999/02/20 15:38:00	1.1.1.3
> +++ linux/arch/i386/kernel/irq.c	1999/04/04 01:22:53	1.1.2.11
> @@ -139,7 +139,7 @@
>  /*
>   * Controller mappings for all interrupt sources:
>   */
> -irq_desc_t irq_desc[NR_IRQS] = { [0 ... NR_IRQS-1] = { 0, &no_irq_type, }};
> +irq_desc_t irq_desc[NR_IRQS] __cacheline_aligned = { [0 ... NR_IRQS-1] = { 0, &no_irq_type, }};
>
> Index: arch/i386/kernel/irq.h
> ===================================================================
> RCS file: /var/cvs/linux/arch/i386/kernel/irq.h,v
> retrieving revision 1.1.1.3
> diff -u -r1.1.1.3 irq.h
> --- irq.h	1999/02/20 15:38:01	1.1.1.3
> +++ linux/arch/i386/kernel/irq.h	1999/04/01 22:53:07
> @@ -39,6 +39,9 @@
>  	struct hw_interrupt_type *handler;	/* handle/enable/disable functions */
>  	struct irqaction *action;		/* IRQ action list */
>  	unsigned int depth;			/* Disable depth for nested irq disables */
> +#ifdef __SMP__
> +	unsigned int unused[4];
> +#endif
>  } irq_desc_t;
>
>  /*

I ran 128 concurrent scripts on our 512M PowerEdge. Each instantiation of the script runs the same programs in the same sequence. The script is designed to emulate a software development workload, so it contains commands like cpio, cc, nroff, and ed. The VM+file working set was contained in less than 150M, so this benchmark was CPU and memory bound.

I tested 4 different kernels:

ref:  a stock 2.2.5 kernel
p-al: a stock 2.2.5 kernel with your page struct alignment patch applied
irq:  a stock 2.2.5 kernel with your irq alignment patch applied
both: a stock 2.2.5 kernel with both patches applied

All kernels were compiled using egcs-1.1.1 with the same .config and compiler optimizations. The benchmark numbers are average throughput in "scripts per hour" for 4 consecutive runs; this value is computed from the elapsed time for all scripts to complete, scaled by the number of concurrent scripts. (s= is the standard deviation; it indicates roughly the inter-run variance.)

ref:  4176.4 (s=27.45)
p-al: 4207.9 (s=8.1)
irq:  4228.8 (s=11.70)
both: 4207.9 (s=13.34)

The irq patch is a clear win over the reference kernel: it shows a consistent 1.25% improvement in overall throughput, and the performance difference is more than a standard deviation. The variance also appears to be smaller with the irq kernel. I would bet that on a more I/O-bound load the improvement would be even more stark.

I'm not certain why the combination kernel performed worse than the irq-only kernel.

> > Lynch, William, "The Interaction of Virtual Memory and Cache Memory,"
> > Technical Report CSL-TR-93-587, Stanford University Department of
> > Electrical Engineering and Computer Science, October 1993.
>
> Thanks!! But sigh, I don't have much money to buy them right now...
> I'll save up and buy them as soon as I can. In the meantime I'll keep
> hoping for GPL'd docs...

"Lynch" is a PhD thesis available in PostScript on Stanford's web site for free. It's a study of different coloring methodologies, so it's fairly broad.

- Chuck Lever
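As a rough sanity check of the metric (this is my reading of the description; the elapsed time below is back-computed from the reported throughput, not a reported measurement):

#include <stdio.h>

/* N concurrent scripts, all finished after elapsed_sec seconds */
static double scripts_per_hour(int n, double elapsed_sec)
{
        return n * 3600.0 / elapsed_sec;
}

int main(void)
{
        /* the reported 4176.4 scripts/hr with N = 128 implies ~110 s/run */
        printf("%.1f scripts/hr\n", scripts_per_hour(128, 110.3));
        return 0;
}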
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-06 13:09 UTC
To: Chuck Lever
Cc: linux-kernel, linux-mm

On Tue, 6 Apr 1999, Chuck Lever wrote:

> Math. I'll post something about this soon.

Cool! Thanks.

> I looked at Doug's patch too, and it changes the "page_shift" value
> depending on the size of the hash table. Again, this *may* cause unwanted
> interactions that make the hash function degenerate for certain table sizes.

Agreed.

> ref:  a stock 2.2.5 kernel
>
> p-al: a stock 2.2.5 kernel with your page struct alignment patch applied
>
> irq:  a stock 2.2.5 kernel with your irq alignment patch applied
>
> both: a stock 2.2.5 kernel with both patches applied

*snip*

> ref:  4176.4 (s=27.45)
>
> p-al: 4207.9 (s=8.1)
         ^^^ it made a _difference_
>
> irq:  4228.8 (s=11.70)
>
> both: 4207.9 (s=13.34)
  ^^^^^ strange...

> The irq patch is a clear win over the reference kernel: it shows a

Good ;)

> consistent 1.25% improvement in overall throughput, and the performance
> difference is more than a standard deviation. The variance also appears
> to be smaller with the irq kernel. I would bet that on a more I/O-bound
> load the improvement would be even more stark.
>
> I'm not certain why the combination kernel performed worse than the
> irq-only kernel.

Hmm, I'll think about that...

> "Lynch" is a PhD thesis available in PostScript on Stanford's web site
> for free. It's a study of different coloring methodologies, so it's
> fairly broad.

Thanks!! I'll search for it soon.

Andrea Arcangeli
* Re: [patch] arca-vm-2.2.5
From: Eric W. Biederman @ 1999-04-06 16:19 UTC
To: Chuck Lever
Cc: Andrea Arcangeli, linux-kernel, linux-mm

>>>>> "CL" == Chuck Lever <cel@monkey.org> writes:

CL> On Tue, 6 Apr 1999, Andrea Arcangeli wrote:
>> Cool! ;)) But could you tell me _how_ you design a hash function? Are
>> you doing math or do you use instinct?

CL> Math. I'll post something about this soon.

>> > but also the page hash function uses the hash table size as a shift value
>> > when computing the index, so it may combine the interesting bits in a
>> > different (worse) way when you change the hash table size. I'm planning
>> > to instrument the page hash to see exactly what's going on.
>>
>> Agreed. This is true. I thought about that and I am resizing the hash
>> table size to the original 11 bits now (since you are confirming that I
>> broke the hash function).

CL> I looked at Doug's patch too, and it changes the "page_shift" value
CL> depending on the size of the hash table. Again, this *may* cause unwanted
CL> interactions that make the hash function degenerate for certain table
CL> sizes, but I'd like to instrument the hash to watch what really happens.

CL> I ran some simple benchmarks on our 4-way Xeon PowerEdge to see the
CL> effects of your patches. Here are the original patches against 2.2.5.

CL> The page struct alignment patch:

>> --- linux/include/linux/mm.h	Tue Mar  9 01:55:28 1999
>> +++ mm.h	Tue Apr  6 02:00:22 1999
>> @@ -131,0 +133,6 @@
>> +#ifdef __SMP__
>> +	/* cacheline alignment */
>> +	char dummy[(sizeof(void *) * 7 +
>> +		    sizeof(unsigned long) * 2 +
>> +		    sizeof(atomic_t)) % L1_CACHE_BYTES];
>> +#endif

Am I the only one to notice that this little bit of code is totally wrong? It happens to get it right for cache sizes of 16 and 32 with the current struct page, but the logic is 100% backwards: the expression computes how far the struct overruns a cache line boundary, while the padding needed is the cache line size *minus* that remainder.

Eric
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-06 20:26 UTC
To: Eric W. Biederman
Cc: Chuck Lever, linux-kernel, linux-mm

On 6 Apr 1999, Eric W. Biederman wrote:

>>> --- linux/include/linux/mm.h	Tue Mar  9 01:55:28 1999
>>> +++ mm.h	Tue Apr  6 02:00:22 1999
>>> @@ -131,0 +133,6 @@
>>> +#ifdef __SMP__
>>> +	/* cacheline alignment */
>>> +	char dummy[(sizeof(void *) * 7 +
>>> +		    sizeof(unsigned long) * 2 +
>>> +		    sizeof(atomic_t)) % L1_CACHE_BYTES];
>>> +#endif
>
> Am I the only one to notice that this little bit of code is totally
> wrong? It happens to get it right for cache sizes of 16 and 32 with the
> current struct page, but the logic is 100% backwards.

Does something like this look better?

Index: mm/page_alloc.c
===================================================================
RCS file: /var/cvs/linux/mm/page_alloc.c,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 page_alloc.c
--- page_alloc.c	1999/01/26 18:32:27	1.1.1.3
+++ linux/mm/page_alloc.c	1999/04/06 20:04:59
@@ -315,7 +315,7 @@
 	freepages.min = i;
 	freepages.low = i * 2;
 	freepages.high = i * 3;
-	mem_map = (mem_map_t *) LONG_ALIGN(start_mem);
+	mem_map = (mem_map_t *) L1_CACHE_ALIGN(start_mem);
 	p = mem_map + MAP_NR(end_mem);
 	start_mem = LONG_ALIGN((unsigned long) p);
 	memset(mem_map, 0, start_mem - (unsigned long) mem_map);
Index: include/linux/mm.h
===================================================================
RCS file: /var/cvs/linux/include/linux/mm.h,v
retrieving revision 1.1.1.4
diff -u -r1.1.1.4 mm.h
--- mm.h	1999/03/09 00:55:28	1.1.1.4
+++ linux/include/linux/mm.h	1999/04/06 20:24:43
@@ -129,6 +129,14 @@
 	struct wait_queue *wait;
 	struct page **pprev_hash;
 	struct buffer_head * buffers;
+#ifdef __SMP__
+#define MEM_MAP_L1_WRAP ((sizeof(void *) * 7 +		\
+			  sizeof(unsigned long) * 2 +	\
+			  sizeof(atomic_t)) % L1_CACHE_BYTES)
+	/* cacheline alignment */
+	char dummy[MEM_MAP_L1_WRAP ? L1_CACHE_BYTES - MEM_MAP_L1_WRAP : 0];
+#undef MEM_MAP_L1_WRAP
+#endif
 } mem_map_t;
 
 /* Page flag bit values */

Thanks.

Andrea Arcangeli
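A quick user-space check of the padding arithmetic in that patch (a sketch: sizeof(int) stands in for atomic_t, and the numbers in the comment assume a 32-bit target with 32-byte cache lines):

#include <stdio.h>

#define L1_CACHE_BYTES 32

int main(void)
{
        /* on 32-bit x86: 28 + 8 + 4 = 40 bytes of payload */
        unsigned long payload = sizeof(void *) * 7 +
                                sizeof(unsigned long) * 2 +
                                sizeof(int);
        unsigned long wrap = payload % L1_CACHE_BYTES;           /* 8  */
        unsigned long pad  = wrap ? L1_CACHE_BYTES - wrap : 0;   /* 24 */

        /* total 64 = an exact multiple of the cache line size */
        printf("payload=%lu wrap=%lu pad=%lu total=%lu\n",
               payload, wrap, pad, payload + pad);
        return 0;
}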
* Re: [patch] arca-vm-2.2.5
From: Eric W. Biederman @ 1999-04-07  5:00 UTC
To: Andrea Arcangeli
Cc: Chuck Lever, linux-kernel, linux-mm

>>>>> "AA" == Andrea Arcangeli <andrea@e-mind.com> writes:

AA> Does something like this look better?

Yes. But:

#define __l1_cache_aligned __attribute__((aligned (L1_CACHE_BYTES)))
...

}
#ifdef SMP
__l1_cache_aligned
#endif
mem_map_t;

looks even better, and is actually maintainable.

Note: I haven't actually tried to compile this fragment...

Eric
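A user-space check along the lines Andrea mentions trying below (a sketch; __alignof__ is a GCC extension, and the 32-byte cache line is assumed):

#include <stdio.h>

#define L1_CACHE_BYTES 32
#define __l1_cache_aligned __attribute__((aligned(L1_CACHE_BYTES)))

typedef struct {
        void *ptrs[7];
        unsigned long longs[2];
        int count;                /* stand-in for atomic_t */
} __l1_cache_aligned padded_t;

int main(void)
{
        /* sizeof() is rounded up to a multiple of the alignment,
         * so the padding comes for free with the attribute */
        printf("sizeof = %lu, alignment ok = %d\n",
               (unsigned long) sizeof(padded_t),
               __alignof__(padded_t) == L1_CACHE_BYTES);
        return 0;
}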
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-07 11:36 UTC
To: Eric W. Biederman
Cc: Chuck Lever, linux-kernel, linux-mm

On 7 Apr 1999, Eric W. Biederman wrote:

> #define __l1_cache_aligned __attribute__((aligned (L1_CACHE_BYTES)))
> ...
>
> }
> #ifdef SMP
> __l1_cache_aligned
> #endif
> mem_map_t;
>
> looks even better, and is actually maintainable.

Agreed, it should work fine, but I think gcc-2.7.2.3 is not able to 32-byte align (why?? could somebody confirm or refute that?). Here egcs-1.1.1 aligns the fragment to 32 bytes just fine (tried in userspace now).

Thanks.

Andrea Arcangeli
* Re: [patch] arca-vm-2.2.5
From: Stephen C. Tweedie @ 1999-04-06 14:02 UTC
To: Chuck Lever
Cc: Andrea Arcangeli, linux-kernel, linux-mm, Stephen Tweedie

Hi,

On Mon, 5 Apr 1999 17:31:43 -0400 (EDT), Chuck Lever <cel@monkey.org> said:

> hmmm. wouldn't you think that hashing with the low order bits in the
> offset would cause two different offsets against the same page to result
> in the hash function generating different output?

We always, always use page-aligned lookups for the page cache. (Actually, there is one exception: certain obsolete a.out binaries, which are demand-paged with the pages beginning at offset 1K into the binary. We don't support cache coherency for those, and we don't support them at all on filesystems with a >1K block size. It doesn't impact the hash issue.)

--Stephen
* Re: [patch] arca-vm-2.2.5
From: Chuck Lever @ 1999-04-06 15:38 UTC
To: Stephen C. Tweedie
Cc: Andrea Arcangeli, linux-kernel, linux-mm

On Tue, 6 Apr 1999, Stephen C. Tweedie wrote:

> On Mon, 5 Apr 1999 17:31:43 -0400 (EDT), Chuck Lever <cel@monkey.org>
> said:
>
> > hmmm. wouldn't you think that hashing with the low order bits in the
> > offset would cause two different offsets against the same page to result
> > in the hash function generating different output?
>
> We always, always use page-aligned lookups for the page cache.
> (Actually, there is one exception: certain obsolete a.out binaries, which
> are demand-paged with the pages beginning at offset 1K into the binary.
> We don't support cache coherency for those, and we don't support them at
> all on filesystems with a >1K block size. It doesn't impact the hash
> issue.)

I guess I'm confused, then. What good does this change do?

2.2.5 pagemap.h:

#define i (((unsigned long) inode)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
#define o (offset >> PAGE_SHIFT)
#define s(x) ((x)+((x)>>PAGE_HASH_BITS))
	return s(i+o) & (PAGE_HASH_SIZE-1);

2.2.5-arca pagemap.h:

	return s(i+o+offset) & (PAGE_HASH_SIZE-1);
	           ^^^^^^^

By the way, do you know if there is a special reason why PAGE_HASH_BITS is 11?

- Chuck Lever
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-06 17:16 UTC
To: Chuck Lever
Cc: Stephen C. Tweedie, linux-kernel, linux-mm

On Tue, 6 Apr 1999, Chuck Lever wrote:

> > We always, always use page-aligned lookups for the page cache.
> > (Actually, there is one exception: certain obsolete a.out binaries, which
> > are demand-paged with the pages beginning at offset 1K into the binary.
> > We don't support cache coherency for those, and we don't support them at
> > all on filesystems with a >1K block size. It doesn't impact the hash
> > issue.)
>
> I guess I'm confused, then. What good does this change do?

Hmm, I think I misunderstood your point, Chuck. I thought you were complaining about the fact that some hash entries could be unused and others overloaded, but I was assuming that the offset is always page-aligned.

I could write a simulation to check the hash function...

Andrea Arcangeli
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-06 18:07 UTC
To: Chuck Lever
Cc: Stephen C. Tweedie, linux-kernel, linux-mm

On Tue, 6 Apr 1999, Andrea Arcangeli wrote:

> I could write a simulation to check the hash function...

I was looking at the inode pointer part of the hash function, and I think something like this should be better:

Index: include/linux/pagemap.h
===================================================================
RCS file: /var/cvs/linux/include/linux/pagemap.h,v
retrieving revision 1.1.2.14
diff -u -r1.1.2.14 pagemap.h
--- pagemap.h	1999/04/05 23:33:20	1.1.2.14
+++ linux/include/linux/pagemap.h	1999/04/06 18:08:32
@@ -32,7 +39,7 @@
  */
 static inline unsigned long _page_hashfn(struct inode * inode, unsigned long offset)
 {
-#define i (((unsigned long) inode)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
+#define i (((unsigned long) inode-PAGE_OFFSET)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
 #define o (offset >> PAGE_SHIFT)
 #define s(x) ((x)+((x)>>PAGE_HASH_BITS))
 	return s(i+o) & (PAGE_HASH_SIZE-1);

I am not sure whether it will make a difference, but at least it looks smarter to me, because it removes some uninteresting information from the input of the hash.
* Re: [patch] arca-vm-2.2.5
From: Stephen C. Tweedie @ 1999-04-06 21:22 UTC
To: Andrea Arcangeli
Cc: Chuck Lever, Stephen C. Tweedie, linux-kernel, linux-mm

Hi,

On Tue, 6 Apr 1999 20:07:57 +0200 (CEST), Andrea Arcangeli <andrea@e-mind.com> said:

> I was looking at the inode pointer part of the hash function, and I think
> something like this should be better.

> -#define i (((unsigned long) inode)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
> +#define i (((unsigned long) inode-PAGE_OFFSET)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))

This just ends up adding or subtracting a constant in the hash function, so it won't have any effect at all on the occupancy distribution of the hash buckets.

--Stephen
* Re: [patch] arca-vm-2.2.5
From: Ingo Molnar @ 1999-04-06 22:19 UTC
To: Stephen C. Tweedie
Cc: Andrea Arcangeli, Chuck Lever, linux-kernel, linux-mm

On Tue, 6 Apr 1999, Stephen C. Tweedie wrote:

> -#define i (((unsigned long) inode)/(sizeof(struct inode) \
>			& ~ (sizeof(struct inode) - 1)))
> +#define i (((unsigned long) inode-PAGE_OFFSET)/(sizeof(struct inode) \
>			& ~ (sizeof(struct inode) - 1)))
>
> This just ends up adding or subtracting a constant in the hash function,
> so it won't have any effect at all on the occupancy distribution of the
> hash buckets.

By the way, shouldn't it rather be something like:

#define log2(x) \
({ \
	int __res; \
 \
	switch (x) { \
	case 0x040 ... 0x07f: __res = 0x040; break; \
	case 0x080 ... 0x0ff: __res = 0x080; break; \
	case 0x100 ... 0x1ff: __res = 0x100; break; \
	case 0x200 ... 0x3ff: __res = 0x200; break; \
	case 0x400 ... 0x7ff: __res = 0x400; break; \
	case 0x800 ... 0xfff: __res = 0x800; break; \
	} \
	__res; \
})

#define i (((unsigned long) inode)/log2(sizeof(struct inode)))

because otherwise we 'over-estimate' and include systematic bits in the supposedly random index. E.g. in 2.2.5 the size of struct inode is 0x110, which means 'i' will take the values 0x11, 0x22, 0x33, ... for inodes within the same page (we'd rather have i == 0x01, 0x02, 0x03, ...).

The hash was designed when struct inode had a size of 512, I think, for which the original #define was correct. But for 0x110 it isn't, IMHO. (My macro above produces i == 0, 1, 2, ..., 0xf iterated over inode addresses within the same physical page.)

-- mingo
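A quick user-space check of Ingo's arithmetic (a sketch: it assumes sizeof(struct inode) == 0x110 as in 2.2.5, and the base address is arbitrary):

#include <stdio.h>

#define INODE_SIZE 0x110UL

int main(void)
{
        /* lowest set bit of 0x110 is 0x10, so the stock divisor is 16
         * and the index steps by 0x110/16 = 0x11 per inode; the nearest
         * power of two below 0x110 is 0x100, giving steps of about 1 */
        unsigned long stock_div = INODE_SIZE & ~(INODE_SIZE - 1);
        unsigned long pow2_div  = 0x100;
        unsigned long base = 0xc0100000UL;
        int n;

        for (n = 0; n < 4; n++) {
                unsigned long addr = base + n * INODE_SIZE;
                printf("inode %d: stock i=0x%lx  pow2 i=0x%lx\n",
                       n, addr / stock_div, addr / pow2_div);
        }
        return 0;
}

With the stock divisor, consecutive inodes land 0x11 index slots apart, which is the systematic stride Ingo objects to.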
* Re: [patch] arca-vm-2.2.5
From: David Miller @ 1999-04-06 22:40 UTC
To: mingo
Cc: sct, andrea, cel, linux-kernel, linux-mm

   On Tue, 6 Apr 1999, Stephen C. Tweedie wrote:

   > -#define i (((unsigned long) inode)/(sizeof(struct inode) \
   >			& ~ (sizeof(struct inode) - 1)))
   > +#define i (((unsigned long) inode-PAGE_OFFSET)/(sizeof(struct inode) \
   >			& ~ (sizeof(struct inode) - 1)))

   ...

   By the way, shouldn't it rather be something like:

   #define log2(x) \

Look at the code that just the 'i' in question outputs :-)

	mov	inode, %o0
	srlx	%o0, 3, %o4

So on sparc64, at least, it amounts to "inode >> 3". So:

	(sizeof(struct inode) & ~ (sizeof(struct inode) - 1))

is 8 on sparc64. The 'i' construct is just meant to get rid of the "non-significant" lower bits of the inode pointer, and it does so very nicely. :-)

Later,
David S. Miller
davem@redhat.com
* Re: [patch] arca-vm-2.2.5
From: Ingo Molnar @ 1999-04-06 22:49 UTC
To: David Miller
Cc: sct, andrea, cel, linux-kernel, linux-mm

On Tue, 6 Apr 1999, David Miller wrote:

> #define log2(x) \
>
> Look at the code that just the 'i' in question outputs :-)
>
>	mov	inode, %o0
>	srlx	%o0, 3, %o4
>
> So on sparc64, at least, it amounts to "inode >> 3". So:

(yep, it's the same on x86 too, given sizeof(struct inode) == 0x110)

> (sizeof(struct inode) & ~ (sizeof(struct inode) - 1))
>
> is 8 on sparc64. The 'i' construct is just meant to get rid of the
> "non-significant" lower bits of the inode pointer, and it does so very
> nicely. :-)

But it's not just the lower 3 bits that are insignificant: it's 8 insignificant bits (8.1, actually :). It should be 'inode >> 8' (which is what the log2 solution does). Unless I'm misunderstanding something.

The thing we are, AFAIU, trying to avoid is 'x/sizeof(inode)', which can be a costly division. So I've substituted 'x/nearest_power_of_2(sizeof(inode))', which is just as good at getting rid of insignificant bits.

-- mingo
* Re: [patch] arca-vm-2.2.5
From: David Miller @ 1999-04-06 22:53 UTC
To: mingo
Cc: sct, andrea, cel, linux-kernel, linux-mm

   It should be 'inode >> 8' (which is what the log2 solution does).
   Unless I'm misunderstanding something.

Consider that:

	(((unsigned long) inode) >> (sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))

sort of approximates this and avoids the funny-looking log2 macro. :-)

Later,
David S. Miller
davem@redhat.com
* Re: [patch] arca-vm-2.2.5
From: Gabriel Paubert @ 1999-04-07 15:59 UTC
To: davem
Cc: mingo, sct, andrea, cel, linux-kernel, linux-mm

On Tue, 6 Apr 1999, David Miller wrote:

>    Date: Wed, 7 Apr 1999 00:49:18 +0200 (CEST)
>    From: Ingo Molnar <mingo@chiara.csoma.elte.hu>
>
>    It should be 'inode >> 8' (which is what the log2 solution does).
>    Unless I'm misunderstanding something.
>
> Consider that:
>
> (((unsigned long) inode) >> (sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
>
> sort of approximates this and avoids the funny-looking log2 macro. :-)

May I disagree? Compute this expression in the case where sizeof(struct inode) is a large power of 2, say 0x100: the shift count becomes (0x100 & ~0xff), or 0x100. Shifts by amounts larger than or equal to the word size are undefined in C, AFAIR (and in practice, most architectures take the shift count modulo some power of 2).

I have often needed a log2 estimate of integer values, but I don't know of any trick or expression that makes it fast on machines which lack an instruction counting the number of most significant zero bits. (It is trivial to count the least significant zero bits if you have an instruction counting the most significant zero bits, but not the other way around.)

Regards,
Gabriel.
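For reference, the binary-search approach Gabriel describes translates into loop-free C along these lines (a sketch of mine, not from the HP-UX documentation; 32-bit values, rounds down, undefined for x == 0):

#include <stdio.h>

static unsigned int log2_u32(unsigned long x)
{
        unsigned int r = 0;

        /* halve the remaining range at each step */
        if (x >= 0x10000UL) { x >>= 16; r += 16; }
        if (x >= 0x100)     { x >>= 8;  r += 8; }
        if (x >= 0x10)      { x >>= 4;  r += 4; }
        if (x >= 4)         { x >>= 2;  r += 2; }
        if (x >= 2)         {           r += 1; }
        return r;
}

int main(void)
{
        printf("log2(0x110) = %u\n", log2_u32(0x110));  /* prints 8 */
        printf("log2(512)   = %u\n", log2_u32(512));    /* prints 9 */
        return 0;
}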
* Re: [patch] arca-vm-2.2.5
From: Arvind Sankar @ 1999-04-07 21:07 UTC
To: Gabriel Paubert, davem
Cc: mingo, sct, andrea, cel, linux-kernel, linux-mm

On Wed, Apr 07, 1999 at 05:59:04PM +0200, Gabriel Paubert wrote:
> On Tue, 6 Apr 1999, David Miller wrote:
>
> > Consider that:
> >
> > (((unsigned long) inode) >> (sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
> >
> > sort of approximates this and avoids the funny-looking log2 macro. :-)
>
> May I disagree? Compute this expression in the case where sizeof(struct
> inode) is a large power of 2, say 0x100: the shift count becomes
> (0x100 & ~0xff), or 0x100. Shifts by amounts larger than or equal to the
> word size are undefined in C, AFAIR (and in practice, most architectures
> take the shift count modulo some power of 2).

A typo there, I guess: the >> should be an integer division. Since the divisor is a constant power of 2, the compiler will optimize it into a shift.

-- arvind
* Re: [patch] arca-vm-2.2.5
From: Eric W. Biederman @ 1999-04-09  6:58 UTC
To: Arvind Sankar
Cc: Gabriel Paubert, davem, mingo, sct, andrea, cel, linux-kernel, linux-mm

>>>>> "AS" == Arvind Sankar <arvinds@MIT.EDU> writes:

AS> On Wed, Apr 07, 1999 at 05:59:04PM +0200, Gabriel Paubert wrote:
>> May I disagree? Compute this expression in the case where sizeof(struct
>> inode) is a large power of 2, say 0x100: the shift count becomes
>> (0x100 & ~0xff), or 0x100. Shifts by amounts larger than or equal to the
>> word size are undefined in C, AFAIR (and in practice, most architectures
>> take the shift count modulo some power of 2).

AS> A typo there, I guess: the >> should be an integer division. Since the
AS> divisor is a constant power of 2, the compiler will optimize it into a
AS> shift.

Actually, I believe:

#define DIVISOR(x) (x & ~((x >> 1) | ~(x >> 1)))

(((unsigned long) inode) / DIVISOR(sizeof(struct inode)))

is the magic formula. A smart compiler can figure out the shift, and the DIVISOR macro makes x into a power of two.

Eric
* Re: [patch] arca-vm-2.2.5
From: Gabriel Paubert @ 1999-04-09  9:27 UTC
To: Eric W. Biederman
Cc: Arvind Sankar, davem, mingo, sct, andrea, cel, linux-kernel, linux-mm

On 9 Apr 1999, Eric W. Biederman wrote:

> AS> A typo there, I guess: the >> should be an integer division. Since the
> AS> divisor is a constant power of 2, the compiler will optimize it into a
> AS> shift.
>
> Actually, I believe:
>
> #define DIVISOR(x) (x & ~((x >> 1) | ~(x >> 1)))
            ^^^^^^^^^^^^^^^^^^^^^^^

Interesting formula. Unless I'm wrong, set y = x>>1 and evaluate it again:

	~(y | ~y)

which should give zero on any binary machine. So I think you've come up with a sophisticated way of generating an OOPS :-) at least on architectures which don't silently and happily divide by zero.

I've needed it quite often, but I don't know of any short formula which computes the log2 of an integer (whether rounded up or down does not matter) using standard C operators. I've looked quite carefully, and I don't even think it is possible: there is an example of how to do this in the HP-UX assembler documentation; it takes 18 machine instructions (no loops, performing a binary search), written by people who, as you would expect, know the tricks of the architecture very well (cascaded conditional nullification to perform a branchless if-then-else: the "then" branch is a shift, the "else" branch is an add).

Code size and time are not the problem here, since it is a compile-time constant, but trying to find a simple C expression to approximate log2 is probably futile (and not worth the effort in this case).

Gabriel.
* Re: [patch] arca-vm-2.2.5
From: Eric W. Biederman @ 1999-04-09 15:40 UTC
To: Gabriel Paubert
Cc: Eric W. Biederman, Arvind Sankar, davem, mingo, sct, andrea, cel, linux-kernel, linux-mm

>>>>> "GP" == Gabriel Paubert <paubert@iram.es> writes:

GP> Interesting formula. Unless I'm wrong, set y = x>>1 and evaluate it
GP> again: ~(y | ~y), which should give zero on any binary machine.

Duh. I forgot that the high bits get set.

GP> So I think you've come up with a sophisticated way of generating an
GP> OOPS :-) at least on architectures which don't silently and happily
GP> divide by zero.

GP> I've needed it quite often, but I don't know of any short formula which
GP> computes the log2 of an integer (whether rounded up or down does not
GP> matter) using standard C operators.

Well, I wasn't trying for log2, but rather for truncating to the nearest power of two, in a form the compiler can compute at compile time.

GP> Code size and time are not the problem here, since it is a compile-time
GP> constant, but trying to find a simple C expression to approximate log2
GP> is probably futile (and not worth the effort in this case).

True, but it is fun :)

Eric
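For what it's worth, the narrower thing Eric was after, truncating a compile-time constant to the power of two below it, can be written with standard C operators by smearing the top set bit downwards. A user-space sketch (mine, not from the thread; assumes values fit in 32 bits, and is meaningless for x == 0):

#include <stdio.h>

/* cumulative bit-smear: after S16, all bits below the top set bit are 1 */
#define S1(x)  ((x) | ((x) >> 1))
#define S2(x)  (S1(x) | (S1(x) >> 2))
#define S4(x)  (S2(x) | (S2(x) >> 4))
#define S8(x)  (S4(x) | (S4(x) >> 8))
#define S16(x) (S8(x) | (S8(x) >> 16))

/* 2^(k+1)-1 halved and incremented gives 2^k */
#define POW2_FLOOR(x) ((S16(x) / 2) + 1)

int main(void)
{
        /* 0x110 -> 0x100, and a power of two maps to itself */
        printf("0x%lx\n", (unsigned long) POW2_FLOOR(0x110UL));
        printf("0x%lx\n", (unsigned long) POW2_FLOOR(0x200UL));
        return 0;
}

The expression expands textually but folds to a constant at compile time, so it would still cost nothing as a divisor.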
* Re: [patch] arca-vm-2.2.5
From: Carlo Daffara @ 1999-04-08  8:09 UTC
To: andreae
Cc: linux-kernel, linux-mm

I have tried your arca-vm patches, and I must say this is by far the best for interactive response. We are making a custom low-cost workstation based on GNOME (with only 32M of RAM), and 2.2.5 with your patch makes it perfectly usable even with multiple applets and Netscape running. I have seen no strange side effects, even after VERY heavy usage.

Any chance that your VM will be part of a future 2.2 or (eventually) 2.3? Maybe as a "pluggable" VM?

Best regards, and keep up the great work!

Carlo Daffara
Conecta Telematica

PS: by the way, 2.2.5 compiled with egcs 2.91.66 is perfectly fine...
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-06 22:31 UTC
To: Stephen C. Tweedie
Cc: Chuck Lever, linux-kernel, linux-mm

On Tue, 6 Apr 1999, Stephen C. Tweedie wrote:

> > -#define i (((unsigned long) inode)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
> > +#define i (((unsigned long) inode-PAGE_OFFSET)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
>
> This just ends up adding or subtracting a constant in the hash function,
> so it won't have any effect at all on the occupancy distribution of the
> hash buckets.

My point is that PAGE_HASH_BITS is < 32/2.

Andrea Arcangeli
* Re: [patch] arca-vm-2.2.5
From: Chuck Lever @ 1999-04-06 20:47 UTC
To: Andrea Arcangeli
Cc: Stephen C. Tweedie, linux-kernel, linux-mm

On Tue, 6 Apr 1999, Andrea Arcangeli wrote:

> > I guess I'm confused, then. What good does this change do?
>
> Hmm, I think I misunderstood your point, Chuck. I thought you were
> complaining about the fact that some hash entries could be unused and
> others overloaded

Yes, you understood correctly.

> but I was assuming that the offset is always page-aligned.
> I could write a simulation to check the hash function...

I didn't realize that the "offset" argument would always be page-aligned. But still, why does it help to add the unshifted "offset"? There doesn't seem to be any new information in it.

- Chuck Lever
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-06 21:04 UTC
To: Chuck Lever
Cc: Stephen C. Tweedie, linux-kernel, linux-mm

On Tue, 6 Apr 1999, Chuck Lever wrote:

> But still, why does it help to add the unshifted "offset"? There doesn't
> seem to be any new information in it.

The unshifted offset is useful for swap cache pages. When the page is a swap cache page, page->offset really is the swap entry that tells us where the page has been swapped out, and in that case page->offset is not page-aligned.

Andrea Arcangeli
* Re: [patch] arca-vm-2.2.5
From: Stephen C. Tweedie @ 1999-04-06 21:11 UTC
To: Chuck Lever
Cc: Andrea Arcangeli, Stephen C. Tweedie, linux-kernel, linux-mm

Hi,

On Tue, 6 Apr 1999 16:47:23 -0400 (EDT), Chuck Lever <cel@monkey.org> said:

> > but I was assuming that the offset is always page-aligned.
> > I could write a simulation to check the hash function...
>
> I didn't realize that the "offset" argument would always be page-aligned.
> But still, why does it help to add the unshifted "offset"? There doesn't
> seem to be any new information in it.

It is always aligned for the page cache (hence mixing the lower bits into the hash function shouldn't change anything there), but the swap cache uses the lower bits extensively, concentrating the swap cache into just a few hash buckets unless we make this change.

--Stephen
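A toy user-space simulation of Stephen's point (mine; the synthetic non-page-aligned offsets below merely stand in for swap entries and are not the real swap entry encoding):

#include <stdio.h>

#define PAGE_SHIFT 12
#define HASH_BITS  11
#define HASH_SIZE  (1 << HASH_BITS)
#define s(x) ((x) + ((x) >> HASH_BITS))

int main(void)
{
        static int old_hit[HASH_SIZE], new_hit[HASH_SIZE];
        unsigned long i = 0x1234;       /* one fixed "inode" term */
        unsigned long off, used_old = 0, used_new = 0;
        int k;

        /* fake swap entries: low bits vary, page-number bits barely do */
        for (k = 0; k < 4096; k++) {
                off = (unsigned long)k * 8 + 1;   /* not page-aligned */
                old_hit[s(i + (off >> PAGE_SHIFT)) & (HASH_SIZE - 1)]++;
                new_hit[s(i + (off >> PAGE_SHIFT) + off) & (HASH_SIZE - 1)]++;
        }
        for (k = 0; k < HASH_SIZE; k++) {
                used_old += old_hit[k] != 0;
                used_new += new_hit[k] != 0;
        }
        /* old: a handful of buckets; new: hundreds */
        printf("buckets used: old=%lu new=%lu of %d\n",
               used_old, used_new, HASH_SIZE);
        return 0;
}

With only the shifted term, all 4096 fake entries collapse into the few buckets reachable from off >> PAGE_SHIFT; mixing in the raw offset spreads them widely, which is exactly the effect of the s(i+o+offset) change.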
* Re: [patch] arca-vm-2.2.5
From: Stephen C. Tweedie @ 1999-04-06 14:00 UTC
To: Andrea Arcangeli
Cc: Chuck Lever, linux-kernel, linux-mm, Stephen C. Tweedie, Linus Torvalds

Hi,

On Mon, 5 Apr 1999 02:22:35 +0200 (CEST), Andrea Arcangeli <andrea@e-mind.com> said:

> The page hash function change is from Stephen (I did it here too because
> I completely agreed with it). The point is that shm entries use the lower
> bits of the pagemap->offset field.

_All_ swap entries do. shm entries never enter the page cache, so they are not a problem, but the swap cache _is_ a problem.

> Eh, my shrink_mmap() is black magic and it's a long story to explain
> what I was thinking ;).

It is hard to have a meaningful discussion about it otherwise!

--Stephen
* Re: [patch] arca-vm-2.2.5
From: Andrea Arcangeli @ 1999-04-06 16:29 UTC
To: Stephen C. Tweedie
Cc: Chuck Lever, linux-kernel, linux-mm

On Tue, 6 Apr 1999, Stephen C. Tweedie wrote:

> _All_ swap entries do. shm entries never enter the page cache, so they
> are not a problem, but the swap cache _is_ a problem.

My mistake, excuse me.

Andrea Arcangeli