* classzone-VM + mapped pages out of lru_cache @ 2000-05-03 16:26 Andrea Arcangeli 2000-05-04 0:42 ` David S. Miller 2000-05-04 14:40 ` Juan J. Quintela 0 siblings, 2 replies; 23+ messages in thread From: Andrea Arcangeli @ 2000-05-03 16:26 UTC (permalink / raw) To: linux-mm; +Cc: linux-kernel, quintela This patch will convert the internal pg_data_t design from zone to classzone to give correctness to the memory balancing and memory allocation decisions. It also moves the lru_cache inside the pg_data_t for NUMA. It also splits the LRU in two parts, one for swap cache and one for the more interesting cache and last but not the least it keeps mapped pages without overlapped buffer headers out of the lru so that we don't waste time in shrink_mmap trying to release mapped pages when we have to shrink the cache. ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.3/2.3.99-pre7-pre3/classzone-18.gz It gives me smoother swap behaviour since the swap cache hardly pollutes the lru_cache now. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-03 16:26 classzone-VM + mapped pages out of lru_cache Andrea Arcangeli @ 2000-05-04 0:42 ` David S. Miller 2000-05-04 10:00 ` Andrea Arcangeli 2000-05-04 14:40 ` Juan J. Quintela 1 sibling, 1 reply; 23+ messages in thread From: David S. Miller @ 2000-05-04 0:42 UTC (permalink / raw) To: andrea; +Cc: linux-mm, linux-kernel, quintela ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.3/2.3.99-pre7-pre3/classzone-18.gz Btw, the path seem to be incorrect. It should be: /pub/linux/kernel/people/andrea/patches/v2.3/2.3.99-pre7-pre3/classzone-18.gz :-) One note after initial study. I wish we could get rid of the "map_count" thing you added to the page struct. Currently, when we turn off wait queue debugging, the page struct is an exact power of 2 on both 64-bit and 32-bit architectures. With the map_count there now, it will not be an exact power of two in size on 32-bit machines :-( Later, David S. Miller davem@redhat.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 0:42 ` David S. Miller @ 2000-05-04 10:00 ` Andrea Arcangeli 0 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2000-05-04 10:00 UTC (permalink / raw) To: David S. Miller; +Cc: linux-mm, linux-kernel, quintela On Wed, 3 May 2000, David S. Miller wrote: > Date: Wed, 3 May 2000 18:26:19 +0200 (CEST) > From: Andrea Arcangeli <andrea@suse.de> > > ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.3/2.3.99-pre7-pre3/classzone-18.gz > >Btw, the path seem to be incorrect. It should be: > >/pub/linux/kernel/people/andrea/patches/v2.3/2.3.99-pre7-pre3/classzone-18.gz > >:-) Correct, I'm sorry for the mistake. All single patches are in the andrea/patches directory indeed. >One note after initial study. I wish we could get rid of the >"map_count" thing you added to the page struct. Currently, when I understand your point. My only problem is that if we'll drop the map_count then I must be allowed to say "map_count = page_count(page) - 1". That's definitely not always true (just look the middle of shrink_mmap). Then all the places that does __find_lock_page are non trivial and they should all be audited and they would break if somebody won't be carefuul about the dependency on the map count and page count. Also the msync path increases the page count for a very minor reason: only to have a common exit path for the invalidate case. That trick is currently sane since the page count can be increased at any time for any reason as far as memory doesn't leak but such trick breaks the invariant I would need to enforce in order to drop map_count. So I need further information anyway (the PageOutLru bitflag wouldn't be enough either) and so I thought I can as well grab an integer for its dedicated purpose. The above are the only reasons that made me to add the map_count. I preferred to not break the current relaxed page->count semantic and to allow everybody to grab as many times they want the page_count as now, and the page count will remain relevant only to the freelist, and it have no meaning respect to the number of ptes where the page is mapped to. Basically using page_count(page) for the map count seems a pain and something we don't want to do as far I can see. If you have clever idea on how to get rid of the map_count they will be very appreciated indeed. Thanks. >we turn off wait queue debugging, the page struct is an exact power >of 2 on both 64-bit and 32-bit architectures. With the map_count >there now, it will not be an exact power of two in size on 32-bit >machines :-( I seen the problem with 32bit archs since the first place (and the problem is not only alignment but memory waste in general) but my argument here is that we'd better choose to drop page->virtual if HIGHMEM is disabled (that field is very less useful than map_count when HIGHMEM is disabled and so it should go away from the binary image first IMHO ;-). (and page->virtual should be dropped unconditionally on 64bit archs... even when HIGHMEM is enabled) On 64bit archs the map_count doesn't waste memory anyway because it gets packeted in the same word with the 32bit page->count (both count and map_count are 32bits wide) while the page->virtual instead waste lots of memory even now on 64bit archs. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-03 16:26 classzone-VM + mapped pages out of lru_cache Andrea Arcangeli 2000-05-04 0:42 ` David S. Miller @ 2000-05-04 14:40 ` Juan J. Quintela 2000-05-04 15:19 ` Andrea Arcangeli 2000-05-04 16:34 ` Juan J. Quintela 1 sibling, 2 replies; 23+ messages in thread From: Juan J. Quintela @ 2000-05-04 14:40 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm, linux-kernel >>>>> "andrea" == Andrea Arcangeli <andrea@suse.de> writes: Hi andrea> It gives me smoother swap behaviour since the swap cache hardly pollutes andrea> the lru_cache now. Andrea I have run here last night your patch (classzone-18) against pre7-3 in one machine and vanilla pre7-3 in other machine. The test was from my memtest suite: while (true); do time ./mmap002; done suited to the memory of my system (Uniprocessor 96MB) The results are very good for your patch: Vanilla pre7-3 pre7-3+classzone-18 real 3m29.926s real 2m10.210s user 0m15.280s real 2m10.210s sys 0m20.500s real 2m10.210s That are the normal times. classzone patches variations are very low (all the iterations are between 2m08 and 2m10). But in vanilla pre7-3, the variations are higher: between 3m4 and 4m20, and the worst part, when the kswapd problem appear, the program takes until 36m20 (yes 36, ten times more, is not a typo). Furthermore, vanilla pre7-3 kill the process after 2 hours and a half, classzone works for more than 12 hours without a problem. That is the description of the situation. Andrea, why do you reverse the patch of filemap.c:truncate_inode_pages() of using TryLockPage()? That change was proposed by Rik due to an Oops that I get here. It was one of the non-easily reproducible ones that I get. Later, Juan. -- In theory, practice and theory are the same, but in practice they are different -- Larry McVoy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 14:40 ` Juan J. Quintela @ 2000-05-04 15:19 ` Andrea Arcangeli 2000-05-04 15:23 ` Andrea Arcangeli ` (3 more replies) 2000-05-04 16:34 ` Juan J. Quintela 1 sibling, 4 replies; 23+ messages in thread From: Andrea Arcangeli @ 2000-05-04 15:19 UTC (permalink / raw) To: Juan J. Quintela; +Cc: linux-mm, linux-kernel, trond.myklebust On 4 May 2000, Juan J. Quintela wrote: >suited to the memory of my system (Uniprocessor 96MB) Do you have also an SMP to try it out too? On IA32-SMP and alpha-SMP I definitely can't reproduce the pico lockup reported to me a few hours ago. >The results are very good for your patch: > >Vanilla pre7-3 pre7-3+classzone-18 >real 3m29.926s real 2m10.210s >user 0m15.280s real 2m10.210s >sys 0m20.500s real 2m10.210s ;) >That is the description of the situation. Andrea, why do you reverse >the patch of filemap.c:truncate_inode_pages() of using TryLockPage()? Because it's not necessary as far I can tell. Only one truncate_inode_pages() can run at once and none read or write can run under truncate_inode_pages(). This should be enforced by the VFS, and if that doesn't happen the truncate_inode_pages changes that gone into pre6 (and following) hides the real bug. >That change was proposed by Rik due to an Oops that I get here. It Can I see the Oops? >was one of the non-easily reproducible ones that I get. Maybe you can try to write/read under a truncate or something like that. I've not yet checked if there are still races in the VFS between read/write/truncate. BTW, maybe if you was using NFS you got in troubles with invalidate_inode_pages. New invalidate_inode_pages in classzone patch won't race with the old truncate_inode_page. Also without classzone-VM-18 invalidate_inode_pages can crash the kerne because _nobody_ can unlink a mapped page-cache from the cache without first clearing the pte (and flushing the page to disk if the pte happened to be dirty). So if invalidate_inode_pages() runs under an inode map-shared in memory you'll get a lockup as soon as you try to sync the page to disk. I didn't tried to reproduce but I only read the code. However if I am right about this reproducing should be easy. You have to MAP_SHARED on the client a file, touch the shared mapping so that some pte become dirty, now overwrite the file on the server and then push the client low on memory so that swap_out will try to unmap the page -> crash. This is a security issue also for 2.2.x, fix is in classzone-VM-18 where I don't unmap the page if the page count is > 1 (so if the page have no chance to be mapped). Effect of the fix is that you can't MAP_SHARED and change the file from more than one client or you have to expect inchoerency between the cache copies. This untested patch should fix the problem also in 2.2.15 (the same way I fixed it in classzone patch): --- 2.2.15/mm/filemap.c Thu May 4 13:00:40 2000 +++ /tmp/filemap.c Thu May 4 17:11:18 2000 @@ -68,7 +68,7 @@ p = &inode->i_pages; while ((page = *p) != NULL) { - if (PageLocked(page)) { + if (PageLocked(page) || atomic_read(&page->count) > 1) { p = &page->next; continue; } Trond, what do you think about it? Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 15:19 ` Andrea Arcangeli @ 2000-05-04 15:23 ` Andrea Arcangeli 2000-05-04 15:38 ` Rik van Riel ` (2 subsequent siblings) 3 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2000-05-04 15:23 UTC (permalink / raw) To: Juan J. Quintela; +Cc: linux-mm, linux-kernel, trond.myklebust On Thu, 4 May 2000, Andrea Arcangeli wrote: >--- 2.2.15/mm/filemap.c Thu May 4 13:00:40 2000 >+++ /tmp/filemap.c Thu May 4 17:11:18 2000 >@@ -68,7 +68,7 @@ > > p = &inode->i_pages; > while ((page = *p) != NULL) { >- if (PageLocked(page)) { >+ if (PageLocked(page) || atomic_read(&page->count) > 1) { > p = &page->next; > continue; > } above patch is also here: ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.15/invalidate_inode_pages-1 Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 15:19 ` Andrea Arcangeli 2000-05-04 15:23 ` Andrea Arcangeli @ 2000-05-04 15:38 ` Rik van Riel 2000-05-04 17:59 ` Andrea Arcangeli 2000-05-04 16:34 ` Manfred Spraul, Andrea Arcangeli 2000-05-04 16:48 ` Trond Myklebust 3 siblings, 1 reply; 23+ messages in thread From: Rik van Riel @ 2000-05-04 15:38 UTC (permalink / raw) To: Andrea Arcangeli Cc: Juan J. Quintela, linux-mm, linux-kernel, trond.myklebust On Thu, 4 May 2000, Andrea Arcangeli wrote: > --- 2.2.15/mm/filemap.c Thu May 4 13:00:40 2000 > +++ /tmp/filemap.c Thu May 4 17:11:18 2000 > @@ -68,7 +68,7 @@ > > p = &inode->i_pages; > while ((page = *p) != NULL) { > - if (PageLocked(page)) { > + if (PageLocked(page) || atomic_read(&page->count) > 1) { > p = &page->next; > continue; > } Fun, fun, fun ... So the other CPU takes a lock on the page while we're testing for the page->count and increments the pagecount after the lock, while we try to do something (call __free_page(page)?) with the page ... As long as the other cpu increments the page count quick enough it should be ok, but when it doesn't we can *still* end up freeing a locked page. I've seen backtraces where __free_pages_ok() Oopsed on PageLocked(page) and the function was called from truncate_inode_pages(). The fix which is in the latest kernel from Linus fixed the bug for those people. Stubbornly reversing the fix because you haven't managed to reproduce it yet is most probably not the right thing to do. regards, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 15:38 ` Rik van Riel @ 2000-05-04 17:59 ` Andrea Arcangeli 2000-05-04 19:24 ` Rik van Riel 0 siblings, 1 reply; 23+ messages in thread From: Andrea Arcangeli @ 2000-05-04 17:59 UTC (permalink / raw) To: riel; +Cc: Juan J. Quintela, linux-mm, linux-kernel, trond.myklebust On Thu, 4 May 2000, Rik van Riel wrote: >On Thu, 4 May 2000, Andrea Arcangeli wrote: > >> --- 2.2.15/mm/filemap.c Thu May 4 13:00:40 2000 ^^^^^^ >> +++ /tmp/filemap.c Thu May 4 17:11:18 2000 >> @@ -68,7 +68,7 @@ >> >> p = &inode->i_pages; >> while ((page = *p) != NULL) { >> - if (PageLocked(page)) { >> + if (PageLocked(page) || atomic_read(&page->count) > 1) { >> p = &page->next; >> continue; >> } XXXXXXXXXXXXXXXX > >Fun, fun, fun ... > >So the other CPU takes a lock on the page while we're testing >for the page->count and increments the pagecount after the lock, >while we try to do something (call __free_page(page)?) with the >page ... You're obviously wrong: 1) the other cpu on 2.2.15 were spinning on the big kernel lock and had no way to try to lock down the page we're processing. 2) if what you described above would be true, then virgin 2.2.15 and all the 2.2.x official/unofficial kernels out there would have a major SMP race anyway (not thanks to my above fix) because it would mean that in point XXXXXXXXXXXXXXXXX the page could become locked from under us. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 17:59 ` Andrea Arcangeli @ 2000-05-04 19:24 ` Rik van Riel 0 siblings, 0 replies; 23+ messages in thread From: Rik van Riel @ 2000-05-04 19:24 UTC (permalink / raw) To: Andrea Arcangeli Cc: Juan J. Quintela, linux-mm, linux-kernel, trond.myklebust On Thu, 4 May 2000, Andrea Arcangeli wrote: > On Thu, 4 May 2000, Rik van Riel wrote: > > >On Thu, 4 May 2000, Andrea Arcangeli wrote: > > > >> --- 2.2.15/mm/filemap.c Thu May 4 13:00:40 2000 > ^^^^^^ > You're obviously wrong: > > 1) the other cpu on 2.2.15 were spinning on the big kernel lock Ooops, sorry. I had my mind wrapped around the 2.3 code so tight that I looked at the Subject and the code only and didn't spot the kernel version in the diff header ;) cheers, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 15:19 ` Andrea Arcangeli 2000-05-04 15:23 ` Andrea Arcangeli 2000-05-04 15:38 ` Rik van Riel @ 2000-05-04 16:34 ` Manfred Spraul, Andrea Arcangeli 2000-05-04 16:48 ` Trond Myklebust 3 siblings, 0 replies; 23+ messages in thread From: Manfred Spraul, Andrea Arcangeli @ 2000-05-04 16:34 UTC (permalink / raw) To: Andrea Arcangeli, Juan J. Quintela Cc: linux-mm, linux-kernel, trond.myklebust > > Because it's not necessary as far I can tell. Only one > truncate_inode_pages() can run at once and none read or write can run > under truncate_inode_pages(). This should be enforced by the VFS, and if > that doesn't happen the truncate_inode_pages changes that gone into pre6 > (and following) hides the real bug. > truncate: VFS acquires inode->i_sem semaphore.[fs/open.c, do_truncate()] write: VFS doesn't acquire the semaphore [new in 2.3], but f_op->write() could acquire the semaphore. e.g. generic_file_write() acquires the semaphore. [mm/filemap.c] read: no locking. AFAICS read & truncate could run in parallel. [I'm reading 2.3.99-pre6] -- Manfred -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 15:19 ` Andrea Arcangeli ` (2 preceding siblings ...) 2000-05-04 16:34 ` Manfred Spraul, Andrea Arcangeli @ 2000-05-04 16:48 ` Trond Myklebust 2000-05-04 18:43 ` Andrea Arcangeli 3 siblings, 1 reply; 23+ messages in thread From: Trond Myklebust @ 2000-05-04 16:48 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Juan J. Quintela, linux-mm, linux-kernel >>>>> " " == Andrea Arcangeli <andrea@suse.de> writes: > This untested patch should fix the problem also in 2.2.15 (the > same way I fixed it in classzone patch): > --- 2.2.15/mm/filemap.c Thu May 4 13:00:40 2000 > +++ /tmp/filemap.c Thu May 4 17:11:18 2000 > @@ -68,7 +68,7 @@ > p = &inode->i_pages; while ((page = *p) != NULL) { > - if (PageLocked(page)) { > + if (PageLocked(page) || atomic_read(&page->count) > 1) { > p = &page->next; continue; > } > Trond, what do you think about it? Not good. If I'm running /bin/bash, and somebody on the server updates /bin/bash, then I don't want to reboot my machine. With the above patch, then all new processes will receive a mixture of pages from the old and the new file until by some accident I manage to clear the cache of the bad pages. We have to insist on the PageLocked() both in 2.2.x and 2.3.x because only pages which are in the process of being read in are safe. If we know we're scheduled to write out a full page then that would be safe too, but that is the only such case. Cheers, Trond PS: It would be nice to have truncate_inode_pages() work in the same way as it does now: waiting on pages and locking them. This is useful for reading in the directory pages, since they need to be read in sequentially (please see the proposed patch I put on l-k earlier today). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 16:48 ` Trond Myklebust @ 2000-05-04 18:43 ` Andrea Arcangeli 2000-05-04 19:32 ` Trond Myklebust 0 siblings, 1 reply; 23+ messages in thread From: Andrea Arcangeli @ 2000-05-04 18:43 UTC (permalink / raw) To: Trond Myklebust; +Cc: Juan J. Quintela, linux-mm, linux-kernel On 4 May 2000, Trond Myklebust wrote: >Not good. If I'm running /bin/bash, and somebody on the server updates >/bin/bash, then I don't want to reboot my machine. With the above If you use rename(2) to update the shell (as you should since `cp` would corrupt also users that are reading /bin/bash from local fs) then nfs should get it right also with my patch since it should notice the inode number changed (the nfs fd handle should get the inode number as cookie), right? >We have to insist on the PageLocked() both in 2.2.x and 2.3.x because >only pages which are in the process of being read in are safe. If we >know we're scheduled to write out a full page then that would be safe >too, but that is the only such case. I'm not wondering about locking/coherency/read/writes. The only problem I am wondering about is that we simply can't unlink _mapped_ page-cache pages from the pagecache as we do now. Say there's page A in the page cache. It gets mapped into a pte of process X. Then before you can drop A from the page cache to invalidate it (because such page changed on the nfs server), you _first_ have to unmap such page from the pte of process X. This is why invalidate_inode_pages must not unlink mapped pages. It's not a locking problem, PageLocked() pagecache_lock and all other locks are irrelevant. It's not a race but a design issue. >PS: It would be nice to have truncate_inode_pages() work in the same >way as it does now: waiting on pages and locking them. This is useful >for reading in the directory pages, since they need to be read in >sequentially (please see the proposed patch I put on l-k earlier >today). I'll look at it. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 18:43 ` Andrea Arcangeli @ 2000-05-04 19:32 ` Trond Myklebust 2000-05-04 20:15 ` Andrea Arcangeli 0 siblings, 1 reply; 23+ messages in thread From: Trond Myklebust @ 2000-05-04 19:32 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Juan J. Quintela, linux-mm, linux-kernel >>>>> " " == Andrea Arcangeli <andrea@suse.de> writes: > On 4 May 2000, Trond Myklebust wrote: >> Not good. If I'm running /bin/bash, and somebody on the server >> updates /bin/bash, then I don't want to reboot my machine. With >> the above > If you use rename(2) to update the shell (as you should since > `cp` would corrupt also users that are reading /bin/bash from > local fs) then nfs should get it right also with my patch since > it should notice the inode number changed (the nfs fd handle > should get the inode number as cookie), right? Yes, but I'm on the client: I cannot guarantee that people on the server will do it 'right'. The server can have temporarily dropped down into single user mode in order to protect its own users for all I know. Accuracy has to be the first rule whatever the case. > The only problem I am wondering about is that we simply can't > unlink _mapped_ page-cache pages from the pagecache as we do > now. > Say there's page A in the page cache. It gets mapped into a pte > of process > X. Then before you can drop A from the page cache to invalidate > it > (because such page changed on the nfs server), you _first_ have > to unmap such page from the pte of process X. This is why > invalidate_inode_pages must not unlink mapped pages. It's not a > locking problem, PageLocked() pagecache_lock and all other > locks are irrelevant. It's not a race but a design issue. As far as NFS is concerned, that page is incorrect and should be read in again whenever we next try to access it. That is the purpose of the call to invalidate_inode_pages(). As far as I can see, your patch fundamentally breaks that concept for all files whether they are mmapped or not. When you say 'unmap from the pte', what exactly do you mean? Why does such a page still have to be part of an inode's i_data? Cheers, Trond -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 19:32 ` Trond Myklebust @ 2000-05-04 20:15 ` Andrea Arcangeli 2000-05-05 7:01 ` Trond Myklebust 0 siblings, 1 reply; 23+ messages in thread From: Andrea Arcangeli @ 2000-05-04 20:15 UTC (permalink / raw) To: Trond Myklebust; +Cc: Juan J. Quintela, linux-mm, linux-kernel On Thu, 4 May 2000, Trond Myklebust wrote: >Yes, but I'm on the client: I cannot guarantee that people on the >server will do it 'right'. [..] If people on the server uses `cp` to upgrade bash they will screwup themselfs and their shell will segfault from under them elventually. > [..] The server can have temporarily dropped >down into single user mode in order to protect its own users for all I >know. > >Accuracy has to be the first rule whatever the case. I fully see your point, however as far I can see (at least for 2.2.x where we probably don't want to redesign the VM rules) we have to choose between accuracy and stability and I choose stability. I much prefer to reboot cleanly the machine (or more simply unmount/remount the nfs) than to crash. That's also a local security issue, btw. >As far as NFS is concerned, that page is incorrect and should be read >in again whenever we next try to access it. That is the purpose of the >call to invalidate_inode_pages(). As far as I can see, your patch >fundamentally breaks that concept for all files whether they are >mmapped or not. It breaks the concept only for mmaped files. non mmaped files have page->count == 1 so their cache will be shrunk completly as usual. >When you say 'unmap from the pte', what exactly do you mean? Why does ^^^^^^^^^^^^^^^^^^ unmapping page from the pagetable means that later userspace won't be anymore able to read/write to the page (only kernel will have visibility on the page then and you'll read from the page in each read(2) and write(2)). A page in the cache can be mapped in several ptes and we have to unmap it from all them before we're allowed to unlink the page from the pagecache or current VM will break. >such a page still have to be part of an inode's i_data? Mapped page-cache can't be unlinked from the cache as first because when you'll have to sync the dirty shard mapping (because you run low on memory and you have to get rid of dirty data in the VM) you won't know anymore which inode and which fs the page belongs to. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 20:15 ` Andrea Arcangeli @ 2000-05-05 7:01 ` Trond Myklebust 0 siblings, 0 replies; 23+ messages in thread From: Trond Myklebust @ 2000-05-05 7:01 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Juan J. Quintela, linux-mm, linux-kernel >>>>> " " == Andrea Arcangeli <andrea@suse.de> writes: >> As far as NFS is concerned, that page is incorrect and should >> be read in again whenever we next try to access it. That is the >> purpose of the call to invalidate_inode_pages(). As far as I >> can see, your patch fundamentally breaks that concept for all >> files whether they are mmapped or not. > It breaks the concept only for mmaped files. non mmaped files > have page->count == 1 so their cache will be shrunk completly > as usual. If there are pending asynchronous writes then this is neither true in 2.2.x nor in 2.3.x. Each pending writeback has to increment the page->count in order to prevent the page from disappearing beneath it. Since a writeback does not have to involve the whole page, we cannot assume that just because the page is dirty, then it won't want to get invalidated. Imagine the scenario: Process 1 (on client 1) Process 2 (on client 2) Schedule asynchronous write on bytes 0-255 of file Write bytes 256-511 of same file. revalidate inode discover that file has changed try to invalidate page 0 Under your patch, the invalidation of page 0 will fail due to the pending writeback, and hence process 1 will never see what process 2 wrote. > unmapping page from the pagetable means that later userspace > won't be anymore able to read/write to the page (only kernel > will have visibility on the page then and you'll read from the > page in each read(2) and write(2)). A page in the cache can be > mapped in several ptes and we have to unmap it from all them > before we're allowed to unlink the page from the pagecache or > current VM will break. Could this be done as part of "invalidate_inode_pages" or would that break the VM? >> such a page still have to be part of an inode's i_data? > Mapped page-cache can't be unlinked from the cache as first > because when you'll have to sync the dirty shard mapping > (because you run low on memory and you have to get rid of dirty > data in the VM) you won't know anymore which inode and which fs > the page belongs to. You have the vma->vm_file and hence both dentry and inode. Don't forget that on NFS, the inode is just a pretty collection of statistics. It contains our estimates of the data, size, creation times... It does *not* contain sufficient information to sync a page to storage, and if the VM assumes that it does, then it is clearly broken. Under NFS all read and write operations require us to use a file handle, which is stored in the dentry, not in the inode. So you will always be required to use the vm_file in some form or other. Cheers, Trond -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 14:40 ` Juan J. Quintela 2000-05-04 15:19 ` Andrea Arcangeli @ 2000-05-04 16:34 ` Juan J. Quintela 2000-05-04 18:27 ` Chris Evans 1 sibling, 1 reply; 23+ messages in thread From: Juan J. Quintela @ 2000-05-04 16:34 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm, linux-kernel >>>>> "juan" == Juan J Quintela <quintela@fi.udc.es> writes: Hi, I screwed the times with the copy& paste, the real ones are: juan> Vanilla pre7-3 pre7-3+classzone-18 juan> real 3m29.926s real 2m10.301s juan> user 0m15.280s user 0m15.410s juan> sys 0m20.500s sys 0m8.750s Sorry for the inconveniences. Later, Juan. -- In theory, practice and theory are the same, but in practice they are different -- Larry McVoy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 16:34 ` Juan J. Quintela @ 2000-05-04 18:27 ` Chris Evans 0 siblings, 0 replies; 23+ messages in thread From: Chris Evans @ 2000-05-04 18:27 UTC (permalink / raw) To: Juan J. Quintela; +Cc: Andrea Arcangeli, linux-mm, linux-kernel On 4 May 2000, Juan J. Quintela wrote: > I screwed the times with the copy& paste, the real ones are: > > > juan> Vanilla pre7-3 pre7-3+classzone-18 > juan> real 3m29.926s real 2m10.301s I'm interested - how does the 2m10 figure compare with a run using the 2.2 kernel? Cheers Chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <3911ECCD.BA1BB24E@arcormail.de>]
* Re: classzone-VM + mapped pages out of lru_cache [not found] <3911ECCD.BA1BB24E@arcormail.de> @ 2000-05-04 23:44 ` Andrea Arcangeli 2000-05-05 0:03 ` Jens Axboe ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Andrea Arcangeli @ 2000-05-04 23:44 UTC (permalink / raw) To: Shane Shrybman, Juan J. Quintela, gandalf, Joerg Stroettchen Cc: linux-kernel, Jens Axboe, linux-mm As somebody noticed classzone-18 had a deadlock condition. It was due a silly bug and I fixed it in the new classzone-22: ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.3/2.3.99-pre7-pre4/classzone-22.gz This new one also doesn't allow read(2) to see the page that we're going to drop from truncate(2). There are no other changes. It's against 2.3.99-pre7-pre4. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 23:44 ` Andrea Arcangeli @ 2000-05-05 0:03 ` Jens Axboe 2000-05-05 3:04 ` David S. Miller 2000-05-06 13:37 ` Andrea Arcangeli 2 siblings, 0 replies; 23+ messages in thread From: Jens Axboe @ 2000-05-05 0:03 UTC (permalink / raw) To: Andrea Arcangeli Cc: Shane Shrybman, Juan J. Quintela, gandalf, Joerg Stroettchen, linux-kernel, linux-mm On Fri, May 05 2000, Andrea Arcangeli wrote: > As somebody noticed classzone-18 had a deadlock condition. It was due a > silly bug and I fixed it in the new classzone-22: Yes, this fixes the truncate bug here. No problems noticed so far. -- * Jens Axboe <axboe@suse.de> * Linux CD/DVD-ROM, SuSE Labs * http://kernel.dk -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 23:44 ` Andrea Arcangeli 2000-05-05 0:03 ` Jens Axboe @ 2000-05-05 3:04 ` David S. Miller 2000-05-05 8:43 ` Russell King 2000-05-05 14:56 ` Andrea Arcangeli 2000-05-06 13:37 ` Andrea Arcangeli 2 siblings, 2 replies; 23+ messages in thread From: David S. Miller @ 2000-05-05 3:04 UTC (permalink / raw) To: andrea Cc: shrybman, quintela, gandalf, joerg.stroettchen, linux-kernel, axboe, linux-mm Andrea, please do not pass IRQ state "flags" to another function and try to restore them in this way, it breaks Sparc and any other cpu which keeps "stack frame" state in the flags value. "flags" must be obtained and restored in the same function. You do this in your rmqueue() changes. Thanks. Later, David S. Miller davem@redhat.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-05 3:04 ` David S. Miller @ 2000-05-05 8:43 ` Russell King 2000-05-05 14:56 ` Andrea Arcangeli 1 sibling, 0 replies; 23+ messages in thread From: Russell King @ 2000-05-05 8:43 UTC (permalink / raw) To: David S. Miller Cc: andrea, shrybman, quintela, gandalf, joerg.stroettchen, linux-kernel, axboe, linux-mm David S. Miller writes: > Andrea, please do not pass IRQ state "flags" to another function > and try to restore them in this way, it breaks Sparc and any other > cpu which keeps "stack frame" state in the flags value. "flags" must > be obtained and restored in the same function. On some of the older (obsolete) ARMs, this is also not possible - the IRQ state is restored each time a function exits in kernel mode. (I'm not too concerned with these today). I've seen this done somewhere else in the kernel as well. How about changing flags to an architecture-defined struct, and doing something like: extern inline void save_flags(struct flags *flg) { #ifdef CATCH_BAD_FLAGS flg->ret = __builtin_return_address(0); #endif __save_flags(flg->flag); } extern inline void restore_flags(struct flags *flg) { #ifdef CATCH_BAD_FLAGS if (flg->ret != __builtin_return_address(0)) BUG(); #endif __restore_flags(flg->flag); } Of course, CATCH_BAD_FLAGS would be turned off for the stable series to reduce the impact of the check, but at least we could catch bad usage on development kernels easily. Of course, the above is dependent on __builtin_return_address() returning the return address of the function that these were inlined into. I'm just wondering - what about spinlocks? There are a couple of instances where a spinlock is taken in a parent function and temporarily released in one of its child functions. I'm not happy with this usage, but if this is a legal usage of the spinlocks, then the above may bite when it shouldn't. _____ |_____| ------------------------------------------------- ---+---+- | | Russell King rmk@arm.linux.org.uk --- --- | | | | http://www.arm.linux.org.uk/~rmk/aboutme.html / / | | +-+-+ --- -+- / | THE developer of ARM Linux |+| /|\ / | | | --- | +-+-+ ------------------------------------------------- /\\\ | -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-05 3:04 ` David S. Miller 2000-05-05 8:43 ` Russell King @ 2000-05-05 14:56 ` Andrea Arcangeli 1 sibling, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2000-05-05 14:56 UTC (permalink / raw) To: David S. Miller Cc: shrybman, quintela, gandalf, joerg.stroettchen, linux-kernel, axboe, linux-mm On Thu, 4 May 2000, David S. Miller wrote: >Andrea, please do not pass IRQ state "flags" to another function >and try to restore them in this way, it breaks Sparc and any other >cpu which keeps "stack frame" state in the flags value. "flags" must >be obtained and restored in the same function. Ok, thanks. Probably it worth to add a comment about this also in reschedule_idle. Fix is the same as in reschedule_idle, make rmqueue inline. Thanks. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: classzone-VM + mapped pages out of lru_cache 2000-05-04 23:44 ` Andrea Arcangeli 2000-05-05 0:03 ` Jens Axboe 2000-05-05 3:04 ` David S. Miller @ 2000-05-06 13:37 ` Andrea Arcangeli 2 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2000-05-06 13:37 UTC (permalink / raw) To: Shane Shrybman, Juan J. Quintela, gandalf, Joerg Stroettchen Cc: linux-kernel, Jens Axboe, linux-mm New classzone patch for 2.3.99-pre7-pre6 is here: ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.3/2.3.99-pre7-pre6/classzone-24.gz Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2000-05-06 13:37 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-05-03 16:26 classzone-VM + mapped pages out of lru_cache Andrea Arcangeli
2000-05-04 0:42 ` David S. Miller
2000-05-04 10:00 ` Andrea Arcangeli
2000-05-04 14:40 ` Juan J. Quintela
2000-05-04 15:19 ` Andrea Arcangeli
2000-05-04 15:23 ` Andrea Arcangeli
2000-05-04 15:38 ` Rik van Riel
2000-05-04 17:59 ` Andrea Arcangeli
2000-05-04 19:24 ` Rik van Riel
2000-05-04 16:34 ` Manfred Spraul, Andrea Arcangeli
2000-05-04 16:48 ` Trond Myklebust
2000-05-04 18:43 ` Andrea Arcangeli
2000-05-04 19:32 ` Trond Myklebust
2000-05-04 20:15 ` Andrea Arcangeli
2000-05-05 7:01 ` Trond Myklebust
2000-05-04 16:34 ` Juan J. Quintela
2000-05-04 18:27 ` Chris Evans
[not found] <3911ECCD.BA1BB24E@arcormail.de>
2000-05-04 23:44 ` Andrea Arcangeli
2000-05-05 0:03 ` Jens Axboe
2000-05-05 3:04 ` David S. Miller
2000-05-05 8:43 ` Russell King
2000-05-05 14:56 ` Andrea Arcangeli
2000-05-06 13:37 ` Andrea Arcangeli
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox