linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* classzone-VM + mapped pages out of lru_cache
@ 2000-05-03 16:26 Andrea Arcangeli
  2000-05-04  0:42 ` David S. Miller
  2000-05-04 14:40 ` Juan J. Quintela
  0 siblings, 2 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2000-05-03 16:26 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, quintela

This patch will convert the internal pg_data_t design from zone to
classzone to give correctness to the memory balancing and memory
allocation decisions. It also moves the lru_cache inside the pg_data_t for
NUMA.

It also splits the LRU in two parts, one for swap cache and one for the
more interesting cache and last but not the least it keeps mapped pages
without overlapped buffer headers out of the lru so that we don't waste
time in shrink_mmap trying to release mapped pages when we have to shrink
the cache.

	ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.3/2.3.99-pre7-pre3/classzone-18.gz

It gives me smoother swap behaviour since the swap cache hardly pollutes
the lru_cache now.

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-03 16:26 classzone-VM + mapped pages out of lru_cache Andrea Arcangeli
@ 2000-05-04  0:42 ` David S. Miller
  2000-05-04 10:00   ` Andrea Arcangeli
  2000-05-04 14:40 ` Juan J. Quintela
  1 sibling, 1 reply; 23+ messages in thread
From: David S. Miller @ 2000-05-04  0:42 UTC (permalink / raw)
  To: andrea; +Cc: linux-mm, linux-kernel, quintela

	   ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.3/2.3.99-pre7-pre3/classzone-18.gz

Btw, the path seem to be incorrect.  It should be:

/pub/linux/kernel/people/andrea/patches/v2.3/2.3.99-pre7-pre3/classzone-18.gz

:-)

One note after initial study.  I wish we could get rid of the
"map_count" thing you added to the page struct.  Currently, when
we turn off wait queue debugging, the page struct is an exact power
of 2 on both 64-bit and 32-bit architectures.  With the map_count
there now, it will not be an exact power of two in size on 32-bit
machines :-(

Later,
David S. Miller
davem@redhat.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04  0:42 ` David S. Miller
@ 2000-05-04 10:00   ` Andrea Arcangeli
  0 siblings, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2000-05-04 10:00 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-mm, linux-kernel, quintela

On Wed, 3 May 2000, David S. Miller wrote:

>   Date: 	Wed, 3 May 2000 18:26:19 +0200 (CEST)
>   From: Andrea Arcangeli <andrea@suse.de>
>
>	   ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.3/2.3.99-pre7-pre3/classzone-18.gz
>
>Btw, the path seem to be incorrect.  It should be:
>
>/pub/linux/kernel/people/andrea/patches/v2.3/2.3.99-pre7-pre3/classzone-18.gz
>
>:-)

Correct, I'm sorry for the mistake. All single patches are in the
andrea/patches directory indeed.

>One note after initial study.  I wish we could get rid of the
>"map_count" thing you added to the page struct.  Currently, when

I understand your point. My only problem is that if we'll drop the
map_count then I must be allowed to say "map_count = page_count(page) -
1". That's definitely not always true (just look the middle of
shrink_mmap).

Then all the places that does __find_lock_page are non trivial and they
should all be audited and they would break if somebody won't be carefuul
about the dependency on the map count and page count. Also the msync path
increases the page count for a very minor reason: only to have a common
exit path for the invalidate case. That trick is currently sane since the
page count can be increased at any time for any reason as far as memory
doesn't leak but such trick breaks the invariant I would need to enforce
in order to drop map_count.

So I need further information anyway (the PageOutLru bitflag wouldn't be
enough either) and so I thought I can as well grab an integer for its
dedicated purpose.

The above are the only reasons that made me to add the map_count. I
preferred to not break the current relaxed page->count semantic and to
allow everybody to grab as many times they want the page_count as now, and
the page count will remain relevant only to the freelist, and it have no
meaning respect to the number of ptes where the page is mapped to.

Basically using page_count(page) for the map count seems a pain and
something we don't want to do as far I can see.

If you have clever idea on how to get rid of the map_count they will be
very appreciated indeed. Thanks.

>we turn off wait queue debugging, the page struct is an exact power
>of 2 on both 64-bit and 32-bit architectures.  With the map_count
>there now, it will not be an exact power of two in size on 32-bit
>machines :-(

I seen the problem with 32bit archs since the first place (and the problem
is not only alignment but memory waste in general) but my argument here is
that we'd better choose to drop page->virtual if HIGHMEM is disabled (that
field is very less useful than map_count when HIGHMEM is disabled and so
it should go away from the binary image first IMHO ;-). (and page->virtual
should be dropped unconditionally on 64bit archs... even when HIGHMEM is
enabled)

On 64bit archs the map_count doesn't waste memory anyway because it gets
packeted in the same word with the 32bit page->count (both count and
map_count are 32bits wide) while the page->virtual instead waste lots of
memory even now on 64bit archs.

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-03 16:26 classzone-VM + mapped pages out of lru_cache Andrea Arcangeli
  2000-05-04  0:42 ` David S. Miller
@ 2000-05-04 14:40 ` Juan J. Quintela
  2000-05-04 15:19   ` Andrea Arcangeli
  2000-05-04 16:34   ` Juan J. Quintela
  1 sibling, 2 replies; 23+ messages in thread
From: Juan J. Quintela @ 2000-05-04 14:40 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, linux-kernel

>>>>> "andrea" == Andrea Arcangeli <andrea@suse.de> writes:

Hi

andrea> It gives me smoother swap behaviour since the swap cache hardly pollutes
andrea> the lru_cache now.

Andrea I have run here last night your patch (classzone-18) against
pre7-3 in one machine and vanilla pre7-3 in other machine.  The test
was from my memtest suite: 
    while (true); do time ./mmap002; done
suited to the memory of my system (Uniprocessor 96MB)

The results are very good for your patch:

Vanilla pre7-3            pre7-3+classzone-18
real    3m29.926s         real    2m10.210s
user    0m15.280s         real    2m10.210s
sys     0m20.500s         real    2m10.210s

That are the normal times. classzone patches variations are very low
(all the iterations are between 2m08 and 2m10).  But in vanilla
pre7-3, the variations are higher: between 3m4 and 4m20, and the worst
part, when the kswapd problem appear, the program takes until  36m20
(yes 36, ten times more, is not a typo).  Furthermore, vanilla pre7-3
kill the process after 2 hours and a half,  classzone works for more
than 12 hours without a problem.

That is the description of the situation.  Andrea, why do you reverse
the patch of filemap.c:truncate_inode_pages() of using TryLockPage()?
That change was proposed by Rik due to an Oops that I get here.  It
was one of the non-easily reproducible ones that I get.

Later, Juan.

-- 
In theory, practice and theory are the same, but in practice they 
are different -- Larry McVoy
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 14:40 ` Juan J. Quintela
@ 2000-05-04 15:19   ` Andrea Arcangeli
  2000-05-04 15:23     ` Andrea Arcangeli
                       ` (3 more replies)
  2000-05-04 16:34   ` Juan J. Quintela
  1 sibling, 4 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2000-05-04 15:19 UTC (permalink / raw)
  To: Juan J. Quintela; +Cc: linux-mm, linux-kernel, trond.myklebust

On 4 May 2000, Juan J. Quintela wrote:

>suited to the memory of my system (Uniprocessor 96MB)

Do you have also an SMP to try it out too? On IA32-SMP and alpha-SMP I
definitely can't reproduce the pico lockup reported to me a few hours ago.

>The results are very good for your patch:
>
>Vanilla pre7-3            pre7-3+classzone-18
>real    3m29.926s         real    2m10.210s
>user    0m15.280s         real    2m10.210s
>sys     0m20.500s         real    2m10.210s

;)

>That is the description of the situation.  Andrea, why do you reverse
>the patch of filemap.c:truncate_inode_pages() of using TryLockPage()?

Because it's not necessary as far I can tell. Only one
truncate_inode_pages() can run at once and none read or write can run
under truncate_inode_pages(). This should be enforced by the VFS, and if
that doesn't happen the truncate_inode_pages changes that gone into pre6
(and following) hides the real bug.

>That change was proposed by Rik due to an Oops that I get here.  It

Can I see the Oops?

>was one of the non-easily reproducible ones that I get.

Maybe you can try to write/read under a truncate or something like
that. I've not yet checked if there are still races in the VFS between
read/write/truncate.

BTW, maybe if you was using NFS you got in troubles with
invalidate_inode_pages. New invalidate_inode_pages in classzone patch
won't race with the old truncate_inode_page. Also without classzone-VM-18
invalidate_inode_pages can crash the kerne because _nobody_ can unlink a
mapped page-cache from the cache without first clearing the pte (and
flushing the page to disk if the pte happened to be dirty). So if
invalidate_inode_pages() runs under an inode map-shared in memory you'll
get a lockup as soon as you try to sync the page to disk. I didn't tried
to reproduce but I only read the code. However if I am right about this
reproducing should be easy. You have to MAP_SHARED on the client a file,
touch the shared mapping so that some pte become dirty, now overwrite the
file on the server and then push the client low on memory so that swap_out
will try to unmap the page -> crash. This is a security issue also for
2.2.x, fix is in classzone-VM-18 where I don't unmap the page if the page
count is > 1 (so if the page have no chance to be mapped). Effect of the
fix is that you can't MAP_SHARED and change the file from more than one
client or you have to expect inchoerency between the cache copies.

This untested patch should fix the problem also in 2.2.15 (the same way I
fixed it in classzone patch):

--- 2.2.15/mm/filemap.c	Thu May  4 13:00:40 2000
+++ /tmp/filemap.c	Thu May  4 17:11:18 2000
@@ -68,7 +68,7 @@
 
 	p = &inode->i_pages;
 	while ((page = *p) != NULL) {
-		if (PageLocked(page)) {
+		if (PageLocked(page) || atomic_read(&page->count) > 1) {
 			p = &page->next;
 			continue;
 		}


Trond, what do you think about it?

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 15:19   ` Andrea Arcangeli
@ 2000-05-04 15:23     ` Andrea Arcangeli
  2000-05-04 15:38     ` Rik van Riel
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2000-05-04 15:23 UTC (permalink / raw)
  To: Juan J. Quintela; +Cc: linux-mm, linux-kernel, trond.myklebust

On Thu, 4 May 2000, Andrea Arcangeli wrote:

>--- 2.2.15/mm/filemap.c	Thu May  4 13:00:40 2000
>+++ /tmp/filemap.c	Thu May  4 17:11:18 2000
>@@ -68,7 +68,7 @@
> 
> 	p = &inode->i_pages;
> 	while ((page = *p) != NULL) {
>-		if (PageLocked(page)) {
>+		if (PageLocked(page) || atomic_read(&page->count) > 1) {
> 			p = &page->next;
> 			continue;
> 		}

above patch is also here:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.15/invalidate_inode_pages-1

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 15:19   ` Andrea Arcangeli
  2000-05-04 15:23     ` Andrea Arcangeli
@ 2000-05-04 15:38     ` Rik van Riel
  2000-05-04 17:59       ` Andrea Arcangeli
  2000-05-04 16:34     ` Manfred Spraul, Andrea Arcangeli
  2000-05-04 16:48     ` Trond Myklebust
  3 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2000-05-04 15:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Juan J. Quintela, linux-mm, linux-kernel, trond.myklebust

On Thu, 4 May 2000, Andrea Arcangeli wrote:

> --- 2.2.15/mm/filemap.c	Thu May  4 13:00:40 2000
> +++ /tmp/filemap.c	Thu May  4 17:11:18 2000
> @@ -68,7 +68,7 @@
>  
>  	p = &inode->i_pages;
>  	while ((page = *p) != NULL) {
> -		if (PageLocked(page)) {
> +		if (PageLocked(page) || atomic_read(&page->count) > 1) {
>  			p = &page->next;
>  			continue;
>  		}

Fun, fun, fun ...

So the other CPU takes a lock on the page while we're testing
for the page->count and increments the pagecount after the lock,
while we try to do something (call __free_page(page)?) with the
page ...

As long as the other cpu increments the page count quick enough
it should be ok, but when it doesn't we can *still* end up freeing
a locked page.  I've seen backtraces where __free_pages_ok()
Oopsed on PageLocked(page) and the function was called from
truncate_inode_pages().

The fix which is in the latest kernel from Linus fixed the bug for
those people. Stubbornly reversing the fix because you haven't
managed to reproduce it yet is most probably not the right thing
to do.

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 15:19   ` Andrea Arcangeli
  2000-05-04 15:23     ` Andrea Arcangeli
  2000-05-04 15:38     ` Rik van Riel
@ 2000-05-04 16:34     ` Manfred Spraul, Andrea Arcangeli
  2000-05-04 16:48     ` Trond Myklebust
  3 siblings, 0 replies; 23+ messages in thread
From: Manfred Spraul, Andrea Arcangeli @ 2000-05-04 16:34 UTC (permalink / raw)
  To: Andrea Arcangeli, Juan J. Quintela
  Cc: linux-mm, linux-kernel, trond.myklebust

>
> Because it's not necessary as far I can tell. Only one
> truncate_inode_pages() can run at once and none read or write can run
> under truncate_inode_pages(). This should be enforced by the VFS, and if
> that doesn't happen the truncate_inode_pages changes that gone into pre6
> (and following) hides the real bug.
>

truncate: VFS acquires inode->i_sem semaphore.[fs/open.c, do_truncate()]
write: VFS doesn't acquire the semaphore [new in 2.3], but f_op->write()
could acquire the semaphore.
e.g. generic_file_write() acquires the semaphore. [mm/filemap.c]
read: no locking. AFAICS read & truncate could run in parallel.

[I'm reading 2.3.99-pre6]
--
    Manfred

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 14:40 ` Juan J. Quintela
  2000-05-04 15:19   ` Andrea Arcangeli
@ 2000-05-04 16:34   ` Juan J. Quintela
  2000-05-04 18:27     ` Chris Evans
  1 sibling, 1 reply; 23+ messages in thread
From: Juan J. Quintela @ 2000-05-04 16:34 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, linux-kernel

>>>>> "juan" == Juan J Quintela <quintela@fi.udc.es> writes:

Hi, 
    I screwed the times with the copy& paste, the real ones are:


juan> Vanilla pre7-3            pre7-3+classzone-18
juan> real    3m29.926s         real    2m10.301s
juan> user    0m15.280s         user    0m15.410s
juan> sys     0m20.500s         sys     0m8.750s

Sorry for the inconveniences.

Later, Juan.


-- 
In theory, practice and theory are the same, but in practice they 
are different -- Larry McVoy
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 15:19   ` Andrea Arcangeli
                       ` (2 preceding siblings ...)
  2000-05-04 16:34     ` Manfred Spraul, Andrea Arcangeli
@ 2000-05-04 16:48     ` Trond Myklebust
  2000-05-04 18:43       ` Andrea Arcangeli
  3 siblings, 1 reply; 23+ messages in thread
From: Trond Myklebust @ 2000-05-04 16:48 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Juan J. Quintela, linux-mm, linux-kernel

>>>>> " " == Andrea Arcangeli <andrea@suse.de> writes:

     > This untested patch should fix the problem also in 2.2.15 (the
     > same way I fixed it in classzone patch):

     > --- 2.2.15/mm/filemap.c Thu May 4 13:00:40 2000
     > +++ /tmp/filemap.c Thu May 4 17:11:18 2000
     > @@ -68,7 +68,7 @@
 
     >  	p = &inode->i_pages; while ((page = *p) != NULL) {
     > - if (PageLocked(page)) {
     > + if (PageLocked(page) || atomic_read(&page->count) > 1) {
     >  			p = &page->next; continue;
     >  		}


     > Trond, what do you think about it?

Not good. If I'm running /bin/bash, and somebody on the server updates
/bin/bash, then I don't want to reboot my machine. With the above
patch, then all new processes will receive a mixture of pages from the
old and the new file until by some accident I manage to clear the
cache of the bad pages.

We have to insist on the PageLocked() both in 2.2.x and 2.3.x because
only pages which are in the process of being read in are safe. If we
know we're scheduled to write out a full page then that would be safe
too, but that is the only such case.

Cheers,
  Trond

PS: It would be nice to have truncate_inode_pages() work in the same
way as it does now: waiting on pages and locking them. This is useful
for reading in the directory pages, since they need to be read in
sequentially (please see the proposed patch I put on l-k earlier
today).
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 15:38     ` Rik van Riel
@ 2000-05-04 17:59       ` Andrea Arcangeli
  2000-05-04 19:24         ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2000-05-04 17:59 UTC (permalink / raw)
  To: riel; +Cc: Juan J. Quintela, linux-mm, linux-kernel, trond.myklebust

On Thu, 4 May 2000, Rik van Riel wrote:

>On Thu, 4 May 2000, Andrea Arcangeli wrote:
>
>> --- 2.2.15/mm/filemap.c	Thu May  4 13:00:40 2000
       ^^^^^^
>> +++ /tmp/filemap.c	Thu May  4 17:11:18 2000
>> @@ -68,7 +68,7 @@
>>  
>>  	p = &inode->i_pages;
>>  	while ((page = *p) != NULL) {
>> -		if (PageLocked(page)) {
>> +		if (PageLocked(page) || atomic_read(&page->count) > 1) {
>>  			p = &page->next;
>>  			continue;
>>  		}
		XXXXXXXXXXXXXXXX
>
>Fun, fun, fun ...
>
>So the other CPU takes a lock on the page while we're testing
>for the page->count and increments the pagecount after the lock,
>while we try to do something (call __free_page(page)?) with the
>page ...

You're obviously wrong:

1) the other cpu on 2.2.15 were spinning on the big kernel lock
   and had no way to try to lock down the page we're processing.
2) if what you described above would be true, then virgin 2.2.15 and
   all the 2.2.x official/unofficial kernels out there would
   have a major SMP race anyway (not thanks to my above fix) because it
   would mean that in point XXXXXXXXXXXXXXXXX the page could become locked
   from under us.

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 16:34   ` Juan J. Quintela
@ 2000-05-04 18:27     ` Chris Evans
  0 siblings, 0 replies; 23+ messages in thread
From: Chris Evans @ 2000-05-04 18:27 UTC (permalink / raw)
  To: Juan J. Quintela; +Cc: Andrea Arcangeli, linux-mm, linux-kernel

On 4 May 2000, Juan J. Quintela wrote:

>     I screwed the times with the copy& paste, the real ones are:
> 
> 
> juan> Vanilla pre7-3            pre7-3+classzone-18
> juan> real    3m29.926s         real    2m10.301s

I'm interested - how does the 2m10 figure compare with a run using the 2.2
kernel?

Cheers
Chris

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 16:48     ` Trond Myklebust
@ 2000-05-04 18:43       ` Andrea Arcangeli
  2000-05-04 19:32         ` Trond Myklebust
  0 siblings, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2000-05-04 18:43 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Juan J. Quintela, linux-mm, linux-kernel

On 4 May 2000, Trond Myklebust wrote:

>Not good. If I'm running /bin/bash, and somebody on the server updates
>/bin/bash, then I don't want to reboot my machine. With the above

If you use rename(2) to update the shell (as you should since `cp` would
corrupt also users that are reading /bin/bash from local fs) then nfs
should get it right also with my patch since it should notice the inode
number changed (the nfs fd handle should get the inode number as cookie),
right?

>We have to insist on the PageLocked() both in 2.2.x and 2.3.x because
>only pages which are in the process of being read in are safe. If we
>know we're scheduled to write out a full page then that would be safe
>too, but that is the only such case.

I'm not wondering about locking/coherency/read/writes.

The only problem I am wondering about is that we simply can't unlink
_mapped_ page-cache pages from the pagecache as we do now.

Say there's page A in the page cache. It gets mapped into a pte of process
X. Then before you can drop A from the page cache to invalidate it
(because such page changed on the nfs server), you _first_ have to unmap
such page from the pte of process X. This is why invalidate_inode_pages
must not unlink mapped pages. It's not a locking problem, PageLocked()
pagecache_lock and all other locks are irrelevant. It's not a race but a
design issue.

>PS: It would be nice to have truncate_inode_pages() work in the same
>way as it does now: waiting on pages and locking them. This is useful
>for reading in the directory pages, since they need to be read in
>sequentially (please see the proposed patch I put on l-k earlier
>today).

I'll look at it.

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 17:59       ` Andrea Arcangeli
@ 2000-05-04 19:24         ` Rik van Riel
  0 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2000-05-04 19:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Juan J. Quintela, linux-mm, linux-kernel, trond.myklebust

On Thu, 4 May 2000, Andrea Arcangeli wrote:
> On Thu, 4 May 2000, Rik van Riel wrote:
> 
> >On Thu, 4 May 2000, Andrea Arcangeli wrote:
> >
> >> --- 2.2.15/mm/filemap.c	Thu May  4 13:00:40 2000
>        ^^^^^^
> You're obviously wrong:
> 
> 1) the other cpu on 2.2.15 were spinning on the big kernel lock

Ooops, sorry.

I had my mind wrapped around the 2.3 code so tight that
I looked at the Subject and the code only and didn't
spot the kernel version in the diff header ;)

cheers,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 18:43       ` Andrea Arcangeli
@ 2000-05-04 19:32         ` Trond Myklebust
  2000-05-04 20:15           ` Andrea Arcangeli
  0 siblings, 1 reply; 23+ messages in thread
From: Trond Myklebust @ 2000-05-04 19:32 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Juan J. Quintela, linux-mm, linux-kernel

>>>>> " " == Andrea Arcangeli <andrea@suse.de> writes:

     > On 4 May 2000, Trond Myklebust wrote:
    >> Not good. If I'm running /bin/bash, and somebody on the server
    >> updates /bin/bash, then I don't want to reboot my machine. With
    >> the above

     > If you use rename(2) to update the shell (as you should since
     > `cp` would corrupt also users that are reading /bin/bash from
     > local fs) then nfs should get it right also with my patch since
     > it should notice the inode number changed (the nfs fd handle
     > should get the inode number as cookie), right?

Yes, but I'm on the client: I cannot guarantee that people on the
server will do it 'right'. The server can have temporarily dropped
down into single user mode in order to protect its own users for all I
know.

Accuracy has to be the first rule whatever the case.

     > The only problem I am wondering about is that we simply can't
     > unlink _mapped_ page-cache pages from the pagecache as we do
     > now.

     > Say there's page A in the page cache. It gets mapped into a pte
     > of process
     > X. Then before you can drop A from the page cache to invalidate
     >    it
     > (because such page changed on the nfs server), you _first_ have
     > to unmap such page from the pte of process X. This is why
     > invalidate_inode_pages must not unlink mapped pages. It's not a
     > locking problem, PageLocked() pagecache_lock and all other
     > locks are irrelevant. It's not a race but a design issue.

As far as NFS is concerned, that page is incorrect and should be read
in again whenever we next try to access it. That is the purpose of the
call to invalidate_inode_pages().  As far as I can see, your patch
fundamentally breaks that concept for all files whether they are
mmapped or not.

When you say 'unmap from the pte', what exactly do you mean? Why does
such a page still have to be part of an inode's i_data?

Cheers,
  Trond
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 19:32         ` Trond Myklebust
@ 2000-05-04 20:15           ` Andrea Arcangeli
  2000-05-05  7:01             ` Trond Myklebust
  0 siblings, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2000-05-04 20:15 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Juan J. Quintela, linux-mm, linux-kernel

On Thu, 4 May 2000, Trond Myklebust wrote:

>Yes, but I'm on the client: I cannot guarantee that people on the
>server will do it 'right'. [..]

If people on the server uses `cp` to upgrade bash they will screwup
themselfs and their shell will segfault from under them elventually.

> [..] The server can have temporarily dropped
>down into single user mode in order to protect its own users for all I
>know.
>
>Accuracy has to be the first rule whatever the case.

I fully see your point, however as far I can see (at least for 2.2.x where
we probably don't want to redesign the VM rules) we have to choose between
accuracy and stability and I choose stability. I much prefer to reboot
cleanly the machine (or more simply unmount/remount the nfs) than to
crash. That's also a local security issue, btw.

>As far as NFS is concerned, that page is incorrect and should be read
>in again whenever we next try to access it. That is the purpose of the
>call to invalidate_inode_pages().  As far as I can see, your patch
>fundamentally breaks that concept for all files whether they are
>mmapped or not.

It breaks the concept only for mmaped files. non mmaped files have
page->count == 1 so their cache will be shrunk completly as usual.

>When you say 'unmap from the pte', what exactly do you mean? Why does
               ^^^^^^^^^^^^^^^^^^

unmapping page from the pagetable means that later userspace won't be
anymore able to read/write to the page (only kernel will have visibility
on the page then and you'll read from the page in each read(2) and
write(2)). A page in the cache can be mapped in several ptes and we have
to unmap it from all them before we're allowed to unlink the page from the
pagecache or current VM will break.

>such a page still have to be part of an inode's i_data?

Mapped page-cache can't be unlinked from the cache as first because when
you'll have to sync the dirty shard mapping (because you run low on memory
and you have to get rid of dirty data in the VM) you won't know anymore
which inode and which fs the page belongs to.

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 20:15           ` Andrea Arcangeli
@ 2000-05-05  7:01             ` Trond Myklebust
  0 siblings, 0 replies; 23+ messages in thread
From: Trond Myklebust @ 2000-05-05  7:01 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Juan J. Quintela, linux-mm, linux-kernel

>>>>> " " == Andrea Arcangeli <andrea@suse.de> writes:

    >> As far as NFS is concerned, that page is incorrect and should
    >> be read in again whenever we next try to access it. That is the
    >> purpose of the call to invalidate_inode_pages().  As far as I
    >> can see, your patch fundamentally breaks that concept for all
    >> files whether they are mmapped or not.

     > It breaks the concept only for mmaped files. non mmaped files
     > have page->count == 1 so their cache will be shrunk completly
     > as usual.

If there are pending asynchronous writes then this is neither true in
2.2.x nor in 2.3.x. Each pending writeback has to increment the
page->count in order to prevent the page from disappearing beneath
it. Since a writeback does not have to involve the whole page, we
cannot assume that just because the page is dirty, then it won't want
to get invalidated. Imagine the scenario:

   Process 1 (on client 1)             Process 2 (on client 2)

   Schedule asynchronous write
   on bytes 0-255 of file

                                      Write bytes 256-511 of same file.

   revalidate inode
   discover that file has changed
   try to invalidate page 0

Under your patch, the invalidation of page 0 will fail due to the
pending writeback, and hence process 1 will never see what process 2
wrote.

     > unmapping page from the pagetable means that later userspace
     > won't be anymore able to read/write to the page (only kernel
     > will have visibility on the page then and you'll read from the
     > page in each read(2) and write(2)). A page in the cache can be
     > mapped in several ptes and we have to unmap it from all them
     > before we're allowed to unlink the page from the pagecache or
     > current VM will break.

Could this be done as part of "invalidate_inode_pages" or would that
break the VM?

    >> such a page still have to be part of an inode's i_data?

     > Mapped page-cache can't be unlinked from the cache as first
     > because when you'll have to sync the dirty shard mapping
     > (because you run low on memory and you have to get rid of dirty
     > data in the VM) you won't know anymore which inode and which fs
     > the page belongs to.

You have the vma->vm_file and hence both dentry and inode.

Don't forget that on NFS, the inode is just a pretty collection of
statistics. It contains our estimates of the data, size, creation
times...  It does *not* contain sufficient information to sync a page
to storage, and if the VM assumes that it does, then it is clearly
broken.
Under NFS all read and write operations require us to use a file
handle, which is stored in the dentry, not in the inode. So you will
always be required to use the vm_file in some form or other.

Cheers,
  Trond
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 23:44 ` Andrea Arcangeli
  2000-05-05  0:03   ` Jens Axboe
  2000-05-05  3:04   ` David S. Miller
@ 2000-05-06 13:37   ` Andrea Arcangeli
  2 siblings, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2000-05-06 13:37 UTC (permalink / raw)
  To: Shane Shrybman, Juan J. Quintela, gandalf, Joerg Stroettchen
  Cc: linux-kernel, Jens Axboe, linux-mm

New classzone patch for 2.3.99-pre7-pre6 is here:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.3/2.3.99-pre7-pre6/classzone-24.gz

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-05  3:04   ` David S. Miller
  2000-05-05  8:43     ` Russell King
@ 2000-05-05 14:56     ` Andrea Arcangeli
  1 sibling, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2000-05-05 14:56 UTC (permalink / raw)
  To: David S. Miller
  Cc: shrybman, quintela, gandalf, joerg.stroettchen, linux-kernel,
	axboe, linux-mm

On Thu, 4 May 2000, David S. Miller wrote:

>Andrea, please do not pass IRQ state "flags" to another function
>and try to restore them in this way, it breaks Sparc and any other
>cpu which keeps "stack frame" state in the flags value.  "flags" must
>be obtained and restored in the same function.

Ok, thanks. Probably it worth to add a comment about this also in
reschedule_idle.

Fix is the same as in reschedule_idle, make rmqueue inline.

Thanks.

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-05  3:04   ` David S. Miller
@ 2000-05-05  8:43     ` Russell King
  2000-05-05 14:56     ` Andrea Arcangeli
  1 sibling, 0 replies; 23+ messages in thread
From: Russell King @ 2000-05-05  8:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: andrea, shrybman, quintela, gandalf, joerg.stroettchen,
	linux-kernel, axboe, linux-mm

David S. Miller writes:
> Andrea, please do not pass IRQ state "flags" to another function
> and try to restore them in this way, it breaks Sparc and any other
> cpu which keeps "stack frame" state in the flags value.  "flags" must
> be obtained and restored in the same function.

On some of the older (obsolete) ARMs, this is also not possible - the
IRQ state is restored each time a function exits in kernel mode.
(I'm not too concerned with these today).

I've seen this done somewhere else in the kernel as well.  How about
changing flags to an architecture-defined struct, and doing something
like:

extern inline void save_flags(struct flags *flg)
{
#ifdef CATCH_BAD_FLAGS
	flg->ret = __builtin_return_address(0);
#endif
	__save_flags(flg->flag);
}

extern inline void restore_flags(struct flags *flg)
{
#ifdef CATCH_BAD_FLAGS
	if (flg->ret != __builtin_return_address(0))
		BUG();
#endif
	__restore_flags(flg->flag);
}

Of course, CATCH_BAD_FLAGS would be turned off for the stable series to
reduce the impact of the check, but at least we could catch bad usage on
development kernels easily.

Of course, the above is dependent on __builtin_return_address() returning
the return address of the function that these were inlined into.

I'm just wondering - what about spinlocks?  There are a couple of instances
where a spinlock is taken in a parent function and temporarily released in
one of its child functions.  I'm not happy with this usage, but if this is
a legal usage of the spinlocks, then the above may bite when it shouldn't.
   _____
  |_____| ------------------------------------------------- ---+---+-
  |   |         Russell King        rmk@arm.linux.org.uk      --- ---
  | | | |   http://www.arm.linux.org.uk/~rmk/aboutme.html    /  /  |
  | +-+-+                                                     --- -+-
  /   |               THE developer of ARM Linux              |+| /|\
 /  | | |                                                     ---  |
    +-+-+ -------------------------------------------------  /\\\  |
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 23:44 ` Andrea Arcangeli
  2000-05-05  0:03   ` Jens Axboe
@ 2000-05-05  3:04   ` David S. Miller
  2000-05-05  8:43     ` Russell King
  2000-05-05 14:56     ` Andrea Arcangeli
  2000-05-06 13:37   ` Andrea Arcangeli
  2 siblings, 2 replies; 23+ messages in thread
From: David S. Miller @ 2000-05-05  3:04 UTC (permalink / raw)
  To: andrea
  Cc: shrybman, quintela, gandalf, joerg.stroettchen, linux-kernel,
	axboe, linux-mm

Andrea, please do not pass IRQ state "flags" to another function
and try to restore them in this way, it breaks Sparc and any other
cpu which keeps "stack frame" state in the flags value.  "flags" must
be obtained and restored in the same function.

You do this in your rmqueue() changes.

Thanks.

Later,
David S. Miller
davem@redhat.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
  2000-05-04 23:44 ` Andrea Arcangeli
@ 2000-05-05  0:03   ` Jens Axboe
  2000-05-05  3:04   ` David S. Miller
  2000-05-06 13:37   ` Andrea Arcangeli
  2 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2000-05-05  0:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Shane Shrybman, Juan J. Quintela, gandalf, Joerg Stroettchen,
	linux-kernel, linux-mm

On Fri, May 05 2000, Andrea Arcangeli wrote:
> As somebody noticed classzone-18 had a deadlock condition. It was due a
> silly bug and I fixed it in the new classzone-22:

Yes, this fixes the truncate bug here. No problems noticed so far.

-- 
*  Jens Axboe <axboe@suse.de>
*  Linux CD/DVD-ROM, SuSE Labs
*  http://kernel.dk
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: classzone-VM + mapped pages out of lru_cache
       [not found] <3911ECCD.BA1BB24E@arcormail.de>
@ 2000-05-04 23:44 ` Andrea Arcangeli
  2000-05-05  0:03   ` Jens Axboe
                     ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2000-05-04 23:44 UTC (permalink / raw)
  To: Shane Shrybman, Juan J. Quintela, gandalf, Joerg Stroettchen
  Cc: linux-kernel, Jens Axboe, linux-mm

As somebody noticed classzone-18 had a deadlock condition. It was due a
silly bug and I fixed it in the new classzone-22:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.3/2.3.99-pre7-pre4/classzone-22.gz

This new one also doesn't allow read(2) to see the page that we're going
to drop from truncate(2). There are no other changes. It's against
2.3.99-pre7-pre4.

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2000-05-06 13:37 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-05-03 16:26 classzone-VM + mapped pages out of lru_cache Andrea Arcangeli
2000-05-04  0:42 ` David S. Miller
2000-05-04 10:00   ` Andrea Arcangeli
2000-05-04 14:40 ` Juan J. Quintela
2000-05-04 15:19   ` Andrea Arcangeli
2000-05-04 15:23     ` Andrea Arcangeli
2000-05-04 15:38     ` Rik van Riel
2000-05-04 17:59       ` Andrea Arcangeli
2000-05-04 19:24         ` Rik van Riel
2000-05-04 16:34     ` Manfred Spraul, Andrea Arcangeli
2000-05-04 16:48     ` Trond Myklebust
2000-05-04 18:43       ` Andrea Arcangeli
2000-05-04 19:32         ` Trond Myklebust
2000-05-04 20:15           ` Andrea Arcangeli
2000-05-05  7:01             ` Trond Myklebust
2000-05-04 16:34   ` Juan J. Quintela
2000-05-04 18:27     ` Chris Evans
     [not found] <3911ECCD.BA1BB24E@arcormail.de>
2000-05-04 23:44 ` Andrea Arcangeli
2000-05-05  0:03   ` Jens Axboe
2000-05-05  3:04   ` David S. Miller
2000-05-05  8:43     ` Russell King
2000-05-05 14:56     ` Andrea Arcangeli
2000-05-06 13:37   ` Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox