linux-mm.kvack.org archive mirror
* Keeping mmap'ed files in core regression in 2.6.7-rc
@ 2004-06-08 14:29 Miquel van Smoorenburg
  2004-06-12  6:56 ` Nick Piggin
  0 siblings, 1 reply; 11+ messages in thread
From: Miquel van Smoorenburg @ 2004-06-08 14:29 UTC (permalink / raw)
  To: linux-mm

I'm running a Usenet news server with a full feed. Software is
INN 2.4.1.

The list of all articles is called the "history database" and is
indexed by a history.hash and a history.index file, both sized
around 300-400 MB.

These hash and index files are mmap'ed by the main innd process.
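
For reference, this is just the usual MAP_SHARED file mapping; a
minimal sketch of the idea (the names here are illustrative, not
the actual INN code):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	/* Map a history index MAP_SHARED; lookups then go through
	 * the page cache. */
	static void *map_index(const char *path, size_t *lenp)
	{
		struct stat st;
		void *p = MAP_FAILED;
		int fd = open(path, O_RDWR);

		if (fd < 0)
			return NULL;
		if (fstat(fd, &st) == 0)
			p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);
		close(fd);	/* the mapping stays valid after close */
		if (p == MAP_FAILED)
			return NULL;
		*lenp = st.st_size;
		return p;
	}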

A full Usenet feed is 800-1000 GB/day; that's ~12 MB/sec of incoming
traffic going to the local spool disk. About the same amount of
traffic is sent out to peers.

With kernels 2.6.0 - 2.6.6, I did an "echo 15 > /proc/sys/vm/swappiness"
and the kernel did a pretty good job of keeping the mmap'ed files
mostly in core, which is needed for performance (100-200 database
queries/sec!).

This is the output with a 2.6.6 kernel:

# ps u -C innd
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
news       276 26.8 60.2 817228 624932 ?     D    01:57 232:55 /usr/local/news/b

Now I tried 2.6.7-rc2 and -rc3 (well rc2-bk-latest-before-rc3) and
with those kernels, performance goes to hell because no matter
how much I tune, the kernel will throw out the mmap'ed pages first.
RSS of the innd process hovers around 200-250 MB instead of 600.

Ideas ?

Mike.


* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
  2004-06-08 14:29 Keeping mmap'ed files in core regression in 2.6.7-rc Miquel van Smoorenburg
@ 2004-06-12  6:56 ` Nick Piggin
  2004-06-14 14:06   ` Miquel van Smoorenburg
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2004-06-12  6:56 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: linux-mm

[-- Attachment #1: Type: text/plain, Size: 359 bytes --]

Miquel van Smoorenburg wrote:

> Now I tried 2.6.7-rc2 and -rc3 (well rc2-bk-latest-before-rc3) and
> with those kernels, performance goes to hell because no matter
> how much I tune, the kernel will throw out the mmap'ed pages first.
> RSS of the innd process hovers around 200-250 MB instead of 600.
> 
> Ideas ?
> 

Can you try the following patch please?

[-- Attachment #2: vm-revert-fix.patch --]
[-- Type: text/x-patch, Size: 998 bytes --]

 linux-2.6-npiggin/mm/vmscan.c |    7 ++-----
 1 files changed, 2 insertions(+), 5 deletions(-)

diff -puN mm/vmscan.c~vm-revert-fix mm/vmscan.c
--- linux-2.6/mm/vmscan.c~vm-revert-fix	2004-06-12 16:53:02.000000000 +1000
+++ linux-2.6-npiggin/mm/vmscan.c	2004-06-12 16:54:26.000000000 +1000
@@ -813,9 +813,8 @@ shrink_caches(struct zone **zones, int p
 		struct zone *zone = zones[i];
 		int max_scan;
 
-		zone->temp_priority = priority;
-		if (zone->prev_priority > priority)
-			zone->prev_priority = priority;
+		if (zone->free_pages < zone->pages_high)
+			zone->temp_priority = priority;
 
 		if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 			continue;	/* Let kswapd poll it */
@@ -996,8 +995,6 @@ scan:
 					all_zones_ok = 0;
 			}
 			zone->temp_priority = priority;
-			if (zone->prev_priority > priority)
-				zone->prev_priority = priority;
 			max_scan = (zone->nr_active + zone->nr_inactive)
 								>> priority;
 			reclaimed = shrink_zone(zone, max_scan, GFP_KERNEL,

_
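
For context on why this helps mapped pages: prev_priority feeds the
"distress" term that refill_inactive_zone() uses to decide whether to
reclaim mapped pages at all. Paraphrasing the 2.6.7-era mm/vmscan.c
logic from memory (a sketch, not a verbatim quote of the source):

	/* sketch of the reclaim_mapped decision, paraphrased */
	mapped_ratio = (nr_mapped * 100) / total_memory;
	distress = 100 >> zone->prev_priority;	/* 0 at DEF_PRIORITY */
	swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;
	if (swap_tendency >= 100)
		reclaim_mapped = 1;	/* else mapped pages are spared */

If prev_priority gets pushed toward 0 on zones that are not actually
under pressure, distress approaches 100 and mapped pages are reclaimed
no matter how low swappiness is set; the patch above stops lowering
the priority for zones whose free pages are still above pages_high.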

* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
  2004-06-12  6:56 ` Nick Piggin
@ 2004-06-14 14:06   ` Miquel van Smoorenburg
  2004-06-15  3:03     ` Nick Piggin
  0 siblings, 1 reply; 11+ messages in thread
From: Miquel van Smoorenburg @ 2004-06-14 14:06 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Miquel van Smoorenburg, linux-mm

On 2004.06.12 08:56, Nick Piggin wrote:
> Miquel van Smoorenburg wrote:
> 
> > Now I tried 2.6.7-rc2 and -rc3 (well rc2-bk-latest-before-rc3) and
> > with those kernels, performance goes to hell because no matter
> > how much I tune, the kernel will throw out the mmap'ed pages first.
> > RSS of the innd process hovers around 200-250 MB instead of 600.
> > 
> > Ideas ?
> > 
> 
> Can you try the following patch please?

The patch below indeed fixes this problem. Now most of the mmap'ed files
are actually kept in memory and RSS is around 600 MB again:

$ uname -a
Linux quantum 2.6.7-rc3 #1 SMP Mon Jun 14 12:48:34 CEST 2004 i686 GNU/Linux
$ free
             total       used       free     shared    buffers     cached
Mem:       1037240     897668     139572          0     159320     501688
-/+ buffers/cache:     236660     800580
Swap:       996020      16160     979860
$ ps u -C innd
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
news       277 31.8 56.2 857124 583896 ?     D    13:02  57:01 /usr/local/news/b

Hmm, weird that 'free' says that 139 MB is unused.. the box is doing
lots of I/O. 'free' hovers between 30 - 250 MB over time.

Look, 1 minute later:

$ free
             total       used       free     shared    buffers     cached
Mem:       1037240     788368     248872          0      29260     497600
-/+ buffers/cache:     261508     775732
Swap:       996020      16260     979760

Ah wait, that appears to be an outgoing feed process that keeps on allocating
and freeing memory at a fast rate, so that makes sense I guess. At least
the RSS of the main innd process remains steady at around 600 MB and that
is what is important for this application.

Thanks,

Mike.



>  linux-2.6-npiggin/mm/vmscan.c |    7 ++-----
>  1 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff -puN mm/vmscan.c~vm-revert-fix mm/vmscan.c
> --- linux-2.6/mm/vmscan.c~vm-revert-fix	2004-06-12 16:53:02.000000000 +1000
> +++ linux-2.6-npiggin/mm/vmscan.c	2004-06-12 16:54:26.000000000 +1000
> @@ -813,9 +813,8 @@ shrink_caches(struct zone **zones, int p
>  		struct zone *zone = zones[i];
>  		int max_scan;
>  
> -		zone->temp_priority = priority;
> -		if (zone->prev_priority > priority)
> -			zone->prev_priority = priority;
> +		if (zone->free_pages < zone->pages_high)
> +			zone->temp_priority = priority;
>  
>  		if (zone->all_unreclaimable && priority != DEF_PRIORITY)
>  			continue;	/* Let kswapd poll it */
> @@ -996,8 +995,6 @@ scan:
>  					all_zones_ok = 0;
>  			}
>  			zone->temp_priority = priority;
> -			if (zone->prev_priority > priority)
> -				zone->prev_priority = priority;
>  			max_scan = (zone->nr_active + zone->nr_inactive)
>  								>> priority;
>  			reclaimed = shrink_zone(zone, max_scan, GFP_KERNEL,
> 
> _
> 

* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
  2004-06-14 14:06   ` Miquel van Smoorenburg
@ 2004-06-15  3:03     ` Nick Piggin
  2004-06-15 14:31       ` Miquel van Smoorenburg
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2004-06-15  3:03 UTC (permalink / raw)
  To: Miquel van Smoorenburg, Andrew Morton; +Cc: linux-mm

Miquel van Smoorenburg wrote:
> On 2004.06.12 08:56, Nick Piggin wrote:
> 
>>Miquel van Smoorenburg wrote:
>>
>>
>>>Now I tried 2.6.7-rc2 and -rc3 (well rc2-bk-latest-before-rc3) and
>>>with those kernels, performance goes to hell because no matter
>>>how much I tune, the kernel will throw out the mmap'ed pages first.
>>>RSS of the innd process hovers around 200-250 MB instead of 600.
>>>
>>>Ideas ?
>>>
>>
>>Can you try the following patch please?
> 
> 
> The patch below indeed fixes this problem. Now most of the mmap'ed files
> are actually kept in memory and RSS is around 600 MB again:
> 

OK good. Cc'ing Andrew.

> $ uname -a
> Linux quantum 2.6.7-rc3 #1 SMP Mon Jun 14 12:48:34 CEST 2004 i686 GNU/Linux
> $ free
>              total       used       free     shared    buffers     cached
> Mem:       1037240     897668     139572          0     159320     501688
> -/+ buffers/cache:     236660     800580
> Swap:       996020      16160     979860
> $ ps u -C innd
> USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
> news       277 31.8 56.2 857124 583896 ?     D    13:02  57:01 /usr/local/news/b
> 
> Hmm, weird that 'free' says that 139 MB is unused.. the box is doing
> lots of I/O. 'free' hovers between 30 - 250 MB over time.
> 
> Look, 1 minute later:
> 
> $ free
>              total       used       free     shared    buffers     cached
> Mem:       1037240     788368     248872          0      29260     497600
> -/+ buffers/cache:     261508     775732
> Swap:       996020      16260     979760
> 
> Ah wait, that appears to be an outgoing feed process that keeps on allocating
> and freeing memory at a fast rate, so that makes sense I guess. At least

That would be right.

> the RSS of the main innd process remains steady at around 600 MB and that
> is what is important for this application.
> 

Absolute performance is the thing that matters at the end of the day.
Is it as good as 2.6.6 now?

Thanks

* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
  2004-06-15  3:03     ` Nick Piggin
@ 2004-06-15 14:31       ` Miquel van Smoorenburg
  2004-06-16  3:16         ` Nick Piggin
  0 siblings, 1 reply; 11+ messages in thread
From: Miquel van Smoorenburg @ 2004-06-15 14:31 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-mm

According to Nick Piggin:
> Miquel van Smoorenburg wrote:
> >On 2004.06.12 08:56, Nick Piggin wrote:
> >
> >>Miquel van Smoorenburg wrote:
> >>
> >>
> >>>Now I tried 2.6.7-rc2 and -rc3 (well rc2-bk-latest-before-rc3) and
> >>>with those kernels, performance goes to hell because no matter
> >>>how much I tune, the kernel will throw out the mmap'ed pages first.
> >>>RSS of the innd process hovers around 200-250 MB instead of 600.
> >>>
> >>Can you try the following patch please?
> >
> >The patch below indeed fixes this problem. Now most of the mmap'ed files
> >are actually kept in memory and RSS is around 600 MB again:
> 
> OK good. Cc'ing Andrew.

I've built a small test app that creates the same I/O pattern and ran it
on 2.6.6, 2.6.7-rc3 and 2.6.7-rc3+patch; running it confirms the
regression, though not as dramatically as the real-life application.
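
For the record, a minimal sketch of what such a test app could look
like (hypothetical; this is not the actual program, and the file names
and sizes are made up): random touches through a large MAP_SHARED
mapping while a second loop streams writes to a spool file.

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define MAP_SIZE	(400UL << 20)	/* ~400 MB "history" file */
	#define SPOOL_CHUNK	(1UL << 20)	/* 1 MB per write */

	int main(void)
	{
		int hfd = open("history.test", O_RDWR | O_CREAT, 0644);
		int sfd = open("spool.test",
			       O_WRONLY | O_CREAT | O_APPEND, 0644);
		static char chunk[SPOOL_CHUNK];
		volatile unsigned char sum = 0;
		unsigned char *map;
		unsigned long i;

		if (hfd < 0 || sfd < 0 || ftruncate(hfd, MAP_SIZE) < 0) {
			perror("setup");
			return 1;
		}
		map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			   MAP_SHARED, hfd, 0);
		if (map == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		for (;;) {
			/* random "history lookups" through the mapping */
			for (i = 0; i < 200; i++)
				sum += map[(unsigned long)rand() % MAP_SIZE];
			/* ~12 MB/sec of streaming "spool" writes */
			for (i = 0; i < 12; i++)
				write(sfd, chunk, sizeof(chunk));
			sleep(1);
		}
	}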



Now something else that is weird, but it might be unrelated, and I have
not found a way to reproduce it on a different machine yet, so feel
free to ignore it; I'm just mentioning it in case someone recognizes
this.

The news server process uses /dev/hd[cdg]1 directly for storage
(Cyclic News File System). There's about 12 MB/sec incoming
being stored on those 3 (SATA) disks. Look at the vmstat output:

# vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 4  0  22664   5216 277332 496644   28    0  8143    36 9785  2162 12 43 28 16
 1  3  22660 231252  71808 488580   16    0  5947 33856 8868  1633  9 60 11 20
 2  2  22660 273972  40988 489508    0    0  8895 21144 8875  1931 10 43 21 27
 3  0  22660 236412  73620 491148    0    0 10774 10551 9877  1937 10 44 24 22
 1  1  22660 185112 104112 492616    0    0  9677 12354 10216  1863 10 44 28 19
 2  0  22660 148700 138388 494108    0    0 10227 13919 9976  1925 11 44 24 21
 0  2  22660 123432 162032 495012    0    0  6244 15418 10065  1793 11 46 28 16
 3  0  22660  93096 190452 496292    8    0  6548 10293 9860  1975 11 43 31 15
 2  0  22660  51688 218628 497424    0    0  6405    52 10575  2063 13 48 27 12
 3  1  22660  19012 245632 499032    8    0  8108 12400 10136  1892 11 44 24 21
 2  1  22660 249192  42956 490932    0    0  8231 33005 9109  1343 10 60 13 18
 0  1  22660 240396  53764 491956    0    0 10082 18625 9504  1740 10 47 24 19
 2  2  22660 205632  86108 493408    0    0  8305 12368 8941  1775  8 33 32 26
 0  2  22660 164672 119156 494972    0    0  6867    62 9695  1894 11 40 31 18
 1  3  22660 137924 144964 496568    0    0  7099 16440 10388  1878 11 47 26 17
 1  1  22660 101604 176936 498052    0    0  9166 12332 10237  1694 12 44 28 16
 2  1  22660  67816 205376 499176    8    0  6169  6158 9906  1897 11 44 31 15
 1  1  22660  28004 236520 500652   10    0  7418  6202 10289  1744 12 44 30 14
 2  1  22660   7484 259156 492544   12    0  7494 18540 10218  1757 11 49 21 19
 1  4  22660  61664 228360 494004   72    0  6131 14412 9611  2437 10 46 20 23
 3  1  22660  76976 242652 498884   36    0  6927 16558 7560  2219 18 42 13 27
 0  1  22660  62352 267840 501140   14    0  7358 10424 8273  2601 11 32 33 23
 1  1  22660   6880 301056 502528    4    0 11045  2304 10177  2137 12 42 26 20
 0  4  22660 280848  40856 494196    0    0  6583 45092 9379  1505  9 61 13

See how "cache" remains stable, but free/buffers memory is oscillating?
That shouldn't happen, right ? 

I tried to reproduce it on another 2.6.7-rc3 system with
while :; do dd if=/dev/zero of=/dev/sda8 bs=1M count=10; sleep 1; done
and while I did see it oscillate once or twice, after that it remained
stable (buffers high / free memory low) and I can't seem to reproduce
it again.

Yesterday I ported my rawfs module over to 2.6. It's a minimal filesystem
that exposes a block device as a single large file. I'm letting the news
server access that instead of the block device directly, so all access goes
through the page cache instead of the buffer cache, and that runs much more
smoothly, though it's harder to tune 'swappiness' with it - it seems to be
much more "all or nothing" in that case. Anyway, that's what I'm using now.

Mike.

* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
  2004-06-15 14:31       ` Miquel van Smoorenburg
@ 2004-06-16  3:16         ` Nick Piggin
  2004-06-16  3:50           ` Andrew Morton
  2004-06-17 10:50           ` Miquel van Smoorenburg
  0 siblings, 2 replies; 11+ messages in thread
From: Nick Piggin @ 2004-06-16  3:16 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: Andrew Morton, linux-mm

Miquel van Smoorenburg wrote:
> According to Nick Piggin:
> 
>>Miquel van Smoorenburg wrote:
>>
>>>
>>>The patch below indeed fixes this problem. Now most of the mmap'ed files
>>>are actually kept in memory and RSS is around 600 MB again:
>>
>>OK good. Cc'ing Andrew.
> 
> 
> I've built a small test app that creates the same I/O pattern and ran it
> on 2.6.6, 2.6.7-rc3 and 2.6.7-rc3+patch; running it confirms the
> regression, though not as dramatically as the real-life application.
> 

Can you send the test app over?
Andrew, do you have any ideas about how to fix this so far?

> 
> 
> Now something else that is weird, but it might be unrelated, and I have
> not found a way to reproduce it on a different machine yet, so feel
> free to ignore it; I'm just mentioning it in case someone recognizes
> this.
> 
> The news server process uses /dev/hd[cdg]1 directly for storage
> (Cyclic News File System). There's about 12 MB/sec incoming
> being stored on those 3 (SATA) disks. Look at the vmstat output:
> 
> # vmstat 2
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  4  0  22664   5216 277332 496644   28    0  8143    36 9785  2162 12 43 28 16
>  1  3  22660 231252  71808 488580   16    0  5947 33856 8868  1633  9 60 11 20
>  2  2  22660 273972  40988 489508    0    0  8895 21144 8875  1931 10 43 21 27
>  3  0  22660 236412  73620 491148    0    0 10774 10551 9877  1937 10 44 24 22
>  1  1  22660 185112 104112 492616    0    0  9677 12354 10216  1863 10 44 28 19
>  2  0  22660 148700 138388 494108    0    0 10227 13919 9976  1925 11 44 24 21
>  0  2  22660 123432 162032 495012    0    0  6244 15418 10065  1793 11 46 28 16
>  3  0  22660  93096 190452 496292    8    0  6548 10293 9860  1975 11 43 31 15
>  2  0  22660  51688 218628 497424    0    0  6405    52 10575  2063 13 48 27 12
>  3  1  22660  19012 245632 499032    8    0  8108 12400 10136  1892 11 44 24 21
>  2  1  22660 249192  42956 490932    0    0  8231 33005 9109  1343 10 60 13 18
>  0  1  22660 240396  53764 491956    0    0 10082 18625 9504  1740 10 47 24 19
>  2  2  22660 205632  86108 493408    0    0  8305 12368 8941  1775  8 33 32 26
>  0  2  22660 164672 119156 494972    0    0  6867    62 9695  1894 11 40 31 18
>  1  3  22660 137924 144964 496568    0    0  7099 16440 10388  1878 11 47 26 17
>  1  1  22660 101604 176936 498052    0    0  9166 12332 10237  1694 12 44 28 16
>  2  1  22660  67816 205376 499176    8    0  6169  6158 9906  1897 11 44 31 15
>  1  1  22660  28004 236520 500652   10    0  7418  6202 10289  1744 12 44 30 14
>  2  1  22660   7484 259156 492544   12    0  7494 18540 10218  1757 11 49 21 19
>  1  4  22660  61664 228360 494004   72    0  6131 14412 9611  2437 10 46 20 23
>  3  1  22660  76976 242652 498884   36    0  6927 16558 7560  2219 18 42 13 27
>  0  1  22660  62352 267840 501140   14    0  7358 10424 8273  2601 11 32 33 23
>  1  1  22660   6880 301056 502528    4    0 11045  2304 10177  2137 12 42 26 20
>  0  4  22660 280848  40856 494196    0    0  6583 45092 9379  1505  9 61 13
> 
> See how "cache" remains stable, but free/buffers memory is oscillating?
> That shouldn't happen, right ? 
> 

If it is doing IO to large regions of mapped memory, the page reclaim
can start getting a bit chunky. Not much you can do about it, but it
shouldn't do any harm.

> I tried to reproduce it on another 2.6.7-rc3 system with
> while :; do dd if=/dev/zero of=/dev/sda8 bs=1M count=10; sleep 1; done
> and while I did see it oscillate once or twice, after that it remained
> stable (buffers high / free memory low) and I can't seem to reproduce
> it again.
> 

Probably because it isn't doing mmapped IO.

> Yesterday I ported my rawfs module over to 2.6. It's a minimal filesystem
> that exposes a block device as a single large file. I'm letting the news
> server access that instead of the block device directly, so all access goes
> through the page cache instead of the buffer cache, and that runs much more
> smoothly, though it's harder to tune 'swappiness' with it - it seems to be
> much more "all or nothing" in that case. Anyway, that's what I'm using now.
> 

In 2.6, everything basically should go through the same path I think,
so it really shouldn't make much difference.

The fact that swappiness stops having any effect sounds like the server
switched from doing mapped IO to read/write. Maybe I'm crazy... could
you verify?

* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
  2004-06-16  3:16         ` Nick Piggin
@ 2004-06-16  3:50           ` Andrew Morton
  2004-06-16  4:03             ` Nick Piggin
  2004-06-17 10:50           ` Miquel van Smoorenburg
  1 sibling, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2004-06-16  3:50 UTC (permalink / raw)
  To: Nick Piggin; +Cc: miquels, linux-mm

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Can you send the test app over?

logical next step.

> Andrew, do you have any ideas about how to fix this so far?

Not sure what, if anything, is wrong yet.  It could be that reclaim is now
doing the "right" thing, but this particular workload preferred the "wrong"
thing.  Needs more investigation.


> > 
> > See how "cache" remains stable, but free/buffers memory is oscillating?
> > That shouldn't happen, right ? 
> > 
> 
> If it is doing IO to large regions of mapped memory, the page reclaim
> can start getting a bit chunky. Not much you can do about it, but it
> shouldn't do any harm.

shrink_zone() will free arbitrarily large amounts of memory as the scanning
priority increases.  Probably it shouldn't.


* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
  2004-06-16  3:50           ` Andrew Morton
@ 2004-06-16  4:03             ` Nick Piggin
  2004-06-16  4:23               ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2004-06-16  4:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: miquels, linux-mm

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>Can you send the test app over?
> 
> 
> logical next step.
> 
> 
>>Andrew, do you have any ideas about how to fix this so far?
> 
> 
> Not sure what, if anything, is wrong yet.  It could be that reclaim is now
> doing the "right" thing, but this particular workload preferred the "wrong"
> thing.  Needs more investigation.
> 
> 
> 
>>>See how "cache" remains stable, but free/buffers memory is oscillating?
>>>That shouldn't happen, right ? 
>>>
>>
>>If it is doing IO to large regions of mapped memory, the page reclaim
>>can start getting a bit chunky. Not much you can do about it, but it
>>shouldn't do any harm.
> 
> 
> shrink_zone() will free arbitrarily large amounts of memory as the scanning
> priority increases.  Probably it shouldn't.
> 
> 

Especially for kswapd, I think, because it can end up fighting with
memory allocators and think it is getting into trouble. It should
probably just keep puttering along quietly.

I have a few experimental patches that magnify this problem, so I'll
be looking at fixing it soon. The tricky part will be trying to
maintain a similar prev_priority / temp_priority balance.

* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
  2004-06-16  4:03             ` Nick Piggin
@ 2004-06-16  4:23               ` Andrew Morton
  2004-06-16  4:41                 ` Nick Piggin
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2004-06-16  4:23 UTC (permalink / raw)
  To: Nick Piggin; +Cc: miquels, linux-mm

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> > 
> > shrink_zone() will free arbitrarily large amounts of memory as the scanning
> > priority increases.  Probably it shouldn't.
> > 
> > 
> 
> Especially for kswapd, I think, because it can end up fighting with
> memory allocators and think it is getting into trouble. It should
> probably just keep puttering along quietly.
> 
> I have a few experimental patches that magnify this problem, so I'll
> be looking at fixing it soon. The tricky part will be trying to
> maintain a similar prev_priority / temp_priority balance.

hm, I don't see why.  Why not simply bail out of shrink_list() as soon as
we've reclaimed SWAP_CLUSTER_MAX pages?
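
Something like the following untested sketch, say; note that
isolate_some_lru_pages() is just an illustrative stand-in for the
open-coded LRU isolation in shrink_cache(), not a real function:

	/* in shrink_cache(): stop once a full batch has been reclaimed */
	while (nr_scanned < max_scan) {
		int nr_taken = isolate_some_lru_pages(zone, &page_list);

		if (nr_taken == 0)
			break;
		nr_scanned += nr_taken;
		nr_reclaimed += shrink_list(&page_list, sc);
		if (nr_reclaimed >= SWAP_CLUSTER_MAX)
			break;	/* bail early: that's enough for now */
	}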

I got bored of shrink_zone() bugs and rewrote it again yesterday.  Haven't
tested it much.  I really hate struct scan_control btw ;)




We've been futzing with the scan rates of the inactive and active lists far
too much, and it's still not right (Anton reports interrupt-off times of over
a second).

- We have this logic in there from 2.4.early (at least) which tries to keep
  the inactive list 1/3rd the size of the active list.  Or something.

  I really cannot see any logic behind this, so toss it out and change the
  arithmetic in there so that all pages on both lists have equal scan rates.

- Chunk the work up so we never hold interrupts off for more than 32 pages
  worth of scanning.

- Make the per-zone scan-count accumulators unsigned long rather than
  atomic_t.

  Mainly because atomic_t's could conceivably overflow, but also because
  access to these counters is racy-by-design anyway.

Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/include/linux/mmzone.h |    4 +-
 25-akpm/mm/page_alloc.c        |    4 +-
 25-akpm/mm/vmscan.c            |   70 ++++++++++++++++++-----------------------
 3 files changed, 35 insertions(+), 43 deletions(-)

diff -puN mm/vmscan.c~vmscan-scan-sanity mm/vmscan.c
--- 25/mm/vmscan.c~vmscan-scan-sanity	2004-06-15 02:19:01.485627112 -0700
+++ 25-akpm/mm/vmscan.c	2004-06-15 02:49:29.317754392 -0700
@@ -789,54 +789,46 @@ refill_inactive_zone(struct zone *zone, 
 }
 
 /*
- * Scan `nr_pages' from this zone.  Returns the number of reclaimed pages.
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
 static void
 shrink_zone(struct zone *zone, struct scan_control *sc)
 {
-	unsigned long scan_active, scan_inactive;
-	int count;
-
-	scan_inactive = (zone->nr_active + zone->nr_inactive) >> sc->priority;
+	unsigned long nr_active;
+	unsigned long nr_inactive;
 
 	/*
-	 * Try to keep the active list 2/3 of the size of the cache.  And
-	 * make sure that refill_inactive is given a decent number of pages.
-	 *
-	 * The "scan_active + 1" here is important.  With pagecache-intensive
-	 * workloads the inactive list is huge, and `ratio' evaluates to zero
-	 * all the time.  Which pins the active list memory.  So we add one to
-	 * `scan_active' just to make sure that the kernel will slowly sift
-	 * through the active list.
+	 * Add one to `nr_to_scan' just to make sure that the kernel will
+	 * slowly sift through the active list.
 	 */
-	if (zone->nr_active >= 4*(zone->nr_inactive*2 + 1)) {
-		/* Don't scan more than 4 times the inactive list scan size */
-		scan_active = 4*scan_inactive;
-	} else {
-		unsigned long long tmp;
-
-		/* Cast to long long so the multiply doesn't overflow */
-
-		tmp = (unsigned long long)scan_inactive * zone->nr_active;
-		do_div(tmp, zone->nr_inactive*2 + 1);
-		scan_active = (unsigned long)tmp;
-	}
-
-	atomic_add(scan_active + 1, &zone->nr_scan_active);
-	count = atomic_read(&zone->nr_scan_active);
-	if (count >= SWAP_CLUSTER_MAX) {
-		atomic_set(&zone->nr_scan_active, 0);
-		sc->nr_to_scan = count;
-		refill_inactive_zone(zone, sc);
-	}
+	zone->nr_scan_active += (zone->nr_active >> sc->priority) + 1;
+	nr_active = zone->nr_scan_active;
+	if (nr_active >= SWAP_CLUSTER_MAX)
+		zone->nr_scan_active = 0;
+	else
+		nr_active = 0;
+
+	zone->nr_scan_inactive += (zone->nr_inactive >> sc->priority) + 1;
+	nr_inactive = zone->nr_scan_inactive;
+	if (nr_inactive >= SWAP_CLUSTER_MAX)
+		zone->nr_scan_inactive = 0;
+	else
+		nr_inactive = 0;
+
+	while (nr_active || nr_inactive) {
+		if (nr_active) {
+			sc->nr_to_scan = min(nr_active,
+					(unsigned long)SWAP_CLUSTER_MAX);
+			nr_active -= sc->nr_to_scan;
+			refill_inactive_zone(zone, sc);
+		}
 
-	atomic_add(scan_inactive, &zone->nr_scan_inactive);
-	count = atomic_read(&zone->nr_scan_inactive);
-	if (count >= SWAP_CLUSTER_MAX) {
-		atomic_set(&zone->nr_scan_inactive, 0);
-		sc->nr_to_scan = count;
-		shrink_cache(zone, sc);
+		if (nr_inactive) {
+			sc->nr_to_scan = min(nr_inactive,
+					(unsigned long)SWAP_CLUSTER_MAX);
+			nr_inactive -= sc->nr_to_scan;
+			shrink_cache(zone, sc);
+		}
 	}
 }
 
diff -puN include/linux/mmzone.h~vmscan-scan-sanity include/linux/mmzone.h
--- 25/include/linux/mmzone.h~vmscan-scan-sanity	2004-06-15 02:49:35.705783264 -0700
+++ 25-akpm/include/linux/mmzone.h	2004-06-15 02:49:48.283871104 -0700
@@ -118,8 +118,8 @@ struct zone {
 	spinlock_t		lru_lock;	
 	struct list_head	active_list;
 	struct list_head	inactive_list;
-	atomic_t		nr_scan_active;
-	atomic_t		nr_scan_inactive;
+	unsigned long		nr_scan_active;
+	unsigned long		nr_scan_inactive;
 	unsigned long		nr_active;
 	unsigned long		nr_inactive;
 	int			all_unreclaimable; /* All pages pinned */
diff -puN mm/page_alloc.c~vmscan-scan-sanity mm/page_alloc.c
--- 25/mm/page_alloc.c~vmscan-scan-sanity	2004-06-15 02:50:04.404420408 -0700
+++ 25-akpm/mm/page_alloc.c	2004-06-15 02:50:53.752918296 -0700
@@ -1482,8 +1482,8 @@ static void __init free_area_init_core(s
 				zone_names[j], realsize, batch);
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
-		atomic_set(&zone->nr_scan_active, 0);
-		atomic_set(&zone->nr_scan_inactive, 0);
+		zone->nr_scan_active = 0;
+		zone->nr_scan_inactive = 0;
 		zone->nr_active = 0;
 		zone->nr_inactive = 0;
 		if (!size)
_
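
To make the batching arithmetic concrete (illustrative numbers;
SWAP_CLUSTER_MAX is the 32 pages mentioned in the changelog above):

	/*
	 * A zone with 400000 active pages at DEF_PRIORITY = 12:
	 *
	 *	nr_scan_active += (400000 >> 12) + 1;	=> +98 per call
	 *
	 * 98 >= SWAP_CLUSTER_MAX, so refill_inactive_zone() runs on
	 * every call, but the while loop hands it at most 32 pages
	 * at a time, which is what bounds the interrupts-off period.
	 */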


* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
  2004-06-16  4:23               ` Andrew Morton
@ 2004-06-16  4:41                 ` Nick Piggin
  0 siblings, 0 replies; 11+ messages in thread
From: Nick Piggin @ 2004-06-16  4:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: miquels, linux-mm

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>>shrink_zone() will free arbitrarily large amounts of memory as the scanning
>>>priority increases.  Probably it shouldn't.
>>>
>>>
>>
>>Especially for kswapd, I think, because it can end up fighting with
>>memory allocators and think it is getting into trouble. It should
>>probably just keep puttering along quietly.
>>
>>I have a few experimental patches that magnify this problem, so I'll
>>be looking at fixing it soon. The tricky part will be trying to
>>maintain a similar prev_priority / temp_priority balance.
> 
> 
> hm, I don't see why.  Why not simply bail out of shrink_list() as soon as
> we've reclaimed SWAP_CLUSTER_MAX pages?
> 

Oh yeah, that would be the way to go about it. Your patch looks
alright as a platform to achieve this.

> I got bored of shrink_zone() bugs and rewrote it again yesterday.  Haven't
> tested it much.  I really hate struct scan_control btw ;)
> 

Well, I can keep it local here. I have some stuff that requires more
state to be passed up and down the call chains, and passing lots of
things by reference gets annoying.

> 
> 
> 
> We've been futzing with the scan rates of the inactive and active lists far
> too much, and it's still not right (Anton reports interrupt-off times of over
> a second).
> 
> - We have this logic in there from 2.4.early (at least) which tries to keep
>   the inactive list 1/3rd the size of the active list.  Or something.
> 
>   I really cannot see any logic behind this, so toss it out and change the
>   arithmetic in there so that all pages on both lists have equal scan rates.
> 

I think it has something to do with the use-once logic. If your inactive list
remains full of use-once pages, you can happily scan them while putting
minimal pressure on the active list.

I don't think we need to try to keep it *at least* 1/3rd the size anymore.
From distant memory, that may have been when the inactive list was more
of a "writeout queue". I don't know though, it might still be useful.

> - Chunk the work up so we never hold interrupts off for more than 32 pages
>   worth of scanning.
> 

Yeah this was a bit silly. Good fix.

> - Make the per-zone scan-count accumulators unsigned long rather than
>   atomic_t.
> 
>   Mainly because atomic_t's could conceivably overflow, but also because
>   access to these counters is racy-by-design anyway.
> 

Seems OK other than my one possible issue.

* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
  2004-06-16  3:16         ` Nick Piggin
  2004-06-16  3:50           ` Andrew Morton
@ 2004-06-17 10:50           ` Miquel van Smoorenburg
  1 sibling, 0 replies; 11+ messages in thread
From: Miquel van Smoorenburg @ 2004-06-17 10:50 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Miquel van Smoorenburg, Andrew Morton, linux-mm

On 2004.06.16 05:16, Nick Piggin wrote:
> Miquel van Smoorenburg wrote:
> > According to Nick Piggin:
> > 
> >>Miquel van Smoorenburg wrote:
> >>
> >>>
> >>>The patch below indeed fixes this problem. Now most of the mmap'ed files
> >>>are actually kept in memory and RSS is around 600 MB again:
> >>
> >>OK good. Cc'ing Andrew.
> > 
> > 
> > I've built a small test app that creates the same I/O pattern and ran it
> > on 2.6.6, 2.6.7-rc3 and 2.6.7-rc3+patch; running it confirms the
> > regression, though not as dramatically as the real-life application.
> > 
> 
> Can you send the test app over?
> Andrew, do you have any ideas about how to fix this so far?

I'll have to come back to this later - I'm about to go on
vacation, and there's some other stuff that needs to be
taken care of first.

Mike.

Thread overview: 11+ messages
2004-06-08 14:29 Keeping mmap'ed files in core regression in 2.6.7-rc Miquel van Smoorenburg
2004-06-12  6:56 ` Nick Piggin
2004-06-14 14:06   ` Miquel van Smoorenburg
2004-06-15  3:03     ` Nick Piggin
2004-06-15 14:31       ` Miquel van Smoorenburg
2004-06-16  3:16         ` Nick Piggin
2004-06-16  3:50           ` Andrew Morton
2004-06-16  4:03             ` Nick Piggin
2004-06-16  4:23               ` Andrew Morton
2004-06-16  4:41                 ` Nick Piggin
2004-06-17 10:50           ` Miquel van Smoorenburg
