* Keeping mmap'ed files in core regression in 2.6.7-rc
From: Miquel van Smoorenburg @ 2004-06-08 14:29 UTC
To: linux-mm

I'm running a Usenet news server with a full feed. Software is INN 2.4.1.
The list of all articles is called the "history database" and is indexed
by a history.hash and a history.index file, both sized around 300-400 MB.
These hash and index files are mmap'ed by the main innd process.

A full usenet feed is 800-1000 GB/day, that's ~12 MB/sec of incoming
traffic going to the local spool disk. About the same amount of traffic
is sent out to peers.

With kernels 2.6.0 - 2.6.6, I did an "echo 15 > /proc/sys/vm/swappiness"
and the kernel did a pretty good job of keeping the mmap'ed files mostly
in core, which is needed for performance (100-200 database queries/sec!).
This is the output with a 2.6.6 kernel:

# ps u -C innd
USER       PID %CPU %MEM    VSZ    RSS TTY   STAT START   TIME COMMAND
news       276 26.8 60.2 817228 624932 ?     D    01:57 232:55 /usr/local/news/b

Now I tried 2.6.7-rc2 and -rc3 (well, rc2-bk-latest-before-rc3) and with
those kernels, performance goes to hell because no matter how much I tune,
the kernel will throw out the mmap'ed pages first. RSS of the innd process
hovers around 200-250 MB instead of 600.

Ideas?

Mike.

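As a rough illustration of the access pattern described above (random lookups
into a large mmap'ed index, which only performs well while those pages stay
resident), the core of it looks something like the sketch below. This is not
INN's actual code; the file name, entry size and loop count are invented.

/* Sketch only: mmap a large index file and query it at random offsets.
 * Whether each lookup is a memory access or a disk seek depends entirely
 * on the kernel keeping these pages in core. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	int fd = open("history.index", O_RDONLY);
	struct stat st;
	uint32_t *idx;
	unsigned long i, sum = 0;

	if (fd < 0 || fstat(fd, &st) < 0) {
		perror("history.index");
		return 1;
	}
	idx = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (idx == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	madvise(idx, st.st_size, MADV_WILLNEED);	/* hint that this data will be needed soon */

	/* the real server does 100-200 such lookups per second */
	for (i = 0; i < 1000; i++)
		sum += idx[rand() % (st.st_size / sizeof(*idx))];

	printf("touched %lu entries, checksum %lu\n", i, sum);
	munmap(idx, st.st_size);
	close(fd);
	return 0;
}

With lookups spread randomly across 300-400 MB, residency of the mapping is
what decides whether a query costs microseconds or a disk seek.
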
* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
From: Nick Piggin @ 2004-06-12  6:56 UTC
To: Miquel van Smoorenburg; +Cc: linux-mm

[-- Attachment #1: Type: text/plain, Size: 359 bytes --]

Miquel van Smoorenburg wrote:

> Now I tried 2.6.7-rc2 and -rc3 (well, rc2-bk-latest-before-rc3) and
> with those kernels, performance goes to hell because no matter
> how much I tune, the kernel will throw out the mmap'ed pages first.
> RSS of the innd process hovers around 200-250 MB instead of 600.
>
> Ideas?
>

Can you try the following patch please?

[-- Attachment #2: vm-revert-fix.patch --]
[-- Type: text/x-patch, Size: 998 bytes --]

 linux-2.6-npiggin/mm/vmscan.c |    7 ++-----
 1 files changed, 2 insertions(+), 5 deletions(-)

diff -puN mm/vmscan.c~vm-revert-fix mm/vmscan.c
--- linux-2.6/mm/vmscan.c~vm-revert-fix	2004-06-12 16:53:02.000000000 +1000
+++ linux-2.6-npiggin/mm/vmscan.c	2004-06-12 16:54:26.000000000 +1000
@@ -813,9 +813,8 @@ shrink_caches(struct zone **zones, int p
 		struct zone *zone = zones[i];
 		int max_scan;
 
-		zone->temp_priority = priority;
-		if (zone->prev_priority > priority)
-			zone->prev_priority = priority;
+		if (zone->free_pages < zone->pages_high)
+			zone->temp_priority = priority;
 
 		if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 			continue;	/* Let kswapd poll it */
@@ -996,8 +995,6 @@ scan:
 			all_zones_ok = 0;
 		}
 		zone->temp_priority = priority;
-		if (zone->prev_priority > priority)
-			zone->prev_priority = priority;
 		max_scan = (zone->nr_active + zone->nr_inactive)
 						>> priority;
 		reclaimed = shrink_zone(zone, max_scan, GFP_KERNEL,

_

* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
From: Miquel van Smoorenburg @ 2004-06-14 14:06 UTC
To: Nick Piggin; +Cc: Miquel van Smoorenburg, linux-mm

On 2004.06.12 08:56, Nick Piggin wrote:
> Miquel van Smoorenburg wrote:
>
> > Now I tried 2.6.7-rc2 and -rc3 (well, rc2-bk-latest-before-rc3) and
> > with those kernels, performance goes to hell because no matter
> > how much I tune, the kernel will throw out the mmap'ed pages first.
> > RSS of the innd process hovers around 200-250 MB instead of 600.
> >
> > Ideas?
> >
>
> Can you try the following patch please?

The patch below indeed fixes this problem. Now most of the mmap'ed files
are actually kept in memory and RSS is around 600 MB again:

$ uname -a
Linux quantum 2.6.7-rc3 #1 SMP Mon Jun 14 12:48:34 CEST 2004 i686 GNU/Linux
$ free
             total       used       free     shared    buffers     cached
Mem:       1037240     897668     139572          0     159320     501688
-/+ buffers/cache:      236660     800580
Swap:       996020      16160     979860
$ ps u -C innd
USER       PID %CPU %MEM    VSZ    RSS TTY   STAT START   TIME COMMAND
news       277 31.8 56.2 857124 583896 ?     D    13:02  57:01 /usr/local/news/b

Hmm, weird that 'free' says that 139 MB is unused... the box is doing
lots of I/O. 'free' hovers between 30 - 250 MB over time.

Look, 1 minute later:

$ free
             total       used       free     shared    buffers     cached
Mem:       1037240     788368     248872          0      29260     497600
-/+ buffers/cache:      261508     775732
Swap:       996020      16260     979760

Ah wait, that appears to be an outgoing feed process that keeps on
allocating and freeing memory at a fast rate, so that makes sense I guess.
At least the RSS of the main innd process remains steady at around ~600 MB
and that is what is important for this application.

Thanks,

Mike.

> linux-2.6-npiggin/mm/vmscan.c |    7 ++-----
>  1 files changed, 2 insertions(+), 5 deletions(-)
>
> diff -puN mm/vmscan.c~vm-revert-fix mm/vmscan.c
> --- linux-2.6/mm/vmscan.c~vm-revert-fix	2004-06-12 16:53:02.000000000 +1000
> +++ linux-2.6-npiggin/mm/vmscan.c	2004-06-12 16:54:26.000000000 +1000
> @@ -813,9 +813,8 @@ shrink_caches(struct zone **zones, int p
> 		struct zone *zone = zones[i];
> 		int max_scan;
>
> -		zone->temp_priority = priority;
> -		if (zone->prev_priority > priority)
> -			zone->prev_priority = priority;
> +		if (zone->free_pages < zone->pages_high)
> +			zone->temp_priority = priority;
>
> 		if (zone->all_unreclaimable && priority != DEF_PRIORITY)
> 			continue;	/* Let kswapd poll it */
> @@ -996,8 +995,6 @@ scan:
> 			all_zones_ok = 0;
> 		}
> 		zone->temp_priority = priority;
> -		if (zone->prev_priority > priority)
> -			zone->prev_priority = priority;
> 		max_scan = (zone->nr_active + zone->nr_inactive)
> 						>> priority;
> 		reclaimed = shrink_zone(zone, max_scan, GFP_KERNEL,
>
> _

* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
From: Nick Piggin @ 2004-06-15  3:03 UTC
To: Miquel van Smoorenburg, Andrew Morton; +Cc: linux-mm

Miquel van Smoorenburg wrote:
> On 2004.06.12 08:56, Nick Piggin wrote:
>
>> Miquel van Smoorenburg wrote:
>>
>>> Now I tried 2.6.7-rc2 and -rc3 (well, rc2-bk-latest-before-rc3) and
>>> with those kernels, performance goes to hell because no matter
>>> how much I tune, the kernel will throw out the mmap'ed pages first.
>>> RSS of the innd process hovers around 200-250 MB instead of 600.
>>>
>>> Ideas?
>>>
>>
>> Can you try the following patch please?
>
> The patch below indeed fixes this problem. Now most of the mmap'ed files
> are actually kept in memory and RSS is around 600 MB again:
>

OK good. Cc'ing Andrew.

> $ uname -a
> Linux quantum 2.6.7-rc3 #1 SMP Mon Jun 14 12:48:34 CEST 2004 i686 GNU/Linux
> $ free
>              total       used       free     shared    buffers     cached
> Mem:       1037240     897668     139572          0     159320     501688
> -/+ buffers/cache:      236660     800580
> Swap:       996020      16160     979860
> $ ps u -C innd
> USER       PID %CPU %MEM    VSZ    RSS TTY   STAT START   TIME COMMAND
> news       277 31.8 56.2 857124 583896 ?     D    13:02  57:01 /usr/local/news/b
>
> Hmm, weird that 'free' says that 139 MB is unused... the box is doing
> lots of I/O. 'free' hovers between 30 - 250 MB over time.
>
> Look, 1 minute later:
>
> $ free
>              total       used       free     shared    buffers     cached
> Mem:       1037240     788368     248872          0      29260     497600
> -/+ buffers/cache:      261508     775732
> Swap:       996020      16260     979760
>
> Ah wait, that appears to be an outgoing feed process that keeps on allocating
> and freeing memory at a fast rate, so that makes sense I guess. At least

That would be right.

> the RSS of the main innd process remains steady at around ~600 MB and that
> is what is important for this application.
>

Absolute performance is the thing that matters at the end of the day.
Is it as good as 2.6.6 now?

Thanks

* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
From: Miquel van Smoorenburg @ 2004-06-15 14:31 UTC
To: Nick Piggin; +Cc: Andrew Morton, linux-mm

According to Nick Piggin:
> Miquel van Smoorenburg wrote:
> > On 2004.06.12 08:56, Nick Piggin wrote:
> >
> >> Miquel van Smoorenburg wrote:
> >>
> >>> Now I tried 2.6.7-rc2 and -rc3 (well, rc2-bk-latest-before-rc3) and
> >>> with those kernels, performance goes to hell because no matter
> >>> how much I tune, the kernel will throw out the mmap'ed pages first.
> >>> RSS of the innd process hovers around 200-250 MB instead of 600.
> >>>
> >> Can you try the following patch please?
> >
> > The patch below indeed fixes this problem. Now most of the mmap'ed files
> > are actually kept in memory and RSS is around 600 MB again:
>
> OK good. Cc'ing Andrew.

I've built a small test app that creates the same I/O pattern and ran it
on 2.6.6, 2.6.7-rc3 and 2.6.7-rc3+patch, and running that confirms it,
though not as dramatically as the real-life application.

Now something else that is weird, but might be unrelated and I have not
found a way to reproduce it on a different machine yet, so feel free to
ignore it; I'm just mentioning it in case someone recognizes this.

The news server process uses /dev/hd[cdg]1 directly for storage
(Cyclic News File System). There's about 12 MB/sec incoming being
stored on those 3 (SATA) disks. Look at the vmstat output:

# vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa
 4  0  22664   5216 277332 496644   28    0  8143    36  9785  2162 12 43 28 16
 1  3  22660 231252  71808 488580   16    0  5947 33856  8868  1633  9 60 11 20
 2  2  22660 273972  40988 489508    0    0  8895 21144  8875  1931 10 43 21 27
 3  0  22660 236412  73620 491148    0    0 10774 10551  9877  1937 10 44 24 22
 1  1  22660 185112 104112 492616    0    0  9677 12354 10216  1863 10 44 28 19
 2  0  22660 148700 138388 494108    0    0 10227 13919  9976  1925 11 44 24 21
 0  2  22660 123432 162032 495012    0    0  6244 15418 10065  1793 11 46 28 16
 3  0  22660  93096 190452 496292    8    0  6548 10293  9860  1975 11 43 31 15
 2  0  22660  51688 218628 497424    0    0  6405    52 10575  2063 13 48 27 12
 3  1  22660  19012 245632 499032    8    0  8108 12400 10136  1892 11 44 24 21
 2  1  22660 249192  42956 490932    0    0  8231 33005  9109  1343 10 60 13 18
 0  1  22660 240396  53764 491956    0    0 10082 18625  9504  1740 10 47 24 19
 2  2  22660 205632  86108 493408    0    0  8305 12368  8941  1775  8 33 32 26
 0  2  22660 164672 119156 494972    0    0  6867    62  9695  1894 11 40 31 18
 1  3  22660 137924 144964 496568    0    0  7099 16440 10388  1878 11 47 26 17
 1  1  22660 101604 176936 498052    0    0  9166 12332 10237  1694 12 44 28 16
 2  1  22660  67816 205376 499176    8    0  6169  6158  9906  1897 11 44 31 15
 1  1  22660  28004 236520 500652   10    0  7418  6202 10289  1744 12 44 30 14
 2  1  22660   7484 259156 492544   12    0  7494 18540 10218  1757 11 49 21 19
 1  4  22660  61664 228360 494004   72    0  6131 14412  9611  2437 10 46 20 23
 3  1  22660  76976 242652 498884   36    0  6927 16558  7560  2219 18 42 13 27
 0  1  22660  62352 267840 501140   14    0  7358 10424  8273  2601 11 32 33 23
 1  1  22660   6880 301056 502528    4    0 11045  2304 10177  2137 12 42 26 20
 0  4  22660 280848  40856 494196    0    0  6583 45092  9379  1505  9 61 13

See how "cache" remains stable, but free/buffers memory is oscillating?
That shouldn't happen, right?

I tried to reproduce it on another 2.6.7-rc3 system with

  while :; do dd if=/dev/zero of=/dev/sda8 bs=1M count=10; sleep 1; done

and while I did see it oscillating once or twice, after that it remained
stable (buffers high / free memory low) and I can't seem to be able to
reproduce it again.

Yesterday I ported my rawfs module over to 2.6. It's a minimal filesystem
that shows a block device as a single large file. I'm letting the news
server access that instead of the block device directly, so all access goes
through the page cache instead of the buffer cache, and that runs much more
smoothly, though it's harder to tune 'swappiness' with it - it seems to be
much more "all or nothing" in that case. Anyway, that's what I'm using now.

Mike.

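The small test app mentioned above was never posted to the thread. A rough
sketch of a load generator for the same kind of pattern (random touches into
a large mmap'ed file alongside a stream of sequential spool writes) might
look like the following; it is not the actual test app, and the file names,
sizes and rates are invented.

/* Sketch of a load generator: dirty random pages of a large shared mapping
 * while writing ~12 MB/sec sequentially to a second file, then watch which
 * pages reclaim throws out first.  Runs until interrupted. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE	(384UL << 20)	/* ~384 MB stand-in for the history files */
#define SPOOL_CHUNK	(1UL << 20)	/* 1 MB per spool write */

int main(void)
{
	int mfd = open("fake.index", O_RDWR | O_CREAT, 0644);
	int sfd = open("fake.spool", O_WRONLY | O_CREAT | O_APPEND, 0644);
	char *map, *buf;
	unsigned long i;

	if (mfd < 0 || sfd < 0 || ftruncate(mfd, MAP_SIZE) < 0) {
		perror("setup");
		return 1;
	}
	map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, mfd, 0);
	buf = malloc(SPOOL_CHUNK);
	if (map == MAP_FAILED || buf == NULL) {
		perror("mmap/malloc");
		return 1;
	}
	memset(buf, 'x', SPOOL_CHUNK);

	for (;;) {
		/* a burst of random touches into the mapped "index" ... */
		for (i = 0; i < 200; i++)
			map[((unsigned long)rand() % (MAP_SIZE / 4096)) * 4096]++;
		/* ... interleaved with ~12 MB/sec of streaming spool writes */
		for (i = 0; i < 12; i++)
			if (write(sfd, buf, SPOOL_CHUNK) < 0)
				perror("write");
		sleep(1);
	}
}

Comparing the RSS of such a process (and the vmstat output) across the
kernels mentioned above is the kind of before/after comparison described.
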
* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
From: Nick Piggin @ 2004-06-16  3:16 UTC
To: Miquel van Smoorenburg; +Cc: Andrew Morton, linux-mm

Miquel van Smoorenburg wrote:
> According to Nick Piggin:
>
>> Miquel van Smoorenburg wrote:
>>
>>>
>>> The patch below indeed fixes this problem. Now most of the mmap'ed files
>>> are actually kept in memory and RSS is around 600 MB again:
>>
>> OK good. Cc'ing Andrew.
>
> I've built a small test app that creates the same I/O pattern and ran it
> on 2.6.6, 2.6.7-rc3 and 2.6.7-rc3+patch, and running that confirms it,
> though not as dramatically as the real-life application.
>

Can you send the test app over?

Andrew, do you have any ideas about how to fix this so far?

> Now something else that is weird, but might be unrelated and I have not
> found a way to reproduce it on a different machine yet, so feel free to
> ignore it; I'm just mentioning it in case someone recognizes this.
>
> The news server process uses /dev/hd[cdg]1 directly for storage
> (Cyclic News File System). There's about 12 MB/sec incoming being
> stored on those 3 (SATA) disks. Look at the vmstat output:
>
> # vmstat 2
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa
>  4  0  22664   5216 277332 496644   28    0  8143    36  9785  2162 12 43 28 16
>  1  3  22660 231252  71808 488580   16    0  5947 33856  8868  1633  9 60 11 20
>  2  2  22660 273972  40988 489508    0    0  8895 21144  8875  1931 10 43 21 27
>  3  0  22660 236412  73620 491148    0    0 10774 10551  9877  1937 10 44 24 22
>  1  1  22660 185112 104112 492616    0    0  9677 12354 10216  1863 10 44 28 19
>  2  0  22660 148700 138388 494108    0    0 10227 13919  9976  1925 11 44 24 21
>  0  2  22660 123432 162032 495012    0    0  6244 15418 10065  1793 11 46 28 16
>  3  0  22660  93096 190452 496292    8    0  6548 10293  9860  1975 11 43 31 15
>  2  0  22660  51688 218628 497424    0    0  6405    52 10575  2063 13 48 27 12
>  3  1  22660  19012 245632 499032    8    0  8108 12400 10136  1892 11 44 24 21
>  2  1  22660 249192  42956 490932    0    0  8231 33005  9109  1343 10 60 13 18
>  0  1  22660 240396  53764 491956    0    0 10082 18625  9504  1740 10 47 24 19
>  2  2  22660 205632  86108 493408    0    0  8305 12368  8941  1775  8 33 32 26
>  0  2  22660 164672 119156 494972    0    0  6867    62  9695  1894 11 40 31 18
>  1  3  22660 137924 144964 496568    0    0  7099 16440 10388  1878 11 47 26 17
>  1  1  22660 101604 176936 498052    0    0  9166 12332 10237  1694 12 44 28 16
>  2  1  22660  67816 205376 499176    8    0  6169  6158  9906  1897 11 44 31 15
>  1  1  22660  28004 236520 500652   10    0  7418  6202 10289  1744 12 44 30 14
>  2  1  22660   7484 259156 492544   12    0  7494 18540 10218  1757 11 49 21 19
>  1  4  22660  61664 228360 494004   72    0  6131 14412  9611  2437 10 46 20 23
>  3  1  22660  76976 242652 498884   36    0  6927 16558  7560  2219 18 42 13 27
>  0  1  22660  62352 267840 501140   14    0  7358 10424  8273  2601 11 32 33 23
>  1  1  22660   6880 301056 502528    4    0 11045  2304 10177  2137 12 42 26 20
>  0  4  22660 280848  40856 494196    0    0  6583 45092  9379  1505  9 61 13
>
> See how "cache" remains stable, but free/buffers memory is oscillating?
> That shouldn't happen, right?
>

If it is doing IO to large regions of mapped memory, the page reclaim
can start getting a bit chunky. Not much you can do about it, but it
shouldn't do any harm.

> I tried to reproduce it on another 2.6.7-rc3 system with
>
>   while :; do dd if=/dev/zero of=/dev/sda8 bs=1M count=10; sleep 1; done
>
> and while I did see it oscillating once or twice, after that it remained
> stable (buffers high / free memory low) and I can't seem to be able to
> reproduce it again.
>

Probably because it isn't doing mmapped IO.

> Yesterday I ported my rawfs module over to 2.6. It's a minimal filesystem
> that shows a block device as a single large file. I'm letting the news
> server access that instead of the block device directly, so all access goes
> through the page cache instead of the buffer cache, and that runs much more
> smoothly, though it's harder to tune 'swappiness' with it - it seems to be
> much more "all or nothing" in that case. Anyway, that's what I'm using now.
>

In 2.6, everything basically should go through the same path I think,
so it really shouldn't make much difference. The fact that swappiness
stops having any effect sounds like the server switched from doing
mapped IO to read/write. Maybe I'm crazy... could you verify?

* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
From: Andrew Morton @ 2004-06-16  3:50 UTC
To: Nick Piggin; +Cc: miquels, linux-mm

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Can you send the test app over?

logical next step.

> Andrew, do you have any ideas about how to fix this so far?

Not sure what, if anything, is wrong yet. It could be that reclaim is now
doing the "right" thing, but this particular workload preferred the "wrong"
thing. Needs more investigation.

> >
> > See how "cache" remains stable, but free/buffers memory is oscillating?
> > That shouldn't happen, right?
> >
>
> If it is doing IO to large regions of mapped memory, the page reclaim
> can start getting a bit chunky. Not much you can do about it, but it
> shouldn't do any harm.

shrink_zone() will free arbitrarily large amounts of memory as the scanning
priority increases. Probably it shouldn't.

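For reference, the batch that shrink_zone() is handed in the code quoted
earlier in the thread is max_scan = (zone->nr_active + zone->nr_inactive)
>> priority, and the reclaim priority counts down from DEF_PRIORITY (12 in
kernels of this era) as reclaim makes poor progress, so every step doubles
the batch. A stand-alone illustration, with invented zone sizes:

/* Illustration only, not kernel code: how the per-zone scan batch grows
 * as the reclaim priority counts down.  Zone sizes are made up
 * (250000 pages is roughly 1 GB with 4K pages). */
#include <stdio.h>

int main(void)
{
	unsigned long nr_active = 100000, nr_inactive = 150000;
	int priority;

	for (priority = 12; priority >= 0; priority--)
		printf("priority %2d -> max_scan %6lu pages\n",
		       priority, (nr_active + nr_inactive) >> priority);
	return 0;
}

At priority 0 the batch is the whole zone, which is the "arbitrarily large"
behaviour being discussed.
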
* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
From: Nick Piggin @ 2004-06-16  4:03 UTC
To: Andrew Morton; +Cc: miquels, linux-mm

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> Can you send the test app over?
>
> logical next step.
>
>> Andrew, do you have any ideas about how to fix this so far?
>
> Not sure what, if anything, is wrong yet. It could be that reclaim is now
> doing the "right" thing, but this particular workload preferred the "wrong"
> thing. Needs more investigation.
>
>>> See how "cache" remains stable, but free/buffers memory is oscillating?
>>> That shouldn't happen, right?
>>>
>>
>> If it is doing IO to large regions of mapped memory, the page reclaim
>> can start getting a bit chunky. Not much you can do about it, but it
>> shouldn't do any harm.
>
> shrink_zone() will free arbitrarily large amounts of memory as the scanning
> priority increases. Probably it shouldn't.
>

Especially for kswapd, I think, because it can end up fighting with
memory allocators and think it is getting into trouble. It should
probably rather just keep putting along quietly.

I have a few experimental patches that magnify this problem, so I'll
be looking at fixing it soon. The tricky part will be trying to
maintain a similar prev_priority / temp_priority balance.

* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
From: Andrew Morton @ 2004-06-16  4:23 UTC
To: Nick Piggin; +Cc: miquels, linux-mm

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> >
> > shrink_zone() will free arbitrarily large amounts of memory as the scanning
> > priority increases. Probably it shouldn't.
> >
>
> Especially for kswapd, I think, because it can end up fighting with
> memory allocators and think it is getting into trouble. It should
> probably rather just keep putting along quietly.
>
> I have a few experimental patches that magnify this problem, so I'll
> be looking at fixing it soon. The tricky part will be trying to
> maintain a similar prev_priority / temp_priority balance.

hm, I don't see why. Why not simply bale from shrink_listing as soon as
we've reclaimed SWAP_CLUSTER_MAX pages?

I got bored of shrink_zone() bugs and rewrote it again yesterday. Haven't
tested it much. I really hate struct scan_control btw ;)


We've been futzing with the scan rates of the inactive and active lists far
too much, and it's still not right (Anton reports interrupt-off times of over
a second).

- We have this logic in there from 2.4.early (at least) which tries to keep
  the inactive list 1/3rd the size of the active list. Or something.

  I really cannot see any logic behind this, so toss it out and change the
  arithmetic in there so that all pages on both lists have equal scan rates.

- Chunk the work up so we never hold interrupts off for more than 32 pages
  worth of scanning.

- Make the per-zone scan-count accumulators unsigned long rather than
  atomic_t.

  Mainly because atomic_t's could conceivably overflow, but also because
  access to these counters is racy-by-design anyway.

Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/include/linux/mmzone.h |    4 +-
 25-akpm/mm/page_alloc.c        |    4 +-
 25-akpm/mm/vmscan.c            |   70 ++++++++++++++++++-----------------------
 3 files changed, 35 insertions(+), 43 deletions(-)

diff -puN mm/vmscan.c~vmscan-scan-sanity mm/vmscan.c
--- 25/mm/vmscan.c~vmscan-scan-sanity	2004-06-15 02:19:01.485627112 -0700
+++ 25-akpm/mm/vmscan.c	2004-06-15 02:49:29.317754392 -0700
@@ -789,54 +789,46 @@ refill_inactive_zone(struct zone *zone, 
 }
 
 /*
- * Scan `nr_pages' from this zone.  Returns the number of reclaimed pages.
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
 static void shrink_zone(struct zone *zone, struct scan_control *sc)
 {
-	unsigned long scan_active, scan_inactive;
-	int count;
-
-	scan_inactive = (zone->nr_active + zone->nr_inactive) >> sc->priority;
+	unsigned long nr_active;
+	unsigned long nr_inactive;
 
 	/*
-	 * Try to keep the active list 2/3 of the size of the cache.  And
-	 * make sure that refill_inactive is given a decent number of pages.
-	 *
-	 * The "scan_active + 1" here is important.  With pagecache-intensive
-	 * workloads the inactive list is huge, and `ratio' evaluates to zero
-	 * all the time.  Which pins the active list memory.  So we add one to
-	 * `scan_active' just to make sure that the kernel will slowly sift
-	 * through the active list.
+	 * Add one to `nr_to_scan' just to make sure that the kernel will
+	 * slowly sift through the active list.
 	 */
-	if (zone->nr_active >= 4*(zone->nr_inactive*2 + 1)) {
-		/* Don't scan more than 4 times the inactive list scan size */
-		scan_active = 4*scan_inactive;
-	} else {
-		unsigned long long tmp;
-
-		/* Cast to long long so the multiply doesn't overflow */
-
-		tmp = (unsigned long long)scan_inactive * zone->nr_active;
-		do_div(tmp, zone->nr_inactive*2 + 1);
-		scan_active = (unsigned long)tmp;
-	}
-
-	atomic_add(scan_active + 1, &zone->nr_scan_active);
-	count = atomic_read(&zone->nr_scan_active);
-	if (count >= SWAP_CLUSTER_MAX) {
-		atomic_set(&zone->nr_scan_active, 0);
-		sc->nr_to_scan = count;
-		refill_inactive_zone(zone, sc);
-	}
+	zone->nr_scan_active += (zone->nr_active >> sc->priority) + 1;
+	nr_active = zone->nr_scan_active;
+	if (nr_active >= SWAP_CLUSTER_MAX)
+		zone->nr_scan_active = 0;
+	else
+		nr_active = 0;
+
+	zone->nr_scan_inactive += (zone->nr_inactive >> sc->priority) + 1;
+	nr_inactive = zone->nr_scan_inactive;
+	if (nr_inactive >= SWAP_CLUSTER_MAX)
+		zone->nr_scan_inactive = 0;
+	else
+		nr_inactive = 0;
+
+	while (nr_active || nr_inactive) {
+		if (nr_active) {
+			sc->nr_to_scan = min(nr_active,
+					(unsigned long)SWAP_CLUSTER_MAX);
+			nr_active -= sc->nr_to_scan;
+			refill_inactive_zone(zone, sc);
+		}
 
-	atomic_add(scan_inactive, &zone->nr_scan_inactive);
-	count = atomic_read(&zone->nr_scan_inactive);
-	if (count >= SWAP_CLUSTER_MAX) {
-		atomic_set(&zone->nr_scan_inactive, 0);
-		sc->nr_to_scan = count;
-		shrink_cache(zone, sc);
+		if (nr_inactive) {
+			sc->nr_to_scan = min(nr_inactive,
+					(unsigned long)SWAP_CLUSTER_MAX);
+			nr_inactive -= sc->nr_to_scan;
+			shrink_cache(zone, sc);
+		}
 	}
 }
diff -puN include/linux/mmzone.h~vmscan-scan-sanity include/linux/mmzone.h
--- 25/include/linux/mmzone.h~vmscan-scan-sanity	2004-06-15 02:49:35.705783264 -0700
+++ 25-akpm/include/linux/mmzone.h	2004-06-15 02:49:48.283871104 -0700
@@ -118,8 +118,8 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct list_head	active_list;
 	struct list_head	inactive_list;
-	atomic_t		nr_scan_active;
-	atomic_t		nr_scan_inactive;
+	unsigned long		nr_scan_active;
+	unsigned long		nr_scan_inactive;
 	unsigned long		nr_active;
 	unsigned long		nr_inactive;
 	int			all_unreclaimable; /* All pages pinned */
diff -puN mm/page_alloc.c~vmscan-scan-sanity mm/page_alloc.c
--- 25/mm/page_alloc.c~vmscan-scan-sanity	2004-06-15 02:50:04.404420408 -0700
+++ 25-akpm/mm/page_alloc.c	2004-06-15 02:50:53.752918296 -0700
@@ -1482,8 +1482,8 @@ static void __init free_area_init_core(s
 				zone_names[j], realsize, batch);
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
-		atomic_set(&zone->nr_scan_active, 0);
-		atomic_set(&zone->nr_scan_inactive, 0);
+		zone->nr_scan_active = 0;
+		zone->nr_scan_inactive = 0;
 		zone->nr_active = 0;
 		zone->nr_inactive = 0;
 		if (!size)
_

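A toy user-space model of the batching this patch introduces, to show the
intended behaviour: scan credit accumulates per list and is drained in
SWAP_CLUSTER_MAX-sized chunks, so interrupts are never held off for more
than 32 pages' worth of scanning at a time. The zone size and priority
below are invented.

/* Toy model, not kernel code: accumulate scan credit for one list and
 * drain it in SWAP_CLUSTER_MAX-sized batches, as the rewritten
 * shrink_zone() above does. */
#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

int main(void)
{
	unsigned long nr_inactive = 100000;	/* pages on the inactive list (made up) */
	unsigned long nr_scan_inactive = 0;	/* accumulated scan credit */
	unsigned long todo, batches = 0;
	int priority = 10;			/* a couple of steps below DEF_PRIORITY */

	nr_scan_inactive += (nr_inactive >> priority) + 1;
	if (nr_scan_inactive >= SWAP_CLUSTER_MAX) {
		todo = nr_scan_inactive;
		nr_scan_inactive = 0;
	} else {
		todo = 0;	/* not enough credit yet; try again next pass */
	}

	while (todo) {
		unsigned long batch = todo < SWAP_CLUSTER_MAX ? todo : SWAP_CLUSTER_MAX;
		/* in the kernel, shrink_cache(zone, sc) scans `batch' pages here */
		todo -= batch;
		batches++;
	}
	printf("drained the credit in %lu batches of at most %lu pages\n",
	       batches, SWAP_CLUSTER_MAX);
	return 0;
}
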
* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
From: Nick Piggin @ 2004-06-16  4:41 UTC
To: Andrew Morton; +Cc: miquels, linux-mm

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>>> shrink_zone() will free arbitrarily large amounts of memory as the scanning
>>> priority increases. Probably it shouldn't.
>>>
>>
>> Especially for kswapd, I think, because it can end up fighting with
>> memory allocators and think it is getting into trouble. It should
>> probably rather just keep putting along quietly.
>>
>> I have a few experimental patches that magnify this problem, so I'll
>> be looking at fixing it soon. The tricky part will be trying to
>> maintain a similar prev_priority / temp_priority balance.
>
> hm, I don't see why. Why not simply bale from shrink_listing as soon as
> we've reclaimed SWAP_CLUSTER_MAX pages?
>

Oh yeah, that would be the way to go about it. Your patch looks alright
as a platform to achieve this.

> I got bored of shrink_zone() bugs and rewrote it again yesterday. Haven't
> tested it much. I really hate struct scan_control btw ;)
>

Well, I can keep it local here. I have some stuff which requires more
things to be passed up and down the call chains, which gets annoying
when passing lots of things by reference.

> We've been futzing with the scan rates of the inactive and active lists far
> too much, and it's still not right (Anton reports interrupt-off times of over
> a second).
>
> - We have this logic in there from 2.4.early (at least) which tries to keep
>   the inactive list 1/3rd the size of the active list. Or something.
>
>   I really cannot see any logic behind this, so toss it out and change the
>   arithmetic in there so that all pages on both lists have equal scan rates.
>

I think it is somewhat to do with use-once logic. If your inactive list
remains full of use-once pages, you can happily scan them while putting
minimal pressure on the active list.

I don't think we need to try to keep it *at least* 1/3rd the size anymore.
From distant memory, that may have been when the inactive list was more
of a "writeout queue". I don't know though, it might still be useful.

> - Chunk the work up so we never hold interrupts off for more than 32 pages
>   worth of scanning.
>

Yeah, this was a bit silly. Good fix.

> - Make the per-zone scan-count accumulators unsigned long rather than
>   atomic_t.
>
>   Mainly because atomic_t's could conceivably overflow, but also because
>   access to these counters is racy-by-design anyway.
>

Seems OK other than my one possible issue.

* Re: Keeping mmap'ed files in core regression in 2.6.7-rc
From: Miquel van Smoorenburg @ 2004-06-17 10:50 UTC
To: Nick Piggin; +Cc: Miquel van Smoorenburg, Andrew Morton, linux-mm

On 2004.06.16 05:16, Nick Piggin wrote:
> Miquel van Smoorenburg wrote:
> > According to Nick Piggin:
> >
> >> Miquel van Smoorenburg wrote:
> >>
> >>>
> >>> The patch below indeed fixes this problem. Now most of the mmap'ed files
> >>> are actually kept in memory and RSS is around 600 MB again:
> >>
> >> OK good. Cc'ing Andrew.
> >
> > I've built a small test app that creates the same I/O pattern and ran it
> > on 2.6.6, 2.6.7-rc3 and 2.6.7-rc3+patch, and running that confirms it,
> > though not as dramatically as the real-life application.
> >
>
> Can you send the test app over?
> Andrew, do you have any ideas about how to fix this so far?

I'll have to come back on this later - I'm about to go on vacation, and
there's some other stuff that needs to be taken care of first.

Mike.
