linux-mm.kvack.org archive mirror
* Subtle MM bug
@ 2001-01-07 20:59 Zlatko Calusic
  2001-01-07 21:37 ` Rik van Riel
  0 siblings, 1 reply; 34+ messages in thread
From: Zlatko Calusic @ 2001-01-07 20:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm

I'm trying to get more familiar with the MM code in 2.4.0, as can be
seen from the many questions I have on the subject. I discovered nasty
mm behaviour under even moderate load (2.2 had no such trouble).

Things go berserk if you have one big process whose working set is
around your physical memory size. A typical memory hog is enough
to trigger the bad behaviour. The final effect is that physical
memory gets flooded with swap cache pages while, at the same time,
the system absorbs a ridiculous amount of swap space. xmem is, as
usual, very good at detecting this, and you just need to press
Alt-SysRq-M to see that most of the memory (e.g. 90%) is populated
with swap cache pages.

For instance, on my 192MB configuration, firing up the hogmem program,
which allocates, say, 170MB of memory and dirties it, leads to
215MB of swap being used. vmstat 1 shows that the pagecache size is
constantly growing (in fact, that is the swap cache growing) during the
second pass of the hogmem program.

...
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd  free buff  cache   si   so    bi    bo   in    cs  us  sy  id
 0  1  1 131488  1592  400  62384 4172 5188  1092  1298  353  1447   2   4  94
 0  1  1 136584  1592  400  67428 5860 4104  1465  1034  322  1327   3   3  93
 0  1  1 141668  1592  388  72536 5504 4420  1376  1106  323  1423   1   3  95
 0  1  1 146724  1592  380  77592 5996 4236  1499  1060  335  1096   2   3  94
 0  1  1 151876  1600  320  82764 6264 3712  1566   936  327  1226   3   4  93
 0  1  1 157016  1600  320  87908 5284 4268  1321  1068  315  1248   1   2  96
 1  0  0 157016  1600  308  87792 1836 5168   459  1293  281  1324   3   3  94
 0  1  0 162204  1600  304  92892 7784 5236  1946  1315  385  1353   3   5  92
 0  1  0 167216  1600  304  97780 3496 5016   874  1256  301  1222   0   2  97
 0  1  1 177904  1608  284 108276 5160 5168  1290  1300  330  1453   1   4  94
 0  1  2 182008  1588  288 112264 4936 3344  1268   838  293   801   2   3  95
 0  2  1 183620  1588  260 114012 3064 1756   830   445  290   846   0  15  85
 0  2  2 185384  1596  180 115864 2320 2620   635   658  285   722   1  29  70
 0  3  2 187528  1592  220 117892 2488 2224   657   557  273   754   3  30  67
 0  4  1 190512  1592  236 120772 2524 3012   725   760  343  1080   1  14  85
 0  4  1 195780  1592  240 125868 2336 5316   613  1331  381  1624   2   2  96
 1  0  1 200992  1592  248 131052 2080 2176   623   552  234  1044   3  23  74
 0  1  0 200996  1592  252 130948 2208 3048   580   762  256  1065  10  10  80
 0  1  1 206240  1592  252 136076 2988 5252   760  1314  309  1406   7   4  8
 0  2  1 211408  1592  256 141080 5424 5180  1389  1303  395  1885   3   5  91
 0  2  0 214744  1592  264 144280 4756 3328  1223   834  327  1211   1   5  95
 1  0  0 214868  1592  244 144468 4344 5148  1087  1295  303  1189  11   2  86
 0  1  1 214900  1592  248 144496 4360 3244  1098   812  318  1467   7   4  89
 0  1  1 214916  1592  248 144520 4280 3452  1070   865  336  1602   3   3  94
 0  1  1 214964  1592  248 144580 4972 4184  1243  1054  368  1620   3   5  92
 0  2  2 214956  1592  272 144548 3700 4544  1081  1142  665  2952   1   1  98
 0  1  0 214992  1592  272 144588 1220 5088   305  1274  282  1363   1   4  95
 0  1  1 215012  1592  272 144600 3640 4420   910  1106  325  1579   3   2  9
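
To make the trend in the samples above concrete, the sketch below
compares the first and last vmstat rows. The struct and function names
are hypothetical; only the numbers are taken from the output above, and
the columns are assumed to be in KB, as vmstat reports by default.

```c
#include <stdio.h>

/* One vmstat sample, reduced to the two columns of interest (KB). */
struct vm_sample {
    long swpd;   /* swap space in use */
    long cache;  /* page cache size, which includes the swap cache */
};

/* Growth of the swpd column between two samples. */
long swpd_growth(struct vm_sample first, struct vm_sample last)
{
    return last.swpd - first.swpd;
}

/* Growth of the cache column between two samples. */
long cache_growth(struct vm_sample first, struct vm_sample last)
{
    return last.cache - first.cache;
}

/* Print both deltas for the first and last rows quoted above. */
void print_growth(void)
{
    struct vm_sample first = { 131488, 62384 };
    struct vm_sample last  = { 215012, 144600 };

    printf("swpd  grew by %ld KB\n", swpd_growth(first, last));
    printf("cache grew by %ld KB\n", cache_growth(first, last));
}
```

The two columns grow by nearly the same amount (83524 KB and 82216 KB),
which is consistent with the new page cache pages being swap cache.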

Any thoughts on this?
-- 
Zlatko
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


* Re: Subtle MM bug
  2001-01-07 20:59 Subtle MM bug Zlatko Calusic
@ 2001-01-07 21:37 ` Rik van Riel
  2001-01-07 22:33   ` Zlatko Calusic
  2001-01-09  2:01   ` Zlatko Calusic
  0 siblings, 2 replies; 34+ messages in thread
From: Rik van Riel @ 2001-01-07 21:37 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: linux-kernel, linux-mm

On 7 Jan 2001, Zlatko Calusic wrote:

> Things go berserk if you have one big process whose working set
> is around your physical memory size.

"Go berserk" in what way?  Does the system cause lots of extra
swap IO, and does it make the system thrash where 2.2 didn't
even touch the disk?

> The final effect is that physical memory gets flooded with
> swap cache pages while, at the same time, the system absorbs
> a ridiculous amount of swap space.

This is mostly because Linux 2.4 keeps dirty pages in the
swap cache. Under Linux 2.2 a page would be deleted from the
swap cache when a program writes to it, but in Linux 2.4 it
can stay in the swap cache.

Oh, and don't forget that pages in the swap cache can also
be resident in the process, so it's not like the swap cache
is "eating into" the process' RSS ;)
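
The difference Rik describes can be pictured with a toy model. This is
only a sketch under stated assumptions: the struct, the enum, and the
function names are hypothetical and are not the kernel's data
structures. A page that is written to either gives up its swap slot
(the 2.2 behaviour described above) or keeps it (the 2.4 behaviour).

```c
#include <stddef.h>

/* What happens to a swap cache page when the process writes to it. */
enum swap_cache_policy {
    DROP_ON_WRITE,  /* behaviour described above for 2.2 */
    KEEP_ON_WRITE   /* behaviour described above for 2.4 */
};

/* Hypothetical page state, just enough for the illustration. */
struct toy_page {
    int resident;      /* mapped in the process, counts toward RSS */
    int in_swap_cache; /* still holds a swap slot */
    int dirty;         /* modified since it was read from swap */
};

/* Model a write to a page under the given policy. */
void toy_write(struct toy_page *page, enum swap_cache_policy policy)
{
    page->dirty = 1;
    if (policy == DROP_ON_WRITE)
        page->in_swap_cache = 0;  /* swap slot is released */
}

/* Count how many pages still hold a swap slot. */
size_t toy_swap_slots(const struct toy_page *pages, size_t n)
{
    size_t used = 0;

    for (size_t i = 0; i < n; i++)
        if (pages[i].in_swap_cache)
            used++;
    return used;
}
```

Under KEEP_ON_WRITE every written page is both resident and still
counted against swap, which matches the point that swap cache pages
are not taken away from the process' RSS.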

> For instance on my 192MB configuration, firing up the hogmem
> program which allocates let's say 170MB of memory and dirties it
> leads to 215MB of swap used.

So that's 170MB of swap space for hogmem and 45MB for
the other things in the system (daemons, X, ...).

Sounds pretty ok, except maybe for the fact that now
Linux allocates (not uses!) a lot more swap space than
before and some people may need to add some swap space
to their system ...


Now if 2.4 has worse _performance_ than 2.2 due to one
reason or another, that I'd like to hear about ;)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: Subtle MM bug
  2001-01-07 21:37 ` Rik van Riel
@ 2001-01-07 22:33   ` Zlatko Calusic
  2001-01-09  2:01   ` Zlatko Calusic
  1 sibling, 0 replies; 34+ messages in thread
From: Zlatko Calusic @ 2001-01-07 22:33 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> On 7 Jan 2001, Zlatko Calusic wrote:
> 
> > Things go berserk if you have one big process whose working set
> > is around your physical memory size.
> 
> "Go berserk" in what way?  Does the system cause lots of extra
> swap IO, and does it make the system thrash where 2.2 didn't
> even touch the disk?
>

Well, I think yes. I'll do some testing on the 2.2 before I can tell
you for sure, but definitely the system is behaving badly where I
think it should not.

> > The final effect is that physical memory gets flooded with
> > swap cache pages while, at the same time, the system absorbs
> > a ridiculous amount of swap space.
> 
> This is mostly because Linux 2.4 keeps dirty pages in the
> swap cache. Under Linux 2.2 a page would be deleted from the
> swap cache when a program writes to it, but in Linux 2.4 it
> can stay in the swap cache.
>

OK, I can buy that.

> Oh, and don't forget that pages in the swap cache can also
> be resident in the process, so it's not like the swap cache
> is "eating into" the process' RSS ;)
>

So far so good... A little bit weird but not alarming per se.

> > For instance on my 192MB configuration, firing up the hogmem
> > program which allocates let's say 170MB of memory and dirties it
> > leads to 215MB of swap used.
> 
> So that's 170MB of swap space for hogmem and 45MB for
> the other things in the system (daemons, X, ...).
>

Yes, that's it. So it looks like all of my processes are in
swap. That can't be good. I mean, even Solaris (known to eat swap
space like there's no tomorrow :)) would probably be more polite.

> Sounds pretty ok, except maybe for the fact that now
> Linux allocates (not uses!) a lot more swap space than
> before and some people may need to add some swap space
> to their system ...
>

Yes, I would say really a lot more. Big difference.

Also, I don't see a difference between allocated and used swap space
on Linux. Could you elaborate on that?

> 
> Now if 2.4 has worse _performance_ than 2.2 due to one
> reason or another, that I'd like to hear about ;)
> 

I'll get back to you later with more data. Time to boot 2.2. :)
-- 
Zlatko

* Re: Subtle MM bug
  2001-01-07 21:37 ` Rik van Riel
  2001-01-07 22:33   ` Zlatko Calusic
@ 2001-01-09  2:01   ` Zlatko Calusic
  2001-01-17  4:48     ` Rik van Riel
  1 sibling, 1 reply; 34+ messages in thread
From: Zlatko Calusic @ 2001-01-09  2:01 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> Now if 2.4 has worse _performance_ than 2.2 due to one
> reason or another, that I'd like to hear about ;)
> 

Oh, well, it seems that I was wrong. :)


First test: hogmem 180 5 = allocate 180MB and dirty it 5 times (on a
192MB machine)

kernel | swap usage | speed
-------------------------------
2.2.17 |  48 MB     | 11.8 MB/s
-------------------------------
2.4.0  | 206 MB     | 11.1 MB/s
-------------------------------

So 2.2 is only marginally faster. Also it can be seen that 2.4 uses 4
times more swap space. If Linus says it's ok... :)
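
hogmem's source does not appear in this thread. As a rough sketch of
what such a test does, assuming a 4 KB page size, the hypothetical
function below allocates a buffer and writes one byte to every page on
each pass, so that every page is dirtied again each time around.

```c
#include <stddef.h>
#include <stdlib.h>

#define HOG_PAGE_SIZE 4096  /* assumed page size */

/*
 * Allocate `bytes` of memory and dirty every page `passes` times.
 * Returns the total number of page writes, or -1 if the
 * allocation fails.
 */
long hog_dirty(size_t bytes, int passes)
{
    /* volatile keeps the compiler from optimising the writes away */
    volatile unsigned char *buf = malloc(bytes);
    long touched = 0;

    if (buf == NULL)
        return -1;

    for (int pass = 0; pass < passes; pass++) {
        for (size_t off = 0; off < bytes; off += HOG_PAGE_SIZE) {
            buf[off] = (unsigned char)(pass + 1);  /* dirty the page */
            touched++;
        }
    }

    free((void *)buf);
    return touched;
}
```

A call such as hog_dirty((size_t)180 << 20, 5) would correspond to the
180MB, five-pass run above. The MB/s figure that hogmem reports is not
modelled here.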


Second test: kernel compile make -j32 (empirically this puts the VM
under load, but not excessively!)

2.2.17 -> make -j32  392.49s user 47.87s system 168% cpu 4:21.13 total
2.4.0  -> make -j32  389.59s user 31.29s system 182% cpu 3:50.24 total

Now, is this great news or what, 2.4.0 is definitely faster.
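
The size of the difference can be worked out from the two timing lines
above. The helper names below are hypothetical; the inputs are the
figures quoted above.

```c
/* Convert a "minutes:seconds" wall-clock time to seconds. */
double wall_seconds(int minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* Percentage by which `after` is smaller than `before`. */
double percent_faster(double before, double after)
{
    return (before - after) / before * 100.0;
}

/* CPU utilisation in percent: (user + system) / wall-clock time. */
double cpu_percent(double user, double sys, double wall)
{
    return (user + sys) / wall * 100.0;
}
```

With 4:21.13 for 2.2.17 and 3:50.24 for 2.4.0, percent_faster() gives
a wall-clock reduction of about 11.8%, and applying it to the system
times (47.87s and 31.29s) gives about 35%. cpu_percent() reproduces
the 168% and 182% figures in the output, so the numbers are internally
consistent.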

-- 
Zlatko

* Re: Subtle MM bug
  2001-01-09  2:01   ` Zlatko Calusic
@ 2001-01-17  4:48     ` Rik van Riel
  2001-01-17 18:53       ` Zlatko Calusic
  0 siblings, 1 reply; 34+ messages in thread
From: Rik van Riel @ 2001-01-17  4:48 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: linux-kernel, linux-mm

On 9 Jan 2001, Zlatko Calusic wrote:
> Rik van Riel <riel@conectiva.com.br> writes:
>
> > Now if 2.4 has worse _performance_ than 2.2 due to one
> > reason or another, that I'd like to hear about ;)
> >
>
> Oh, well, it seems that I was wrong. :)
>
> First test: hogmem 180 5 = allocate 180MB and dirty it 5 times (on a
> 192MB machine)
>
> kernel | swap usage | speed
> -------------------------------
> 2.2.17 |  48 MB     | 11.8 MB/s
> -------------------------------
> 2.4.0  | 206 MB     | 11.1 MB/s
> -------------------------------
>
> So 2.2 is only marginally faster. Also it can be seen that 2.4
> uses 4 times more swap space. If Linus says it's ok... :)

I have been working on some changes to page_launder() which
might just fix this problem. Quick and dirty patches are on
my home page and I'll try to clean things up and make something
correct & clean later today or tomorrow ;)

> Second test: kernel compile make -j32 (empirically this puts the
> VM under load, but not excessively!)
>
> 2.2.17 -> make -j32  392.49s user 47.87s system 168% cpu 4:21.13 total
> 2.4.0  -> make -j32  389.59s user 31.29s system 182% cpu 3:50.24 total
>
> Now, is this great news or what, 2.4.0 is definitely faster.

One problem is that these tasks may be waiting on kswapd when
kswapd might not get scheduled in on time. On the one hand this
will mean lower load and less thrashing, on the other hand it
means more IO wait.

This is another area where we may be able to improve some things.

(btw, according to Alan the 2.4 kernel is the first one to break
the 1.2 kernel compiling speed record on an 8MB machine he has ;))

cheers,

Rik  (stuck in Australia at a conference)
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: Subtle MM bug
  2001-01-17  4:48     ` Rik van Riel
@ 2001-01-17 18:53       ` Zlatko Calusic
  2001-01-18  1:32         ` Rik van Riel
  0 siblings, 1 reply; 34+ messages in thread
From: Zlatko Calusic @ 2001-01-17 18:53 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> > Second test: kernel compile make -j32 (empirically this puts the
> > VM under load, but not excessively!)
> >
> > 2.2.17 -> make -j32  392.49s user 47.87s system 168% cpu 4:21.13 total
> > 2.4.0  -> make -j32  389.59s user 31.29s system 182% cpu 3:50.24 total
> >
> > Now, is this great news or what, 2.4.0 is definitely faster.
> 
> One problem is that these tasks may be waiting on kswapd when
> kswapd might not get scheduled in on time. On the one hand this
> will mean lower load and less thrashing, on the other hand it
> means more IO wait.
> 

Hm, if all tasks are waiting for memory, what is stopping kswapd from
running? :)
-- 
Zlatko

* Re: Subtle MM bug
  2001-01-17 18:53       ` Zlatko Calusic
@ 2001-01-18  1:32         ` Rik van Riel
  0 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2001-01-18  1:32 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: linux-kernel, linux-mm

On 17 Jan 2001, Zlatko Calusic wrote:
> Rik van Riel <riel@conectiva.com.br> writes:
>
> > > Second test: kernel compile make -j32 (empirically this puts the
> > > VM under load, but not excessively!)
> > >
> > > 2.2.17 -> make -j32  392.49s user 47.87s system 168% cpu 4:21.13 total
> > > 2.4.0  -> make -j32  389.59s user 31.29s system 182% cpu 3:50.24 total
> > >
> > > Now, is this great news or what, 2.4.0 is definitely faster.
> >
> > One problem is that these tasks may be waiting on kswapd when
> > kswapd might not get scheduled in on time. On the one hand this
> > will mean lower load and less thrashing, on the other hand it
> > means more IO wait.
>
> Hm, if all tasks are waiting for memory, what is stopping kswapd
> from running? :)

Suppose you have 8 high-priority tasks waiting on kswapd
and one lower-priority (but still higher than kswapd)
process running and preventing kswapd from doing its work.
Oh .. and also preventing the higher-priority tasks from
being woken up and continuing...
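
The scenario can be sketched with a toy strict-priority scheduler.
This is an assumption made for illustration only: the real 2.4
scheduler is more involved, and the struct and function names below
are hypothetical. The point is that blocked tasks are never picked,
and a runnable task that outranks kswapd keeps kswapd off the CPU.

```c
#include <stddef.h>

/* Hypothetical task descriptor for the illustration. */
struct toy_task {
    const char *name;
    int priority;  /* larger value is scheduled first (assumption) */
    int runnable;  /* 0 while blocked, e.g. waiting on kswapd */
};

/*
 * Return the index of the highest-priority runnable task,
 * or -1 if nothing is runnable.
 */
int pick_next(const struct toy_task *tasks, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].runnable)
            continue;
        if (best < 0 || tasks[i].priority > tasks[best].priority)
            best = (int)i;
    }
    return best;
}
```

With a blocked high-priority waiter, a runnable middle-priority task,
and a runnable kswapd at the lowest priority, pick_next() keeps
choosing the middle task, so kswapd never frees the memory the waiter
needs.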


Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: Subtle MM bug
  2001-01-09  3:12               ` Linus Torvalds
  2001-01-09 20:33                 ` Marcelo Tosatti
@ 2001-01-17  4:54                 ` Rik van Riel
  1 sibling, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2001-01-17  4:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marcelo Tosatti, Stephen C. Tweedie, David S. Miller, linux-mm

On Mon, 8 Jan 2001, Linus Torvalds wrote:

>  - gets rid of the complex "best mm" logic and replaces it with the
>    round-robin thing as discussed.

This could help IO clustering as well, which should be good
whenever we want to swap the data back in ;)

>  - it cleans up and simplifies the MM "priority" thing. In fact, right now
>    only one priority is ever used,

Sounds great.

In the week that I've been offline I have been working on
page_launder and doing a few other improvements to the VM.

Once I get the time to clean everything up I think we can
take 2.4 to a slightly better performance level without
having to change anything big.

regards,

Rik (at linux.conf.au)
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: Subtle MM bug
  2001-01-11  9:42                               ` Stephen C. Tweedie
@ 2001-01-11 15:24                                 ` Marcelo Tosatti
  0 siblings, 0 replies; 34+ messages in thread
From: Marcelo Tosatti @ 2001-01-11 15:24 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, David S. Miller, Rik van Riel, linux-mm


On Thu, 11 Jan 2001, Stephen C. Tweedie wrote:

> This might be as simple as clamping the value of the counter to some
> arbitrary maximum value such as num_physpages.

OK, I've taken this suggestion and used it to limit the counter.

I've also changed some of Linus's changes to swap_out() in pre2 (related to
page aging).

I've seen quite nice performance improvements with the pte scanning
(which moves the dirty pte bits to the pages) on dbench: from 7MB/sec to
9.5MB/sec (128MB, 48 threads).

The pte scanning will be a big win for databases with heavy IO, I suppose.

The following patch is against 2.4.1pre2.

Comments?

diff -Nur --exclude-from=exclude linux.orig/mm/swap.c linux/mm/swap.c
--- linux.orig/mm/swap.c	Thu Jan 11 11:13:37 2001
+++ linux/mm/swap.c	Thu Jan 11 14:38:09 2001
@@ -200,17 +200,22 @@
 {
 	if (PageInactiveDirty(page)) {
 		del_page_from_inactive_dirty_list(page);
-		add_page_to_active_list(page);
 	} else if (PageInactiveClean(page)) {
 		del_page_from_inactive_clean_list(page);
-		add_page_to_active_list(page);
 	} else {
 		/*
 		 * The page was not on any list, so we take care
 		 * not to do anything.
 		 */
+		goto inc_age;
 	}
 
+	add_page_to_active_list(page);
+	
+	if(bg_page_aging < num_physpages)
+		bg_page_aging++;
+
+inc_age:
 	/* Make sure the page gets a fair chance at staying active. */
 	if (page->age < PAGE_AGE_START)
 		page->age = PAGE_AGE_START;
diff -Nur --exclude-from=exclude linux.orig/mm/vmscan.c linux/mm/vmscan.c
--- linux.orig/mm/vmscan.c	Thu Jan 11 11:13:37 2001
+++ linux/mm/vmscan.c	Thu Jan 11 14:52:04 2001
@@ -24,17 +24,8 @@
 
 #include <asm/pgalloc.h>
 
-/*
- * The swap-out functions return 1 if they successfully
- * threw something out, and we got a free page. It returns
- * zero if it couldn't do anything, and any other value
- * indicates it decreased rss, but the page was shared.
- *
- * NOTE! If it sleeps, it *must* return 1 to make sure we
- * don't continue with the swap-out. Otherwise we may be
- * using a process that no longer actually exists (it might
- * have died while we slept).
- */
+int bg_page_aging = 0;
+
 static void try_to_swap_out(struct mm_struct * mm, struct vm_area_struct* vma, unsigned long address, pte_t * page_table, struct page *page)
 {
 	pte_t pte;
@@ -42,12 +33,18 @@
 
 	/* Don't look at this pte if it's been accessed recently. */
 	if (ptep_test_and_clear_young(page_table)) {
-		page->age += PAGE_AGE_ADV;
-		if (page->age > PAGE_AGE_MAX)
-			page->age = PAGE_AGE_MAX;
+		age_page_up(page);
 		return;
+	} else {
+		age_page_down_ageonly(page);
+		if (bg_page_aging)
+			bg_page_aging--;
 	}
 
+	/* Unmap only old pages */
+	if (page->age > 0)
+		return;
+
 	if (TryLockPage(page))
 		return;
 
@@ -268,7 +265,7 @@
 	return nr < SWAP_MIN ? SWAP_MIN : nr;
 }
 
-static int swap_out(unsigned int priority, int gfp_mask)
+static int swap_out(unsigned int priority, int background)
 {
 	int counter;
 	int retval = 0;
@@ -300,6 +297,13 @@
 		/* Walk about 6% of the address space each time */
 		retval |= swap_out_mm(mm, swap_amount(mm));
 		mmput(mm);
+		/* 
+		 *  In the case of background aging, stop
+		 *  the scan when we aged the necessary amount
+		 *  of pages.
+		 */
+		if (background && !bg_page_aging)
+			break;
 	} while (--counter >= 0);
 	return retval;
 
@@ -630,22 +634,24 @@
 /**
  * refill_inactive_scan - scan the active list and find pages to deactivate
  * @priority: the priority at which to scan
- * @oneshot: exit after deactivating one page
+ * @background: slightly different behaviour for background scanning
  *
  * This function will scan a portion of the active list to find
  * unused pages, those pages will then be moved to the inactive list.
  */
-int refill_inactive_scan(unsigned int priority, int oneshot)
+int refill_inactive_scan(unsigned int priority, int background)
 {
 	struct list_head * page_lru;
 	struct page * page;
-	int maxscan, page_active = 0;
+	int maxscan;
 	int ret = 0;
+	int deactivate = 1;
 
 	/* Take the lock while messing with the list... */
 	spin_lock(&pagemap_lru_lock);
 	maxscan = nr_active_pages >> priority;
 	while (maxscan-- > 0 && (page_lru = active_list.prev) != &active_list) {
+		int page_active = 0;
 		page = list_entry(page_lru, struct page, lru);
 
 		/* Wrong page on list?! (list corruption, should not happen) */
@@ -660,9 +666,19 @@
 		if (PageTestandClearReferenced(page)) {
 			age_page_up_nolock(page);
 			page_active = 1;
-		} else {
+		} else if (deactivate) {
 			age_page_down_ageonly(page);
 			/*
+			 * We're aging down a page. Decrement the counter if it
+ 			 * has not reached zero yet. If it reached zero, and we 			 * are doing background scan, stop deactivating pages.
+			 */
+			if (bg_page_aging)
+				bg_page_aging--;
+			else if (background) {
+				deactivate = 0;
+				continue;	
+			}
+			/*
 			 * Since we don't hold a reference on the page
 			 * ourselves, we have to do our test a bit more
 			 * strict then deactivate_page(). This is needed
@@ -676,21 +692,20 @@
 						(page->buffers ? 2 : 1)) {
 				deactivate_page_nolock(page);
 				page_active = 0;
-			} else {
-				page_active = 1;
 			}
 		}
 		/*
 		 * If the page is still on the active list, move it
 		 * to the other end of the list. Otherwise it was
-		 * deactivated by age_page_down and we exit successfully.
+		 * deactivated by deactivate_page_nolock and we exit 
+		 * successfully.
 		 */
 		if (page_active || PageActive(page)) {
 			list_del(page_lru);
 			list_add(page_lru, &active_list);
 		} else {
 			ret = 1;
-			if (oneshot)
+			if (!background)
 				break;
 		}
 	}
@@ -804,13 +819,13 @@
 			schedule();
 		}
 
-		while (refill_inactive_scan(DEF_PRIORITY, 1)) {
+		while (refill_inactive_scan(DEF_PRIORITY, 0)) {
 			if (--count <= 0)
 				goto done;
 		}
 
 		/* If refill_inactive_scan failed, try to page stuff out.. */
-		swap_out(DEF_PRIORITY, gfp_mask);
+		swap_out(DEF_PRIORITY, 0);
 
 		if (--maxtry <= 0)
 				return 0;
@@ -914,7 +929,11 @@
 		 * every minute. This clears old referenced bits
 		 * and moves unused pages to the inactive list.
 		 */
-		refill_inactive_scan(DEF_PRIORITY, 0);
+		refill_inactive_scan(DEF_PRIORITY, 1);
+
+		/* Walk the pte's and age them. */
+		if (bg_page_aging)
+			swap_out(DEF_PRIORITY, 1);
 
 		/* Once a second, recalculate some VM stats. */
 		if (time_after(jiffies, recalc + HZ)) {
diff -Nur --exclude-from=exclude linux.orig/include/linux/swap.h linux/include/linux/swap.h
--- linux.orig/include/linux/swap.h	Thu Jan 11 11:13:38 2001
+++ linux/include/linux/swap.h	Thu Jan 11 14:54:57 2001
@@ -101,6 +101,7 @@
 extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
+extern int bg_page_aging;
 extern struct page * reclaim_page(zone_t *);
 extern wait_queue_head_t kswapd_wait;
 extern wait_queue_head_t kreclaimd_wait;


* Re: Subtle MM bug
  2001-01-11  3:30                             ` Marcelo Tosatti
@ 2001-01-11  9:42                               ` Stephen C. Tweedie
  2001-01-11 15:24                                 ` Marcelo Tosatti
  0 siblings, 1 reply; 34+ messages in thread
From: Stephen C. Tweedie @ 2001-01-11  9:42 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Linus Torvalds, Stephen C. Tweedie, David S. Miller,
	Rik van Riel, linux-mm

Hi,

On Thu, Jan 11, 2001 at 01:30:18AM -0200, Marcelo Tosatti wrote:
> 
> On Tue, 9 Jan 2001, Linus Torvalds wrote:
> 
> > So one "conditional aging" algorithm might just be something as simple as
> 
> I've done a very easy conditional aging patch (I don't think writing new
> functions to scan the active list and the pte's is necessary)

You still need to decay the bg_page_aging counter a little somewhere,
otherwise if you've been running a long-lived workload which keeps
most of memory recently activated, you'll build up such a large
counter that going idle will still age everything to zero.

This might be as simple as clamping the value of the counter to some
arbitrary maximum value such as num_physpages.
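
A minimal sketch of the clamping idea, using hypothetical helper names
and a plain `limit` parameter standing in for num_physpages: the
counter saturates instead of growing without bound, so a long busy
period cannot bank more aging work than the limit.

```c
/* Increment `counter`, but never let it exceed `limit`. */
long bump_clamped(long counter, long limit)
{
    return counter < limit ? counter + 1 : counter;
}

/* Counter value after `activations` increments, starting from zero. */
long after_activations(long activations, long limit)
{
    long counter = 0;

    for (long i = 0; i < activations; i++)
        counter = bump_clamped(counter, limit);
    return counter;
}
```

With a limit of 100, a thousand activations leave the counter at 100
rather than 1000, so going idle afterwards can trigger at most 100
aging steps.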

Cheers,
 Stephen

* Re: Subtle MM bug
  2001-01-10  0:23                           ` Linus Torvalds
  2001-01-10  0:12                             ` Marcelo Tosatti
@ 2001-01-11  3:30                             ` Marcelo Tosatti
  2001-01-11  9:42                               ` Stephen C. Tweedie
  1 sibling, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2001-01-11  3:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm

On Tue, 9 Jan 2001, Linus Torvalds wrote:

> So one "conditional aging" algorithm might just be something as simple as

I've done a very easy conditional aging patch (I don't think writing new
functions to scan the active list and the pte's is necessary)

kswapd does not obey the counter perfectly: if the counter reaches 0, a
swap_out() call that was started earlier (when the counter was > 0) keeps
running.

But since swap_out() only scans a small part of an mm, I don't think
the "non perfect" scanning is a big issue.

Comments? 


diff --exclude-from=/home/marcelo/exclude -Nur linux.orig/include/linux/swap.h linux/include/linux/swap.h
--- linux.orig/include/linux/swap.h	Thu Jan 11 00:27:46 2001
+++ linux/include/linux/swap.h	Thu Jan 11 02:45:04 2001
@@ -101,6 +101,8 @@
 extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
+extern int bg_page_aging;
+
 extern struct page * reclaim_page(zone_t *);
 extern wait_queue_head_t kswapd_wait;
 extern wait_queue_head_t kreclaimd_wait;
diff --exclude-from=/home/marcelo/exclude -Nur linux.orig/mm/swap.c linux/mm/swap.c
--- linux.orig/mm/swap.c	Thu Jan 11 00:27:45 2001
+++ linux/mm/swap.c	Thu Jan 11 02:12:01 2001
@@ -214,6 +214,8 @@
 	/* Make sure the page gets a fair chance at staying active. */
 	if (page->age < PAGE_AGE_START)
 		page->age = PAGE_AGE_START;
+
+	bg_page_aging++;
 }
 
 void activate_page(struct page * page)
diff --exclude-from=/home/marcelo/exclude -Nur linux.orig/mm/vmscan.c linux/mm/vmscan.c
--- linux.orig/mm/vmscan.c	Thu Jan 11 00:27:45 2001
+++ linux/mm/vmscan.c	Thu Jan 11 02:53:40 2001
@@ -24,6 +24,8 @@
 
 #include <asm/pgalloc.h>
 
+int bg_page_aging = 0;
+
 /*
  * The swap-out functions return 1 if they successfully
  * threw something out, and we got a free page. It returns
@@ -60,9 +62,12 @@
 		age_page_up(page);
 		goto out_failed;
 	}
-	if (!onlist)
+	if (!onlist) {
 		/* The page is still mapped, so it can't be freeable... */
+		if(bg_page_aging)
+			bg_page_aging--;
 		age_page_down_ageonly(page);
+	}
 
 	/*
 	 * If the page is in active use by us, or if the page
@@ -650,11 +655,12 @@
  * This function will scan a portion of the active list to find
  * unused pages, those pages will then be moved to the inactive list.
  */
-int refill_inactive_scan(unsigned int priority, int oneshot)
+int refill_inactive_scan(unsigned int priority, int background)
 {
 	struct list_head * page_lru;
 	struct page * page;
-	int maxscan, page_active = 0;
+	int maxscan, page_active;
+	int deactivate = 1;
 	int ret = 0;
 
 	/* Take the lock while messing with the list... */
@@ -674,8 +680,21 @@
 		/* Do aging on the pages. */
 		if (PageTestandClearReferenced(page)) {
 			age_page_up_nolock(page);
-			page_active = 1;
-		} else {
+		} else if (deactivate) {
+
+			/* 
+			 * We're aging down a page. 
+			 * Decrement the counter if it has not reached zero
+			 * yet. If it reached zero, and we are doing background 
+			 * scan and the counter reached 0, stop deactivating pages.
+			 */
+			if (bg_page_aging)
+				bg_page_aging--;
+			else if (background) {
+				deactivate = 0;	
+				continue;
+			}
+
 			age_page_down_ageonly(page);
 			/*
 			 * Since we don't hold a reference on the page
@@ -691,8 +710,6 @@
 						(page->buffers ? 2 : 1)) {
 				deactivate_page_nolock(page);
 				page_active = 0;
-			} else {
-				page_active = 1;
 			}
 		}
 		/*
@@ -705,7 +722,8 @@
 			list_add(page_lru, &active_list);
 		} else {
 			ret = 1;
-			if (oneshot)
+			/* Stop scanning if we're not doing background scan */
+			if (!background)
 				break;
 		}
 	}
@@ -818,7 +836,7 @@
 			schedule();
 		}
 
-		while (refill_inactive_scan(priority, 1)) {
+		while (refill_inactive_scan(priority, 0)) {
 			if (--count <= 0)
 				goto done;
 		}
@@ -921,13 +939,19 @@
 		if (inactive_shortage() || free_shortage()) 
 			do_try_to_free_pages(GFP_KSWAPD, 0);
 
+
+		/* Do some (very minimal) background scanning. */
+
 		/*
-		 * Do some (very minimal) background scanning. This
-		 * will scan all pages on the active list once
+		 * This will scan all pages on the active list once
 		 * every minute. This clears old referenced bits
 		 * and moves unused pages to the inactive list.
 		 */
-		refill_inactive_scan(6, 0);
+		refill_inactive_scan(6, 1);
+	
+		/* This will scan the pte's. */
+		if(bg_page_aging)
+			swap_out(6, 0);
 
 		/* Once a second, recalculate some VM stats. */
 		if (time_after(jiffies, recalc + HZ)) {


* Re: Subtle MM bug
  2001-01-10  0:12                             ` Marcelo Tosatti
@ 2001-01-10 11:29                               ` Stephen C. Tweedie
  0 siblings, 0 replies; 34+ messages in thread
From: Stephen C. Tweedie @ 2001-01-10 11:29 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Linus Torvalds, Stephen C. Tweedie, David S. Miller,
	Rik van Riel, linux-mm

Hi,

On Tue, Jan 09, 2001 at 10:12:45PM -0200, Marcelo Tosatti wrote:
> On Tue, 9 Jan 2001, Linus Torvalds wrote:
> 
> > Hmm.. Fair enough. However, if you don't have VM pressure, you're also not
> > going to look at the page tables, so you are not going to get any use
> > information from them, either.
> 
> Are you sure that potentially unmapping pte's and swapping out its pages
> in the background scanning is ok? 

Why not?  We're only going to be aging things slowly in the absence of
memory pressure, and if a page hasn't been used between two
widely-separated passes then inactivating the page isn't likely to
have much impact: it's only a soft-fault to get it back.

> > The aging should really be done at roughly the same rate as the "mark
> > active", wouldn't you say? If you mark things active without aging, pages
> > end up all being marked as "new". And if you age without marking things
> > active, they all end up being "old". Neither is good. What you really want
> > to have is aging that happens at the same rate as reference marking.
> > So one "conditional aging" algorithm might just be something as simple as
> > 
> >  - every time you mark something referenced, you increment a counter
> >  - every time you want to age something, you check whether the counter is
> >    positive first (and decrement it if you age something)
> 
> Seems to be a nice solution.

This is _exactly_ what I proposed to Rik last time we talked about
it, and it seems to be the right balance between maintaining uptodate
information when data is being accessed, and maintaining old state
when it isn't.  You need to decay the counter appropriately, though.
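
One way to picture the decay is the sketch below. It is only an illustration under assumptions: the names (aging_budget, AGING_BUDGET_MAX, decay_aging_budget() and friends), the cap and the choice of halving once per tick are all made up here, and a real kernel version would need atomics or a lock.

```c
#include <assert.h>

/* Illustrative cap so one burst of references cannot bank unlimited aging. */
#define AGING_BUDGET_MAX 1024UL

/* Hypothetical budget of aging steps earned by reference marking. */
static unsigned long aging_budget;

/* Call whenever a page is marked referenced. */
void mark_referenced_event(void)
{
	if (aging_budget < AGING_BUDGET_MAX)
		aging_budget++;
}

/* Returns 1 and spends one unit if aging is allowed, otherwise 0. */
int may_age_one(void)
{
	if (aging_budget == 0)
		return 0;
	aging_budget--;
	return 1;
}

/* Call from a periodic tick: halve the credit that has gone stale. */
void decay_aging_budget(void)
{
	assert(aging_budget <= AGING_BUDGET_MAX);
	aging_budget >>= 1;
}

/* Read-only accessor, used here only to inspect the state. */
unsigned long aging_budget_now(void)
{
	return aging_budget;
}
```

With this shape, references seen a long time ago stop authorizing aging after a few ticks, so an idle period does not flatten every page to the same age.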

--Stephen

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-09 22:21                         ` Marcelo Tosatti
@ 2001-01-10  0:23                           ` Linus Torvalds
  2001-01-10  0:12                             ` Marcelo Tosatti
  2001-01-11  3:30                             ` Marcelo Tosatti
  0 siblings, 2 replies; 34+ messages in thread
From: Linus Torvalds @ 2001-01-10  0:23 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm


On Tue, 9 Jan 2001, Marcelo Tosatti wrote:
> > 
> > No, I'm saying that "the background scanning" should not do the page
> > aging.
> 
> If you age pages only when there is memory pressure/low memory, you'll
> have less knowledge about which pages were unused/used pages over time.

Hmm.. Fair enough. However, if you don't have VM pressure, you're also not
going to look at the page tables, so you are not going to get any use
information from them, either. 

The aging should really be done at roughly the same rate as the "mark
active", wouldn't you say? If you mark things active without aging, pages
end up all being marked as "new". And if you age without marking things
active, they all end up being "old". Neither is good. What you really want
to have is aging that happens at the same rate as reference marking.

So one "conditional aging" algorithm might just be something as simple as

 - every time you mark something referenced, you increment a counter
 - every time you want to age something, you check whether the counter is
   positive first (and decrement it if you age something)
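
The two rules above fit in a few lines of C. This is a minimal sketch, not code from any kernel tree: aging_budget, note_referenced() and try_age() are hypothetical names, and locking is ignored.

```c
#include <assert.h>

/* Hypothetical counter: how many aging steps have been "earned". */
static long aging_budget;

/* Rule 1: every time something is marked referenced, bump the counter. */
void note_referenced(void)
{
	aging_budget++;
}

/*
 * Rule 2: before aging something, check that the counter is positive.
 * Returns 1 (and decrements the counter) if aging may go ahead,
 * 0 if it should be skipped.
 */
int try_age(void)
{
	assert(aging_budget >= 0);
	if (aging_budget <= 0)
		return 0;
	aging_budget--;
	return 1;
}
```

On an idle machine nothing calls note_referenced(), so try_age() keeps returning 0 and the existing page ages are left alone.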

		Linus


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-10  0:23                           ` Linus Torvalds
@ 2001-01-10  0:12                             ` Marcelo Tosatti
  2001-01-10 11:29                               ` Stephen C. Tweedie
  2001-01-11  3:30                             ` Marcelo Tosatti
  1 sibling, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2001-01-10  0:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm

On Tue, 9 Jan 2001, Linus Torvalds wrote:

> Hmm.. Fair enough. However, if you don't have VM pressure, you're also not
> going to look at the page tables, so you are not going to get any use
> information from them, either.

Are you sure that potentially unmapping pte's and swapping out its pages
in the background scanning is ok? 

I mean, what kind of swap behaviour will we have if we do it?

> The aging should really be done at roughly the same rate as the "mark
> active", wouldn't you say? If you mark things active without aging, pages
> end up all being marked as "new". And if you age without marking things
> active, they all end up being "old". Neither is good. What you really want
> to have is aging that happens at the same rate as reference marking.
> So one "conditional aging" algorithm might just be something as simple as
> 
>  - every time you mark something referenced, you increment a counter
>  - every time you want to age something, you check whether the counter is
>    positive first (and decrement it if you age something)

Seems to be a nice solution.

I'll send you the previously promised patch and then I'll send the
background scanning one as soon as we (or I?) figure out the previous
question about background pte scanning.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-09 21:33                     ` Marcelo Tosatti
@ 2001-01-09 23:58                       ` Linus Torvalds
  2001-01-09 22:21                         ` Marcelo Tosatti
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2001-01-09 23:58 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm


On Tue, 9 Jan 2001, Marcelo Tosatti wrote:
> 
> > > The second problem is that background scanning is being done
> > > unconditionally, and it should not. You end up getting all pages with the
> > > same age if the system is idle. Look at this example (2.4.1-pre1):
> > 
> > I agree. However, I think that we do want to do some background scanning
> > to push out dirty pages in the background, kind of like bdflush. It just
> > shouldn't age the pages (and thus not move them to the inactive list).
> 
> Actually it must age the pages, but aging should not be unconditional. 

No, I'm saying that "the background scanning" should not do the page
aging.

Obviously "refill_inactive()" needs to do the page aging. I'm just not at
all convinced that "background scanning" == "refill_inactive()". 

		Linus


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-09 20:33                 ` Marcelo Tosatti
@ 2001-01-09 22:44                   ` Linus Torvalds
  2001-01-09 21:33                     ` Marcelo Tosatti
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2001-01-09 22:44 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm


On Tue, 9 Jan 2001, Marcelo Tosatti wrote:
> 
> The "while (!inactive_shortage())" should be "while (inactive_shortage())"
> as Benjamin noted on lk.

Yes. Also, it does need something to make sure that it doesn't end up
being an endless loop. 

Now, the oom_killer() thing should make sure it's not endless, but the
fact is that kswapd() (who calls the oom-killer) also calls the very same
do_try_to_free_pages(), so we really do have to make sure that it doesn't
loop forever trying to find a page. 

The priority countdown used to handle this, and while I disagree with the
_other_ uses of the priority (it used to make the freeing action
"chunkier" by walking bigger pieces of the VM or the active lists), I
think we need to rename "priority" to "maxtry", and use that to give up
gracefully when we truly do run out of memory.
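
A bounded loop of that kind could look roughly like the sketch below. Everything in it is an assumption made for illustration: refill_bounded(), the function-pointer types and the sim_* helpers are invented names that only simulate a shortage; they are not the real inactive_shortage() or refill code.

```c
#include <assert.h>

/* Hypothetical stand-ins for the shortage test and one reclaim pass. */
typedef int (*shortage_fn)(void);
typedef void (*reclaim_fn)(void);

/*
 * Keep reclaiming while there is a shortage, but never for more than
 * maxtry passes. Returns 1 if the shortage went away, 0 if we gave up,
 * so the return value stays meaningful to the caller.
 */
int refill_bounded(shortage_fn shortage, reclaim_fn reclaim_once, int maxtry)
{
	assert(maxtry > 0);
	do {
		reclaim_once();
		if (!shortage())
			return 1;
	} while (--maxtry > 0);
	return 0;	/* out of tries: the caller decides what to do next */
}

/* Tiny simulation used to exercise the loop. */
static int sim_short_pages;

void sim_set(int pages)
{
	sim_short_pages = pages;
}

int sim_shortage(void)
{
	return sim_short_pages > 0;
}

/* A pass that makes progress: one page per call. */
void sim_reclaim(void)
{
	if (sim_short_pages > 0)
		sim_short_pages--;
}

/* A pass that never makes progress. */
void sim_stuck(void)
{
}
```

The point of the design is that termination no longer depends on the oom killer or on refill_inactive_scan() always finding something: the loop ends after maxtry passes no matter what.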

(I _suspect_ that the oom killer would be invoked before this happens in
practice, and refill_inactive_scan() would find _something_ to make
slight progress on all the time, but the fact is that we shouldn't have
those kinds of assumptions in the VM code).

This would make the return value (that you removed in this patch) still a
valid thing. So I don't think it should go away.

> The second problem is that background scanning is being done
> unconditionally, and it should not. You end up getting all pages with the
> same age if the system is idle. Look at this example (2.4.1-pre1):

I agree. However, I think that we do want to do some background scanning
to push out dirty pages in the background, kind of like bdflush. It just
shouldn't age the pages (and thus not move them to the inactive list).

		Linus


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-09 23:58                       ` Linus Torvalds
@ 2001-01-09 22:21                         ` Marcelo Tosatti
  2001-01-10  0:23                           ` Linus Torvalds
  0 siblings, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2001-01-09 22:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm

On Tue, 9 Jan 2001, Linus Torvalds wrote:

> 
> On Tue, 9 Jan 2001, Marcelo Tosatti wrote:
> > 
> > > > The second problem is that background scanning is being done
> > > > unconditionally, and it should not. You end up getting all pages with the
> > > > same age if the system is idle. Look at this example (2.4.1-pre1):
> > > 
> > > I agree. However, I think that we do want to do some background scanning
> > > to push out dirty pages in the background, kind of like bdflush. It just
> > > shouldn't age the pages (and thus not move them to the inactive list).
> > 
> > Actually it must age the pages, but aging should not be unconditional. 
> 
> No, I'm saying that "the background scanning" should not do the page
> aging.

If you age pages only when there is memory pressure/low memory, you'll
have less knowledge about which pages were unused/used pages over time.

> Obviously "refill_inactive()" needs to do the page aging. I'm just not at
> all convinced that "background scanning" == "refill_inactive()". 

This is the background scanning I am referring to (in kswapd):

                /*
                 * Do some (very minimal) background scanning. This
                 * will scan all pages on the active list once
                 * every minute. This clears old referenced bits
                 * and moves unused pages to the inactive list.
                 */
                refill_inactive_scan(6, 0);





^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-09 22:44                   ` Linus Torvalds
@ 2001-01-09 21:33                     ` Marcelo Tosatti
  2001-01-09 23:58                       ` Linus Torvalds
  0 siblings, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2001-01-09 21:33 UTC (permalink / raw)
  To: Stephen C. Tweedie, Linus Torvalds
  Cc: David S. Miller, Rik van Riel, linux-mm

On Tue, 9 Jan 2001, Linus Torvalds wrote:

> 
> 
> On Tue, 9 Jan 2001, Marcelo Tosatti wrote:
> > 
> > The "while (!inactive_shortage())" should be "while (inactive_shortage())"
> > as Benjamin noted on lk.
> 
> Yes. Also, it does need something to make sure that it doesn't end up
> being an endless loop. 

Ok, I'll send another patch which fixes this later today.

> > The second problem is that background scanning is being done
> > unconditionally, and it should not. You end up getting all pages with the
> > same age if the system is idle. Look at this example (2.4.1-pre1):
> 
> I agree. However, I think that we do want to do some background scanning
> to push out dirty pages in the background, kind of like bdflush. It just
> shouldn't age the pages (and thus not move them to the inactive list).

Actually it must age the pages, but aging should not be unconditional. 

Stephen has some thoughts on this. Stephen? 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-09  3:12               ` Linus Torvalds
@ 2001-01-09 20:33                 ` Marcelo Tosatti
  2001-01-09 22:44                   ` Linus Torvalds
  2001-01-17  4:54                 ` Rik van Riel
  1 sibling, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2001-01-09 20:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm

On Mon, 8 Jan 2001, Linus Torvalds wrote:

> Try out 2.4.1-pre1 in testing.

The "while (!inactive_shortage())" should be "while (inactive_shortage())"
as Benjamin noted on lk.

The second problem is that background scanning is being done
unconditionally, and it should not. You end up getting all pages with the
same age if the system is idle. Look at this example (2.4.1-pre1):

MemTotal:       900148 kB
MemFree:        145060 kB
Cached:         725624 kB
Active:           3972 kB
Inact_dirty:    722940 kB
Inact_clean:         0 kB
Inact_target:      188 kB

> That kmem_cache_reap() thing still looks completely bogus, but I didn't
> touch it. It looks _so_ bogus that there must be some reason for doing it
> that ass-backwards way. Why should anybody do a kmem_cache_reap()
> when we're _not_ short of free pages? That code just makes me very
> confused, so I'm not touching it.

This patch removes kmem_cache_reap() from refill_inactive() and moves it
to inside the free_shortage() check in do_try_to_free_pages().

It also changes the "while (!inactive_shortage())" mistake.

Comments?

diff -Nur linux.orig/include/linux/fs.h linux/include/linux/fs.h
--- linux.orig/include/linux/fs.h	Tue Jan  9 19:32:51 2001
+++ linux/include/linux/fs.h	Tue Jan  9 20:07:32 2001
@@ -985,7 +985,7 @@
 
 extern int fs_may_remount_ro(struct super_block *);
 
-extern int try_to_free_buffers(struct page *, int);
+extern void try_to_free_buffers(struct page *, int);
 extern void refile_buffer(struct buffer_head * buf);
 
 #define BUF_CLEAN	0
diff -Nur linux.orig/include/linux/swap.h linux/include/linux/swap.h
--- linux.orig/include/linux/swap.h	Tue Jan  9 19:32:51 2001
+++ linux/include/linux/swap.h	Tue Jan  9 20:07:38 2001
@@ -108,7 +108,7 @@
 extern int free_shortage(void);
 extern int inactive_shortage(void);
 extern void wakeup_kswapd(int);
-extern int try_to_free_pages(unsigned int gfp_mask);
+extern void try_to_free_pages(unsigned int gfp_mask);
 
 /* linux/mm/page_io.c */
 extern void rw_swap_page(int, struct page *, int);
diff -Nur linux.orig/mm/vmscan.c linux/mm/vmscan.c
--- linux.orig/mm/vmscan.c	Tue Jan  9 19:35:41 2001
+++ linux/mm/vmscan.c	Tue Jan  9 20:06:01 2001
@@ -825,9 +825,6 @@
 		count = (1 << page_cluster);
 	start_count = count;
 
-	/* Always trim SLAB caches when memory gets low. */
-	kmem_cache_reap(gfp_mask);
-
 	priority = 6;
 	do {
 		if (current->need_resched) {
@@ -842,16 +839,14 @@
 
 		/* If refill_inactive_scan failed, try to page stuff out.. */
 		swap_out(priority, gfp_mask);
-	} while (!inactive_shortage());
+	} while (inactive_shortage());
 
 done:
 	return (count < start_count);
 }
 
-static int do_try_to_free_pages(unsigned int gfp_mask, int user)
+static void do_try_to_free_pages(unsigned int gfp_mask, int user)
 {
-	int ret = 0;
-
 	/*
 	 * If we're low on free pages, move pages from the
 	 * inactive_dirty list to the inactive_clean list.
@@ -862,32 +857,24 @@
 	 */
 	if (free_shortage() || nr_inactive_dirty_pages > nr_free_pages() +
 			nr_inactive_clean_pages())
-		ret += page_launder(gfp_mask, user);
+		page_launder(gfp_mask, user);
 
 	/*
 	 * If needed, we move pages from the active list
 	 * to the inactive list.
 	 */
 	if (inactive_shortage())
-		ret += refill_inactive(gfp_mask, user);
+		refill_inactive(gfp_mask, user);
 
 	/* 	
-	 * Delete pages from the inode and dentry cache 
-	 * if memory is low. 
+	 * Delete pages from the inode and dentry cache and
+	 * reclaim unused slab cache if memory is low.
 	 */
 	if (free_shortage()) {
 		shrink_dcache_memory(6, gfp_mask);
 		shrink_icache_memory(6, gfp_mask);
-	} else { 
-
-		/*
-		 * Reclaim unused slab cache memory.
-		 */
 		kmem_cache_reap(gfp_mask);
-		ret = 1;
 	}
-
-	return ret;
 }
 
 DECLARE_WAIT_QUEUE_HEAD(kswapd_wait);
@@ -1029,17 +1016,13 @@
  * memory but are unable to sleep on kswapd because
  * they might be holding some IO locks ...
  */
-int try_to_free_pages(unsigned int gfp_mask)
+void try_to_free_pages(unsigned int gfp_mask)
 {
-	int ret = 1;
-
 	if (gfp_mask & __GFP_WAIT) {
 		current->flags |= PF_MEMALLOC;
-		ret = do_try_to_free_pages(gfp_mask, 1);
+		do_try_to_free_pages(gfp_mask, 1);
 		current->flags &= ~PF_MEMALLOC;
 	}
-
-	return ret;
 }
 
 DECLARE_WAIT_QUEUE_HEAD(kreclaimd_wait);



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-08 23:49             ` Marcelo Tosatti
@ 2001-01-09  3:12               ` Linus Torvalds
  2001-01-09 20:33                 ` Marcelo Tosatti
  2001-01-17  4:54                 ` Rik van Riel
  0 siblings, 2 replies; 34+ messages in thread
From: Linus Torvalds @ 2001-01-09  3:12 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm


On Mon, 8 Jan 2001, Marcelo Tosatti wrote:
> 
> Are you lazy enough to ask me to regenerate the patch, or can you do it
> yourself? :) 

Try out 2.4.1-pre1 in testing.

It does three things: 

 - gets rid of the complex "best mm" logic and replaces it with the
   round-robin thing as discussed. I have this suspicion that we
   eventually want to make this based on fault rates etc in an effort to
   more aggressively control big RSS processes, but I also suspect that
   this is tied in to the RSS limiting patches, so this will simmer
   for a while.

 - it cleans up the unnecessary dcache/icache shrink that is already done
   more properly elsewhere.

 - it cleans up and simplifies the MM "priority" thing. In fact, right now
   only one priority is ever used, and I suspect strongly that all the
   "made_progress" logic was really there because that's how we want to do
   it (and just having one priority made "made_progress" unnecessary).

(It also has some non-VM patches, of course, but for this discussion the
VM ones are the only interesting ones).

As far as I can tell, the non-priority version is every bit as good as the
one that counts down priorities, and if nobody can argue against it I'll
just remove the priority argument altogether at some point. Right now it
still exists, it just doesn't change.

That kmem_cache_reap() thing still looks completely bogus, but I didn't
touch it. It looks _so_ bogus that there must be some reason for doing it
that ass-backwards way. Why should anybody do a kmem_cache_reap()
when we're _not_ short of free pages? That code just makes me very
confused, so I'm not touching it.

		Linus


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-08 21:52         ` Marcelo Tosatti
@ 2001-01-09  0:28           ` Linus Torvalds
  2001-01-08 23:49             ` Marcelo Tosatti
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2001-01-09  0:28 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm


On Mon, 8 Jan 2001, Marcelo Tosatti wrote:
> 
> I've removed the free_shortage() of refill_inactive() in the patch.
> 
> Comments are welcome.

One comment: why does refill_inactive() do the shrink_dcache_memory() at
all? Why not just remove that?

do_try_to_free_pages() will do that, and that's where it makes more sense
(shrinking the dcache/icache has absolutely nothing to do with the
inactive list).

Historical code?

Also, we should probably remove the "made_progress" and "count--" from the
swap_out() case, as swap_out() hasn't actually caused pages to be free'd
in a long time.. 

		Linus


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-09  0:28           ` Linus Torvalds
@ 2001-01-08 23:49             ` Marcelo Tosatti
  2001-01-09  3:12               ` Linus Torvalds
  0 siblings, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2001-01-08 23:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm

On Mon, 8 Jan 2001, Linus Torvalds wrote:

> 
> On Mon, 8 Jan 2001, Marcelo Tosatti wrote:
> > 
> > I've removed the free_shortage() of refill_inactive() in the patch.
> > 
> > Comments are welcome.
> 
> One comment: why does refill_inactive() do the shrink_dcache_memory() at
> all? Why not just remove that?
> 
> do_try_to_free_pages() will do that, and that's where it makes more sense
> (shrinking the dcache/icache has absolutely nothing to do with the
> inactive list).

Right. kmem_cache_reap() should not be there either.

> Also, we should probably remove the "made_progress" and "count--" from the
> swap_out() case, as swap_out() hasn't actually caused pages to be free'd
> in a long time.. 

Indeed. 

Are you lazy enough to ask me to regenerate the patch, or can you do it
yourself? :) 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-08 18:10       ` Stephen C. Tweedie
@ 2001-01-08 21:52         ` Marcelo Tosatti
  2001-01-09  0:28           ` Linus Torvalds
  0 siblings, 1 reply; 34+ messages in thread
From: Marcelo Tosatti @ 2001-01-08 21:52 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, David S. Miller, Rik van Riel, linux-mm

On Mon, 8 Jan 2001, Stephen C. Tweedie wrote:

> > _really_ well on many loads, but this one we do badly on. And from what
> > I've been able to see so far, it's because we're just too damn good at
> > waiting on page_launder() and doing refill_inactive_scan().
> 
> do_try_to_free_pages() is trying to
> 
> 	/*
> 	 * If needed, we move pages from the active list
> 	 * to the inactive list. We also "eat" pages from
> 	 * the inode and dentry cache whenever we do this.
> 	 */
> 	if (free_shortage() || inactive_shortage()) {
> 		shrink_dcache_memory(6, gfp_mask);
> 		shrink_icache_memory(6, gfp_mask);
> 		ret += refill_inactive(gfp_mask, user);
> 	} else {
> 
> So we're refilling the inactive list regardless of its current size
> whenever free_shortage() is true.  In the situation you describe,
> there's no point refilling the inactive list too far beyond the
> ability of the swapper to launder it, regardless of whether
> free_shortage() is set.

Agreed.

After some fights, Rik and I agreed on doing a per-zone inactive shortage
check in inactive_shortage().

This allows us to check _only_ for inactive_shortage() before calling
refill_inactive().
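
As a standalone illustration of the per-zone check (the real change is in the patch further down), here is a simplified sketch. struct zone_sketch and per_zone_inactive_shortage() are made-up names for this example; only the field names are borrowed from the patch, and the pgdat walk is replaced by a plain array.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for zone_t; field names follow the patch. */
struct zone_sketch {
	long pages_high;
	long inactive_dirty_pages;
	long inactive_clean_pages;
	long free_pages;
};

/*
 * Sum the per-zone deficits. A zone with a surplus contributes nothing,
 * so it cannot hide a shortage in another zone.
 */
long per_zone_inactive_shortage(const struct zone_sketch *zones, size_t nr)
{
	long shortage = 0;
	size_t i;

	assert(zones != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		long zone_shortage = zones[i].pages_high
			- zones[i].inactive_dirty_pages
			- zones[i].inactive_clean_pages
			- zones[i].free_pages;

		if (zone_shortage > 0)
			shortage += zone_shortage;
	}
	return shortage;
}
```

For example, a zone with pages_high = 100 and 10 + 20 + 30 inactive and free pages is 40 pages short, while a second zone with a large surplus adds zero, so the total is still 40.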

> 
> refill_inactive contains exactly the opposite logic: it breaks out if
> 
> 		/*
> 		 * If we either have enough free memory, or if
> 		 * page_launder() will be able to make enough
> 		 * free memory, then stop.
> 		 */
> 		if (!inactive_shortage() || !free_shortage())
> 			goto done;
> 
> but that still means that we're doing unnecessary inactive list
> refilling whenever free_shortage() is true: this test only occurs
> after we've tried at least one swap_out().  We're calling
> refill_inactive if either condition is true, but we're staying inside
> it only if both conditions are true.
> 
> Shouldn't we really just be making the refill_inactive() here depend
> on inactive_shortage() alone, not free_shortage()?  By refilling the
> inactive list too aggressively we actually end up discarding aging
> information which might be of use to us.

Yes.

I've removed the free_shortage() of refill_inactive() in the patch.

Comments are welcome.


--- linux.orig/mm/vmscan.c	Thu Jan  4 02:45:26 2001
+++ linux/mm/vmscan.c	Mon Jan  8 20:43:59 2001
@@ -808,6 +808,9 @@
 int inactive_shortage(void)
 {
 	int shortage = 0;
+	pg_data_t *pgdat = pgdat_list;
+
+	/* Is the inactive dirty list too small? */
 
 	shortage += freepages.high;
 	shortage += inactive_target;
@@ -818,7 +821,27 @@
 	if (shortage > 0)
 		return shortage;
 
-	return 0;
+	/* If not, do we have enough per-zone pages on the inactive list? */
+
+	shortage = 0;
+
+	do {
+		int i;
+		for(i = 0; i < MAX_NR_ZONES; i++) {
+			int zone_shortage;
+			zone_t *zone = pgdat->node_zones+ i;
+
+			zone_shortage = zone->pages_high;
+			zone_shortage -= zone->inactive_dirty_pages;
+			zone_shortage -= zone->inactive_clean_pages;
+			zone_shortage -= zone->free_pages;
+			if (zone_shortage > 0)
+				shortage += zone_shortage;
+		}
+		pgdat = pgdat->node_next;
+	} while (pgdat);
+
+	return shortage;
 }
 
 /*
@@ -861,12 +884,13 @@
 		}
 
 		/*
-		 * don't be too light against the d/i cache since
-	   	 * refill_inactive() almost never fail when there's
-	   	 * really plenty of memory free. 
+		 * Only free memory from the i/d caches if we
+		 * are low on memory.
 		 */
-		shrink_dcache_memory(priority, gfp_mask);
-		shrink_icache_memory(priority, gfp_mask);
+		if(free_shortage()) {
+			shrink_dcache_memory(priority, gfp_mask);
+			shrink_icache_memory(priority, gfp_mask);
+		}
 
 		/*
 		 * Then, try to page stuff out..
@@ -878,11 +902,10 @@
 		}
 
 		/*
-		 * If we either have enough free memory, or if
-		 * page_launder() will be able to make enough
+		 * If page_launder() will be able to make enough
 		 * free memory, then stop.
 		 */
-		if (!inactive_shortage() || !free_shortage())
+		if (!inactive_shortage())
 			goto done;
 
 		/*
@@ -922,14 +945,20 @@
 
 	/*
 	 * If needed, we move pages from the active list
-	 * to the inactive list. We also "eat" pages from
-	 * the inode and dentry cache whenever we do this.
+	 * to the inactive list.
+	 */
+	if (inactive_shortage())
+		ret += refill_inactive(gfp_mask, user);
+
+	/* 	
+	 * Delete pages from the inode and dentry cache 
+	 * if memory is low. 
 	 */
-	if (free_shortage() || inactive_shortage()) {
+	if (free_shortage()) {
 		shrink_dcache_memory(6, gfp_mask);
 		shrink_icache_memory(6, gfp_mask);
-		ret += refill_inactive(gfp_mask, user);
-	} else {
+	} else { 
+
 		/*
 		 * Reclaim unused slab cache memory.
 		 */




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-08 18:21       ` Rik van Riel
@ 2001-01-08 18:38         ` Linus Torvalds
  0 siblings, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 2001-01-08 18:38 UTC (permalink / raw)
  To: Rik van Riel; +Cc: David S. Miller, Marcelo Tosatti, linux-mm


On Mon, 8 Jan 2001, Rik van Riel wrote:
> 
> > That _is_ the problem the above will fix. Don't read
> > "page_launder()" there: it's more meant to be "this is the old
> > code that does page_launder() etc.."
> > 
> > Trust me. Try my code. It will work.
> 
> Except for the small detail that pages inside the processes
> are often not on the active list  ;)

Yes, you're right - we don't have a good counter to test right now.

That's actually fairly nasty. We can't even use the "reverse" test,
because while we can make it do something like

	if (nr_inactive + nr_inactive_dirty < X %)

that won't pick up on things like the dentry and inode caches, so that
would be wrong too. 

We would really need to count the number of mapped anonymous pages to get
this right. Damn. That makes it harder than I thought.

(Hmm.. Increment counter in "do_anonymous_page()" and "do_wp_page()".
Decrement in "add_to_swap_cache()". Decrement in "free_pte()" for the
!page->mapping case. Test. Find the places I forgot. Maybe it's not that
bad, after all).
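
A rough sketch of that bookkeeping, with every name invented for illustration (nr_mapped_anon, anon_page_mapped(), anon_page_unaccounted(), anon_exceeds_percent()). The comments say where each hook would be called from according to the list above; the threshold test at the end is only an assumption about how such a counter might be used.

```c
#include <assert.h>

/* Hypothetical global count of mapped anonymous pages. */
static long nr_mapped_anon;

/* Would be called from do_anonymous_page() and do_wp_page(). */
void anon_page_mapped(void)
{
	nr_mapped_anon++;
}

/*
 * Would be called from add_to_swap_cache(), and from free_pte()
 * in the !page->mapping case.
 */
void anon_page_unaccounted(void)
{
	assert(nr_mapped_anon > 0);
	nr_mapped_anon--;
}

/* Is more than pct percent of total_pages mapped anonymous memory? */
int anon_exceeds_percent(long total_pages, int pct)
{
	return nr_mapped_anon * 100 > total_pages * (long)pct;
}
```

Unlike a test on the inactive lists, this counts only the pages the VM scan would have to unmap, so dentry and inode cache growth does not skew it.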

		Linus


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-08 17:50     ` Linus Torvalds
@ 2001-01-08 18:21       ` Rik van Riel
  2001-01-08 18:38         ` Linus Torvalds
  0 siblings, 1 reply; 34+ messages in thread
From: Rik van Riel @ 2001-01-08 18:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David S. Miller, Marcelo Tosatti, linux-mm

On Mon, 8 Jan 2001, Linus Torvalds wrote:
> On Mon, 8 Jan 2001, Rik van Riel wrote:
> > On Sun, 7 Jan 2001, Linus Torvalds wrote:
> > 
> > > 	/*
> > > 	 * Too many active pages? That implies that we don't have enough
> > > 	 * of a working set for page_launder() to do a good job. Start by
> > > 	 * walking the VM space..
> > > 	 */
> > > 	if ((nr_active_pages >> 1) > total_pages)
> > > 		swap_out();

> That _is_ the problem the above will fix. Don't read
> "page_launder()" there: it's more meant to be "this is the old
> code that does page_launder() etc.."
> 
> Trust me. Try my code. It will work.

Except for the small detail that pages inside the processes
are often not on the active list  ;)

But I agree with your idea that we really should make sure
we have enough pages available to choose from when swapping
stuff out.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-08 17:29     ` Linus Torvalds
@ 2001-01-08 18:10       ` Stephen C. Tweedie
  2001-01-08 21:52         ` Marcelo Tosatti
  0 siblings, 1 reply; 34+ messages in thread
From: Stephen C. Tweedie @ 2001-01-08 18:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel,
	Marcelo Tosatti, linux-mm

On Mon, Jan 08, 2001 at 09:29:15AM -0800, Linus Torvalds wrote:
> On Mon, 8 Jan 2001, Stephen C. Tweedie wrote:

> If you have a well-behaving application that doesn't even have memory
> pressure, but fills up >50% of memory in its VM, nothing will actually
> happen in the steady state. It can have 99% of available memory, and not a
> single soft page fault.

Agreed, but that's not how I read your statement about scanning the VM
regularly.  The problem happens if you are working happily with enough
free memory and you suddenly need a large amount of allocation: having
some relatively uptodate page age information may give you a _much_
better idea of what to page out.

Rik was going to experiment with this --- Rik, do you have any hard
numbers for the benefit of maintaining a background page aging task?

> But think about what happens if you now start up another application? And
> think about what SHOULD happen. The 50% rule is perfectly fine: 

Right, I interpreted your 50% as a steady-state limit.

> Stephen: have you tried the behaviour of a working set that is dirty in
> the VM's and slightly larger than available ram? Not pretty. 

Yes, and this is something that Marcelo's swap clustering code ought
to be ideal for.

> _really_ well on many loads, but this one we do badly on. And from what
> I've been able to see so far, it's because we're just too damn good at
> waiting on page_launder() and doing refill_inactive_scan().

do_try_to_free_pages() is trying to

	/*
	 * If needed, we move pages from the active list
	 * to the inactive list. We also "eat" pages from
	 * the inode and dentry cache whenever we do this.
	 */
	if (free_shortage() || inactive_shortage()) {
		shrink_dcache_memory(6, gfp_mask);
		shrink_icache_memory(6, gfp_mask);
		ret += refill_inactive(gfp_mask, user);
	} else {

So we're refilling the inactive list regardless of its current size
whenever free_shortage() is true.  In the situation you describe,
there's no point refilling the inactive list too far beyond the
ability of the swapper to launder it, regardless of whether
free_shortage() is set.

refill_inactive contains exactly the opposite logic: it breaks out if

		/*
		 * If we either have enough free memory, or if
		 * page_launder() will be able to make enough
		 * free memory, then stop.
		 */
		if (!inactive_shortage() || !free_shortage())
			goto done;

but that still means that we're doing unnecessary inactive list
refilling whenever free_shortage() is true: this test only occurs
after we've tried at least one swap_out().  We're calling
refill_inactive if either condition is true, but we're staying inside
it only if both conditions are true.

Shouldn't we really just be making the refill_inactive() here depend
on inactive_shortage() alone, not free_shortage()?  By refilling the
inactive list too aggressively we actually end up discarding aging
information which might be of use to us.
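
The two tests quoted above can be written side by side as predicates to
make the asymmetry concrete. This is a minimal sketch, not kernel code:
the function names are hypothetical, and the two flags stand in for the
results of free_shortage() and inactive_shortage(). The last predicate
models the change suggested here, gating the refill on
inactive_shortage() alone.

```c
#include <assert.h>
#include <stdbool.h>

/* Entry test in do_try_to_free_pages(): refill if EITHER shortage is reported. */
static bool enters_refill(bool free_short, bool inactive_short)
{
	return free_short || inactive_short;
}

/* Loop test inside refill_inactive(): keep going only while BOTH shortages hold. */
static bool continues_refill(bool free_short, bool inactive_short)
{
	return free_short && inactive_short;
}

/* Suggested entry test (hypothetical name): only the inactive target matters. */
static bool proposed_enters_refill(bool free_short, bool inactive_short)
{
	(void)free_short;	/* deliberately ignored */
	return inactive_short;
}
```

With free_short set and inactive_short clear, enters_refill() is true
while continues_refill() is false, which is the case where the current
code does a round of refilling it did not need. For the same input
proposed_enters_refill() is false.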

Rik, any thoughts?  This looks as if it's destroying any hope of
maintaining the intended inactive_shortage() targets.

--Stephen

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-08 16:45   ` Rik van Riel
@ 2001-01-08 17:50     ` Linus Torvalds
  2001-01-08 18:21       ` Rik van Riel
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2001-01-08 17:50 UTC (permalink / raw)
  To: Rik van Riel; +Cc: David S. Miller, Marcelo Tosatti, linux-mm


On Mon, 8 Jan 2001, Rik van Riel wrote:

> On Sun, 7 Jan 2001, Linus Torvalds wrote:
> 
> > 	/*
> > 	 * Too many active pages? That implies that we don't have enough
> > 	 * of a working set for page_launder() to do a good job. Start by
> > 	 * walking the VM space..
> > 	 */
> > 	if (nr_active_pages > (total_pages >> 1))
> > 		swap_out();
> > 
> > 	/*
> > 	 * This is where we actually free memory
> > 	 */
> > 	page_launder(..);
> 
> Ahhh, but this is NOT the balancing problem we're trying to
> pin down in 2.4...
> 
> The (possible) problem is in the balancing between swap_out()
> and refill_inactive_scan().

That _is_ the problem the above will fix. Don't read "page_launder()"
there: it's more meant to be "this is the old code that does
page_launder() etc.."

Trust me. Try my code. It will work.

		Linus


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-08 13:11   ` Marcelo Tosatti
  2001-01-08 16:42     ` Rik van Riel
@ 2001-01-08 17:43     ` Linus Torvalds
  1 sibling, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 2001-01-08 17:43 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: David S. Miller, Rik van Riel, linux-mm


On Mon, 8 Jan 2001, Marcelo Tosatti wrote:
> 
> On Sun, 7 Jan 2001, Linus Torvalds wrote:
> 
> > and just get rid of all the logic to try to "find the best mm". It's bogus
> > anyway: we should get perfectly fair access patterns by just doing
> > everything in round-robin, and each "swap_out_mm(mm)" would just try to
> > walk some fixed percentage of the RSS size (say, something like
> > 
> > 	count = (mm->rss >> 4)
> > 
> > and be done with it.
> 
> I have the impression that a fixed percentage of the RSS will be a problem
> when you have a memory hog (or hogs) running.

Nothing but testing can prove it, but I don't think that's really an
issue.

Remember: we're not actually swapping stuff out any more in VM scanning.
We're just saying "we're low on memory, let's evict the page tables so
that we _could_ swap stuff out if necessary". We're going to have to evict
_something_, and walking the page tables really gives us a lot better
knowledge of WHAT to evict.

The cost of scanning the VM is (a) the cost of scanning itself and (b) the
cost of soft-faults and CPU TLB invalidate cross-calls for the scanning.
Both of which might be noticeable - but I have this fairly strong feeling
that neither of them is big enough to offset the cost of paging out the
wrong page. Which we definitely do now - I've got some simple
test-programs that have a VM footprint that is not _that_ much more than
the available memory, and they _really_ show problems.

(The "lots of dirty pages" case is not the common case under most loads,
so the fact that 2.4.0 has some performance problems with it was not a
show-stopper for me - during my testing with low memory most loads were
very nice indeed).

		Linus


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-08 13:57   ` Stephen C. Tweedie
@ 2001-01-08 17:29     ` Linus Torvalds
  2001-01-08 18:10       ` Stephen C. Tweedie
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2001-01-08 17:29 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: David S. Miller, Rik van Riel, Marcelo Tosatti, linux-mm


On Mon, 8 Jan 2001, Stephen C. Tweedie wrote:
> 
> > Then, with something like the above, we just try to make sure that we scan
> > the whole virtual memory space every once in a while. Make the "every once
> > in a while" be some simple heuristic like "try to keep the active list to
> > less than 50% of all memory".
> 
> ... which will produce an enormous storm of soft page faults for
> workloads involving mmapping large amounts of data or where we have
> a lot of space devoted to anonymous pages, such as static
> computational workloads.

I don't think you'll find that in practice. 

It would obviously trigger only on low-memory code _anyway_ (we don't even
get into "try_to_free_pages()" unless there is memory pressure), so I
think you're _completely_ off the mark here.

Remember: the thing doesn't require that < 50% of memory is in the page
tables. It only says: if 50% or more of memory is in the page tables, we
will always scan the page tables first when we try to find free pages.

If you have a well-behaving application that doesn't even have memory
pressure, but fills up >50% of memory in its VM, nothing will actually
happen in the steady state. It can have 99% of available memory, and not a
single soft page fault.

But think about what happens if you now start up another application? And
think about what SHOULD happen. The 50% rule is perfectly fine: if we're
starting to swap, we're better off taking soft page faults that give us a
better LRU than letting the MM scrub the same pages over and over because
it effectively only sees a subset of the total pages (with the mapped
pages being "invisible").

The fact is, that we absolutely _have_ to do the VM scan in order for the
inactive lists to be at all representative of the state of affairs. If we
just rely on page_launder() and refill_inactive() as the #1 way to get
free pages, we will never consider anything but the pages that are already
on the lists.

Stephen: have you tried the behaviour of a working set that is dirty in
the VM's and slightly larger than available ram? Not pretty. We do
_really_ well on many loads, but this one we do badly on. And from what
I've been able to see so far, it's because we're just too damn good at
waiting on page_launder() and doing refill_inactive_scan().

There's another advantage to the 50% rule: if we are under memory
pressure, and somebody is dirtying pages in its VM (which is otherwise an
"invisible" event to the kernel), the 50% rule is much more likely to mean
that we actually _see_ the dirtying, and can slow it down.

		Linus


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-08  6:42 ` Linus Torvalds
  2001-01-08 13:11   ` Marcelo Tosatti
  2001-01-08 13:57   ` Stephen C. Tweedie
@ 2001-01-08 16:45   ` Rik van Riel
  2001-01-08 17:50     ` Linus Torvalds
  2 siblings, 1 reply; 34+ messages in thread
From: Rik van Riel @ 2001-01-08 16:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David S. Miller, Marcelo Tosatti, linux-mm

On Sun, 7 Jan 2001, Linus Torvalds wrote:

> 	/*
> 	 * Too many active pages? That implies that we don't have enough
> 	 * of a working set for page_launder() to do a good job. Start by
> 	 * walking the VM space..
> 	 */
> 	if (nr_active_pages > (total_pages >> 1))
> 		swap_out();
> 
> 	/*
> 	 * This is where we actually free memory
> 	 */
> 	page_launder(..);

Ahhh, but this is NOT the balancing problem we're trying to
pin down in 2.4...

The (possible) problem is in the balancing between swap_out()
and refill_inactive_scan().

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-08 13:11   ` Marcelo Tosatti
@ 2001-01-08 16:42     ` Rik van Riel
  2001-01-08 17:43     ` Linus Torvalds
  1 sibling, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2001-01-08 16:42 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, David S. Miller, linux-mm

On Mon, 8 Jan 2001, Marcelo Tosatti wrote:
> On Sun, 7 Jan 2001, Linus Torvalds wrote:
> 
> > and just get rid of all the logic to try to "find the best mm". It's bogus
> > anyway: we should get perfectly fair access patterns by just doing
> > everything in round-robin, and each "swap_out_mm(mm)" would just try to
> > walk some fixed percentage of the RSS size (say, something like
> > 
> > 	count = (mm->rss >> 4)
> > 
> > and be done with it.
> 
> I have the impression that a fixed percentage of the RSS will be
> a problem when you have a memory hog (or hogs) running.

My RSS ulimit enforcing patches solve this problem in a
very simple way.

If a process is exceeding its RSS limit, we scan ALL pages
from the process. Otherwise, we scan the normal percentage.

Furthermore, I have put a default soft RSS limit of half
of physical memory in the system. This means that when you
have one big runaway process, kswapd will be more aggressive
against that process than against others. The fact that it
is a soft limit, OTOH, means that the process can use all
the available memory if there is no memory pressure in the
system...
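
The policy described above can be sketched as two small helpers. This is
only an illustration under stated assumptions, not the actual patch: the
function names are hypothetical, sizes are in pages, and the "normal
percentage" is assumed to be the rss >> 4 figure from the mail being
answered.

```c
#include <assert.h>

/* Default soft RSS limit: half of physical memory, in pages. */
static unsigned long default_soft_rss_limit(unsigned long phys_pages)
{
	return phys_pages / 2;
}

/*
 * Number of pages to scan in one pass over a process.
 * Over the limit: scan everything. Otherwise: the normal fraction
 * (assumed here to be 1/16 of the RSS).
 */
static unsigned long pages_to_scan(unsigned long rss, unsigned long rss_limit)
{
	if (rss > rss_limit)
		return rss;
	return rss >> 4;
}
```

On a 192MB machine with 4KB pages (49152 pages) the default limit comes
to 24576 pages, so a 170MB hog (43520 pages) is scanned in full on every
pass while a small process is only sampled.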

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-08  6:42 ` Linus Torvalds
  2001-01-08 13:11   ` Marcelo Tosatti
@ 2001-01-08 13:57   ` Stephen C. Tweedie
  2001-01-08 17:29     ` Linus Torvalds
  2001-01-08 16:45   ` Rik van Riel
  2 siblings, 1 reply; 34+ messages in thread
From: Stephen C. Tweedie @ 2001-01-08 13:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David S. Miller, Rik van Riel, Marcelo Tosatti, linux-mm

Hi,

On Sun, Jan 07, 2001 at 10:42:11PM -0800, Linus Torvalds wrote:
> 
> and just get rid of all the logic to try to "find the best mm". It's bogus
> anyway: we should get perfectly fair access patterns by just doing
> everything in round-robin

Definitely.

> Then, with something like the above, we just try to make sure that we scan
> the whole virtual memory space every once in a while. Make the "every once
> in a while" be some simple heuristic like "try to keep the active list to
> less than 50% of all memory".

... which will produce an enormous storm of soft page faults for
workloads involving mmapping large amounts of data or where we have
a lot of space devoted to anonymous pages, such as static
computational workloads.

The idea of an inactive list target is sound, but it needs to be based
on memory pressure: we don't need anything like 50% if we aren't under
any pressure, so compute-bound workloads with large data sets can
achieve stability.

--Stephen

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
  2001-01-08  6:42 ` Linus Torvalds
@ 2001-01-08 13:11   ` Marcelo Tosatti
  2001-01-08 16:42     ` Rik van Riel
  2001-01-08 17:43     ` Linus Torvalds
  2001-01-08 13:57   ` Stephen C. Tweedie
  2001-01-08 16:45   ` Rik van Riel
  2 siblings, 2 replies; 34+ messages in thread
From: Marcelo Tosatti @ 2001-01-08 13:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David S. Miller, Rik van Riel, linux-mm

On Sun, 7 Jan 2001, Linus Torvalds wrote:

> and just get rid of all the logic to try to "find the best mm". It's bogus
> anyway: we should get perfectly fair access patterns by just doing
> everything in round-robin, and each "swap_out_mm(mm)" would just try to
> walk some fixed percentage of the RSS size (say, something like
> 
> 	count = (mm->rss >> 4)
> 
> and be done with it.

I have the impression that a fixed percentage of the RSS will be a problem
when you have a memory hog (or hogs) running.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Subtle MM bug
       [not found] <200101080602.WAA02132@pizda.ninka.net>
@ 2001-01-08  6:42 ` Linus Torvalds
  2001-01-08 13:11   ` Marcelo Tosatti
                     ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Linus Torvalds @ 2001-01-08  6:42 UTC (permalink / raw)
  To: David S. Miller; +Cc: Rik van Riel, Marcelo Tosatti, linux-mm

[ MM people Cc'd, because while I have a plan, I don't have enough time to
  actually put that plan in action. And maybe somebody can shoot down my
  brilliant plan. ]

On Sun, 7 Jan 2001, David S. Miller wrote:
> 
> BTW, this reminds me.  Now that you keep track of the "all mm's" list
> thingy, you can also keep track of "nr_mms" in the system and do that
> little:
> 
> 	for (i = 0; i < (nr_mms >> priority); i++)
> 		pagetable_scan();
> 
> thing you were talking about last week.

This is the whole reason for making that list in the first place. 

Even more subtle: see the comment in kernel/fork.c about keeping the list
of mm's in order. What I _really_ want to do is something like

void swap_out(void)
{
	int i;

	for (i = 0; i < (nr_mms >> priority); i++) {
		struct list_head *p;
		struct mm_struct *mm;

		spin_lock(&mmlist_lock);
		p = init_mm.mmlist.next;
		if (p != &init_mm.mmlist) {
			mm = list_entry(p, struct mm_struct, mmlist);

			/* Move it to the back of the queue */
			list_del(p);
			__list_add(p, init_mm.mmlist.prev, &init_mm.mmlist);
			/* Pin the mm so it cannot vanish once the lock is dropped */
			atomic_inc(&mm->mm_users);
			spin_unlock(&mmlist_lock);

			swap_out_mm(mm);
			mmput(mm);	/* drop the reference taken above */
			continue;
		}
		/* empty mm-list - shouldn't really happen except during bootup */
		spin_unlock(&mmlist_lock);
		break;
	}
}

and just get rid of all the logic to try to "find the best mm". It's bogus
anyway: we should get perfectly fair access patterns by just doing
everything in round-robin, and each "swap_out_mm(mm)" would just try to
walk some fixed percentage of the RSS size (say, something like

	count = (mm->rss >> 4)

and be done with it.
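
To see how the round-robin walk and the fixed quota interact, here is a
minimal user-space sketch. Everything in it is a stand-in: struct mm_sim,
scan_quota() and swap_out_sim() are hypothetical names, the list is a
plain circular singly linked list instead of the kernel's list_head, and
"scanning" is reduced to adding the quota to a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for mm_struct: only what the sketch needs. */
struct mm_sim {
	unsigned long rss;	/* resident pages */
	unsigned long scanned;	/* pages walked so far (bookkeeping only) */
	struct mm_sim *next;	/* circular list of address spaces */
};

/* Fixed fraction of the RSS to walk per visit: 1/16, i.e. rss >> 4. */
static unsigned long scan_quota(const struct mm_sim *mm)
{
	return mm->rss >> 4;
}

/*
 * Visit 'visits' address spaces in round-robin order starting at *cursor.
 * Advancing the cursor plays the role of moving the visited mm to the
 * back of the queue.
 */
static void swap_out_sim(struct mm_sim **cursor, unsigned int visits)
{
	assert(cursor != NULL);
	while (visits-- > 0 && *cursor != NULL) {
		struct mm_sim *mm = *cursor;

		mm->scanned += scan_quota(mm);
		*cursor = mm->next;
	}
}
```

Every address space gets the same number of visits, but a process with a
large RSS has proportionally more pages walked per visit, which is where
the fairness comes from.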

Then, with something like the above, we just try to make sure that we scan
the whole virtual memory space every once in a while. Make the "every once
in a while" be some simple heuristic like "try to keep the active list to
less than 50% of all memory". So "try_to_free_memory()" would just start
off with something like

	/*
	 * Too many active pages? That implies that we don't have enough
	 * of a working set for page_launder() to do a good job. Start by
	 * walking the VM space..
	 */
	if (nr_active_pages > (total_pages >> 1))
		swap_out();

	/*
	 * This is where we actually free memory
	 */
	page_launder(..);

and we'd be all done. (And that "max 50% of all pages should be active"
number was taken out of my ass. AND the above will work really badly if
there is no swap-space, so it needs tweaking - think of it not as a hard
algorithm, but more as a "this is where I think we need to go").
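
The intent of the check, walking the page tables first once the active
list exceeds half of memory, can be written as a small predicate. This is
a minimal sketch: should_walk_vm() and its free_swap_pages parameter are
hypothetical, and the early return for the no-swap case is just one crude
way of addressing the caveat above, not part of the plan itself.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether to walk the VM before laundering pages.
 * All arguments are page counts.
 */
static bool should_walk_vm(unsigned long nr_active_pages,
			   unsigned long total_pages,
			   unsigned long free_swap_pages)
{
	assert(nr_active_pages <= total_pages);

	/* Assumption: with no swap left, aging the page tables buys little. */
	if (free_swap_pages == 0)
		return false;

	/* "Max 50% of all pages should be active." */
	return nr_active_pages > (total_pages >> 1);
}
```

The 50% threshold is as arbitrary here as it is in the text; the point is
only that the walk is triggered by the size of the active list, not by
running out of candidates on the other lists.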

Advantage: it automatically does the right thing: if the reason for the
memory pressure is that we have lots of pages mapped, it will scan the VM
lists. If the reason is that we just have tons of pages cached, it won't
even bother to age the page tables.

Right now we have this cockamamy scheme to try to balance off the lists
against each other, and then at fairly random points we'll get to
"swap_out()" if we haven't found anything nice on the other lists. That's
just not the way to get nice MM behaviour.

I'll bet you $5 USD that the above approach will (a) work fairly and
(b) give much smoother behavior with a much more understandable swap-out
policy.

Of course, I've been wrong before. But I'd like somebody to take a look.

Anybody?

		Linus


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2001-01-18  1:32 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-01-07 20:59 Subtle MM bug Zlatko Calusic
2001-01-07 21:37 ` Rik van Riel
2001-01-07 22:33   ` Zlatko Calusic
2001-01-09  2:01   ` Zlatko Calusic
2001-01-17  4:48     ` Rik van Riel
2001-01-17 18:53       ` Zlatko Calusic
2001-01-18  1:32         ` Rik van Riel
     [not found] <200101080602.WAA02132@pizda.ninka.net>
2001-01-08  6:42 ` Linus Torvalds
2001-01-08 13:11   ` Marcelo Tosatti
2001-01-08 16:42     ` Rik van Riel
2001-01-08 17:43     ` Linus Torvalds
2001-01-08 13:57   ` Stephen C. Tweedie
2001-01-08 17:29     ` Linus Torvalds
2001-01-08 18:10       ` Stephen C. Tweedie
2001-01-08 21:52         ` Marcelo Tosatti
2001-01-09  0:28           ` Linus Torvalds
2001-01-08 23:49             ` Marcelo Tosatti
2001-01-09  3:12               ` Linus Torvalds
2001-01-09 20:33                 ` Marcelo Tosatti
2001-01-09 22:44                   ` Linus Torvalds
2001-01-09 21:33                     ` Marcelo Tosatti
2001-01-09 23:58                       ` Linus Torvalds
2001-01-09 22:21                         ` Marcelo Tosatti
2001-01-10  0:23                           ` Linus Torvalds
2001-01-10  0:12                             ` Marcelo Tosatti
2001-01-10 11:29                               ` Stephen C. Tweedie
2001-01-11  3:30                             ` Marcelo Tosatti
2001-01-11  9:42                               ` Stephen C. Tweedie
2001-01-11 15:24                                 ` Marcelo Tosatti
2001-01-17  4:54                 ` Rik van Riel
2001-01-08 16:45   ` Rik van Riel
2001-01-08 17:50     ` Linus Torvalds
2001-01-08 18:21       ` Rik van Riel
2001-01-08 18:38         ` Linus Torvalds
