[PATCH] swap_state.c thinko

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] swap_state.c thinko
@ 2001-04-05 15:56 Ben LaHaise
  2001-04-05 16:05 ` Rik van Riel
  2001-04-05 17:21 ` Hugh Dickins
  0 siblings, 2 replies; 33+ messages in thread
From: Ben LaHaise @ 2001-04-05 15:56 UTC (permalink / raw)
  To: arjanv, alan, torvalds; +Cc: linux-mm

Hey folks,

Here's another one liner that closes an smp race that could corrupt
things.

		-ben

diff -urN v2.4.3/mm/swap_state.c work-2.4.3/mm/swap_state.c
--- v2.4.3/mm/swap_state.c	Fri Dec 29 18:04:27 2000
+++ work-2.4.3/mm/swap_state.c	Thu Apr  5 11:55:00 2001
@@ -140,7 +140,7 @@
 	/*
 	 * If we are the only user, then try to free up the swap cache.
 	 */
-	if (PageSwapCache(page) && !TryLockPage(page)) {
+	if (!TryLockPage(page) && PageSwapCache(page)) {
 		if (!is_page_shared(page)) {
 			delete_from_swap_cache_nolock(page);
 		}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-05 15:56 [PATCH] swap_state.c thinko Ben LaHaise
@ 2001-04-05 16:05 ` Rik van Riel
  2001-04-05 17:11   ` Ben LaHaise
  2001-04-05 17:21 ` Hugh Dickins
  1 sibling, 1 reply; 33+ messages in thread
From: Rik van Riel @ 2001-04-05 16:05 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: arjanv, alan, torvalds, linux-mm

On Thu, 5 Apr 2001, Ben LaHaise wrote:

> Here's another one liner that closes an smp race that could corrupt
> things.

> -	if (PageSwapCache(page) && !TryLockPage(page)) {
> +	if (!TryLockPage(page) && PageSwapCache(page)) {
>  		if (!is_page_shared(page)) {
>  			delete_from_swap_cache_nolock(page);
>  		}

I sure hope the page is unlocked afterwards, regardless of
whether it's (still) in the swap cache or not ...

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-05 16:05 ` Rik van Riel
@ 2001-04-05 17:11   ` Ben LaHaise
  2001-04-05 23:40     ` Andrea Arcangeli
  2001-04-06  0:32     ` Linus Torvalds
  0 siblings, 2 replies; 33+ messages in thread
From: Ben LaHaise @ 2001-04-05 17:11 UTC (permalink / raw)
  To: Rik van Riel; +Cc: arjanv, alan, torvalds, linux-mm

On Thu, 5 Apr 2001, Rik van Riel wrote:

> I sure hope the page is unlocked afterwards, regardless of
> whether it's (still) in the swap cache or not ...

You're right.  Here's the hopefully correct version.

		-ben

diff -ur v2.4.3/mm/swap_state.c work-2.4.3/mm/swap_state.c
--- v2.4.3/mm/swap_state.c	Fri Dec 29 18:04:27 2000
+++ work-2.4.3/mm/swap_state.c	Thu Apr  5 13:10:27 2001
@@ -140,10 +140,9 @@
 	/*
 	 * If we are the only user, then try to free up the swap cache.
 	 */
-	if (PageSwapCache(page) && !TryLockPage(page)) {
-		if (!is_page_shared(page)) {
+	if (!TryLockPage(page)) {
+		if (PageSwapCache(page) && !is_page_shared(page))
 			delete_from_swap_cache_nolock(page);
-		}
 		UnlockPage(page);
 	}
 	page_cache_release(page);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-05 15:56 [PATCH] swap_state.c thinko Ben LaHaise
  2001-04-05 16:05 ` Rik van Riel
@ 2001-04-05 17:21 ` Hugh Dickins
  2001-04-05 21:39   ` Richard Jerrell
  1 sibling, 1 reply; 33+ messages in thread
From: Hugh Dickins @ 2001-04-05 17:21 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: arjanv, alan, torvalds, sct, jerrell, linux-mm

On Thu, 5 Apr 2001, Ben LaHaise wrote:
> 
> Here's another one liner that closes an smp race that could corrupt
> things.
> 
> diff -urN v2.4.3/mm/swap_state.c work-2.4.3/mm/swap_state.c
> --- v2.4.3/mm/swap_state.c	Fri Dec 29 18:04:27 2000
> +++ work-2.4.3/mm/swap_state.c	Thu Apr  5 11:55:00 2001
> @@ -140,7 +140,7 @@
>  	/*
>  	 * If we are the only user, then try to free up the swap cache.
>  	 */
> -	if (PageSwapCache(page) && !TryLockPage(page)) {
> +	if (!TryLockPage(page) && PageSwapCache(page)) {
>  		if (!is_page_shared(page)) {
>  			delete_from_swap_cache_nolock(page);
>  		}

I agree that PageSwapCache(page) needs to be retested when(if) the
page lock is acquired, but isn't it best to check PageSwapCache(page)
first as at present - won't it very often fail? won't the overhead of
TryLocking and Unlocking every page slow down a hot path?

And isn't this free_page_and_swap_cache(), precisely the area that's
currently subject to debate and patches, because swap pages are not
getting freed soon enough?  I haven't been following that discussion
with full understanding, and haven't seen a full explanation of the
problem to be solved; but I'd rather _imagined_ it was that the page
would here be on an LRU list, raising its count and causing the
is_page_shared(page) test to succeed despite not really shared.
So I'd been expecting a patch to remove this code completely.

Forgive me if way off base...
Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-05 17:21 ` Hugh Dickins
@ 2001-04-05 21:39   ` Richard Jerrell
  0 siblings, 0 replies; 33+ messages in thread
From: Richard Jerrell @ 2001-04-05 21:39 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-mm

> I agree that PageSwapCache(page) needs to be retested when(if) the
> page lock is acquired, but isn't it best to check PageSwapCache(page)
> first as at present - won't it very often fail? won't the overhead of
> TryLocking and Unlocking every page slow down a hot path?

Yes and no.  It's only a couple of bit operations, so it's probably a
pretty negligable slow-down.  But, yes it will quite often fail especially
in low memory usage situations where there isn't much swapping going
on.  Adding a second test after the lock would slow down the infrequent
case and make it much more like the way lookup_swap_cache works.

> And isn't this free_page_and_swap_cache(), precisely the area that's
> currently subject to debate and patches, because swap pages are not
> getting freed soon enough?

What I think you are talking about are three seperate problems, somewhat
related.

1)  swap cache pages aren't counted in vm_enough_memory as free, meaning
that you can fail when trying to allocate memory merely because a lot of
pages have already been swapped out but not yet reclaimed or possibly even
laundered.  Because these pages are already in the swap cache, we know
that they can be freed if the normal path of kswapd is followed.

2)  we waste time by laundering swap cache pages that are no longer being
referenced by either ptes or indirectly through the swap cell references.

Problem 1 is what is causing quite a few people to fail prematurely when
trying to allocate memory.  Problem 2 is just wasting our time.  Combined,
however, the two problems have dead swap cache pages eating up swap cells,
memory, and time.

> problem to be solved; but I'd rather _imagined_ it was that the page
> would here be on an LRU list, raising its count and causing the
> is_page_shared(page) test to succeed despite not really shared.

is_page_shared is expecting that someone has one of the references to the
page and is trying to determine whether or not other people are interested
in it.  Being on the LRU isn't necessarily going to have the page's count
bumped by one.  As a matter of practice, though, the pages on the LRU are
all in the swap cache which by way of having the page referenced by the
swap cell will bump the count by one.  All this is not really part of the
problem.  The problem is just that 1) swap cache pages are freeable and
2) we didn't check to make sure anyone wanted that page before writing it
to disk.

Rich

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-05 17:11   ` Ben LaHaise
@ 2001-04-05 23:40     ` Andrea Arcangeli
  2001-04-06  0:32     ` Linus Torvalds
  1 sibling, 0 replies; 33+ messages in thread
From: Andrea Arcangeli @ 2001-04-05 23:40 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Rik van Riel, arjanv, alan, torvalds, linux-mm

On Thu, Apr 05, 2001 at 01:11:30PM -0400, Ben LaHaise wrote:
> diff -ur v2.4.3/mm/swap_state.c work-2.4.3/mm/swap_state.c
> --- v2.4.3/mm/swap_state.c	Fri Dec 29 18:04:27 2000
> +++ work-2.4.3/mm/swap_state.c	Thu Apr  5 13:10:27 2001
> @@ -140,10 +140,9 @@
>  	/*
>  	 * If we are the only user, then try to free up the swap cache.
>  	 */
> -	if (PageSwapCache(page) && !TryLockPage(page)) {
> -		if (!is_page_shared(page)) {
> +	if (!TryLockPage(page)) {
> +		if (PageSwapCache(page) && !is_page_shared(page))
>  			delete_from_swap_cache_nolock(page);
> -		}
>  		UnlockPage(page);
>  	}
>  	page_cache_release(page);

swap cache pages should not be freeable by the memory balancing code because if
you're running at that point the reference count of the swap cache has to be > 1.

swapoff will grab the pagetable spinlock before dropping the swap cache
so it shouldn't run under such code either (and swapoff was used to
have other window for races anyways).

could you elaborate what can eat the swap cache from under you if you
don't first lock down the page before checking the swapcache bit? I thought
the reason for grabbing the lock there is just to do the trylock instead
of lock_page(): we can't use the delete_from_swap_cache that could otherwise
sleep if the page was for example locked down by the memory balancing code
while we were running there (if we fail we simply left some more spurious swap
cache).

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-05 17:11   ` Ben LaHaise
  2001-04-05 23:40     ` Andrea Arcangeli
@ 2001-04-06  0:32     ` Linus Torvalds
  2001-04-06 16:31       ` Hugh Dickins
  1 sibling, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2001-04-06  0:32 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Rik van Riel, arjanv, alan, linux-mm


On Thu, 5 Apr 2001, Ben LaHaise wrote:
>
> You're right.  Here's the hopefully correct version.

I'd prefer something more along these lines: it gets rid of
free_page_and_swap_cache() altogether, along with "is_page_shared()",
realizing that "is_page_shared()" was only validly used on swap-cache
pages anyway and thus getting rid of the generic tests it had for other
kinds of pages.

It also fixes the swap sharing criteria to properly accept the case of a
page that has buffers but no other usage (which it pretty much always will
have, if it was truly read in from disk).

As far as I can tell, the lack of buffer-testing meant that we almost
_never_ just re-used the cache page directly on a read-swapin followed by
a write to the page.  Can that really be true? That should be the common
case, and would have made the swap cache much less effective than it
should be.

Anybody see any thinko's here?

		Linus


------
diff -u --recursive --new-file v2.4.3/linux/arch/sparc/mm/generic.c linux/arch/sparc/mm/generic.c
--- v2.4.3/linux/arch/sparc/mm/generic.c	Wed Aug  9 13:49:55 2000
+++ linux/arch/sparc/mm/generic.c	Thu Apr  5 14:38:29 2001
@@ -21,11 +21,7 @@
 		struct page *ptpage = pte_page(page);
 		if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage))
 			return;
-		/*
-		 * free_page() used to be able to clear swap cache
-		 * entries.  We may now have to do it manually.
-		 */
-		free_page_and_swap_cache(ptpage);
+		page_cache_release(page);
 		return;
 	}
 	swap_free(pte_to_swp_entry(page));
diff -u --recursive --new-file v2.4.3/linux/arch/sparc64/mm/generic.c linux/arch/sparc64/mm/generic.c
--- v2.4.3/linux/arch/sparc64/mm/generic.c	Mon Mar 26 15:42:57 2001
+++ linux/arch/sparc64/mm/generic.c	Thu Apr  5 14:38:45 2001
@@ -21,11 +21,7 @@
 		struct page *ptpage = pte_page(page);
 		if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage))
 			return;
-		/*
-		 * free_page() used to be able to clear swap cache
-		 * entries.  We may now have to do it manually.
-		 */
-		free_page_and_swap_cache(ptpage);
+		page_cache_release(ptpage);
 		return;
 	}
 	swap_free(pte_to_swp_entry(page));
diff -u --recursive --new-file v2.4.3/linux/include/linux/swap.h linux/include/linux/swap.h
--- v2.4.3/linux/include/linux/swap.h	Mon Mar 26 15:48:11 2001
+++ linux/include/linux/swap.h	Thu Apr  5 15:44:52 2001
@@ -134,7 +134,6 @@
 extern void __delete_from_swap_cache(struct page *page);
 extern void delete_from_swap_cache(struct page *page);
 extern void delete_from_swap_cache_nolock(struct page *page);
-extern void free_page_and_swap_cache(struct page *page);

 /* linux/mm/swapfile.c */
 extern unsigned int nr_swapfiles;
@@ -166,23 +165,6 @@
 extern unsigned long swap_cache_find_total;
 extern unsigned long swap_cache_find_success;
 #endif
-
-/*
- * Work out if there are any other processes sharing this page, ignoring
- * any page reference coming from the swap cache, or from outstanding
- * swap IO on this page.  (The page cache _does_ count as another valid
- * reference to the page, however.)
- */
-static inline int is_page_shared(struct page *page)
-{
-	unsigned int count;
-	if (PageReserved(page))
-		return 1;
-	count = page_count(page);
-	if (PageSwapCache(page))
-		count += swap_count(page) - 2 - !!page->buffers;
-	return  count > 1;
-}

 extern spinlock_t pagemap_lru_lock;

diff -u --recursive --new-file v2.4.3/linux/mm/memory.c linux/mm/memory.c
--- v2.4.3/linux/mm/memory.c	Mon Mar 26 11:02:24 2001
+++ linux/mm/memory.c	Thu Apr  5 16:41:17 2001
@@ -274,7 +274,7 @@
 		 */
 		if (pte_dirty(pte) && page->mapping)
 			set_page_dirty(page);
-		free_page_and_swap_cache(page);
+		page_cache_release(page);
 		return 1;
 	}
 	swap_free(pte_to_swp_entry(pte));
@@ -815,6 +815,24 @@
 }

 /*
+ * Work out if there are any other processes sharing this
+ * swap cache page. Never mind the buffers.
+ */
+static inline int exclusive_swap_page(struct page *page)
+{
+	unsigned int count;
+
+	if (!PageLocked(page))
+		BUG();
+	if (!PageSwapCache(page))
+		return 0;
+	count = page_count(page) - !!page->buffers;	/*  2: us + swap cache */
+	count += swap_count(page);			/* +1: just swap cache */
+	return count == 3;				/* =3: total */
+}
+
+
+/*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
  * and decrementing the shared-page counter for the old page.
@@ -853,19 +871,21 @@
 	 *   marked dirty).
 	 */
 	switch (page_count(old_page)) {
+	int can_reuse;
+	case 3:
+		if (!old_page->buffers)
+			break;
+		/* FallThrough */
 	case 2:
-		/*
-		 * Lock the page so that no one can look it up from
-		 * the swap cache, grab a reference and start using it.
-		 * Can not do lock_page, holding page_table_lock.
-		 */
-		if (!PageSwapCache(old_page) || TryLockPage(old_page))
+		if (!PageSwapCache(old_page))
 			break;
-		if (is_page_shared(old_page)) {
-			UnlockPage(old_page);
+		if (TryLockPage(old_page))
 			break;
-		}
+		/* Recheck swapcachedness once the page is locked */
+		can_reuse = exclusive_swap_page(old_page);
 		UnlockPage(old_page);
+		if (!can_reuse)
+			break;
 		/* FallThrough */
 	case 1:
 		if (PageReserved(old_page))
@@ -903,8 +923,7 @@
 	return -1;
 }

-static void vmtruncate_list(struct vm_area_struct *mpnt,
-			    unsigned long pgoff, unsigned long partial)
+static void vmtruncate_list(struct vm_area_struct *mpnt, unsigned long pgoff)
 {
 	do {
 		struct mm_struct *mm = mpnt->vm_mm;
@@ -947,7 +966,7 @@
  */
 void vmtruncate(struct inode * inode, loff_t offset)
 {
-	unsigned long partial, pgoff;
+	unsigned long pgoff;
 	struct address_space *mapping = inode->i_mapping;
 	unsigned long limit;

@@ -959,19 +978,15 @@
 		goto out_unlock;

 	pgoff = (offset + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	partial = (unsigned long)offset & (PAGE_CACHE_SIZE - 1);
-
 	if (mapping->i_mmap != NULL)
-		vmtruncate_list(mapping->i_mmap, pgoff, partial);
+		vmtruncate_list(mapping->i_mmap, pgoff);
 	if (mapping->i_mmap_shared != NULL)
-		vmtruncate_list(mapping->i_mmap_shared, pgoff, partial);
+		vmtruncate_list(mapping->i_mmap_shared, pgoff);

 out_unlock:
 	spin_unlock(&mapping->i_shared_lock);
 	truncate_inode_pages(mapping, offset);
-	if (inode->i_op && inode->i_op->truncate)
-		inode->i_op->truncate(inode);
-	return;
+	goto out_truncate;

 do_expand:
 	limit = current->rlim[RLIMIT_FSIZE].rlim_cur;
@@ -986,8 +1001,13 @@
 		}
 	}
 	inode->i_size = offset;
-	if (inode->i_op && inode->i_op->truncate)
+
+out_truncate:
+	if (inode->i_op && inode->i_op->truncate) {
+		lock_kernel();
 		inode->i_op->truncate(inode);
+		unlock_kernel();
+	}
 out:
 	return;
 }
@@ -1077,7 +1097,7 @@
 	pte = mk_pte(page, vma->vm_page_prot);

 	swap_free(entry);
-	if (write_access && !is_page_shared(page))
+	if (write_access && exclusive_swap_page(page))
 		pte = pte_mkwrite(pte_mkdirty(pte));
 	UnlockPage(page);

diff -u --recursive --new-file v2.4.3/linux/mm/swap_state.c linux/mm/swap_state.c
--- v2.4.3/linux/mm/swap_state.c	Fri Dec 29 15:04:27 2000
+++ linux/mm/swap_state.c	Thu Apr  5 14:35:08 2001
@@ -17,8 +17,22 @@

 #include <asm/pgtable.h>

+/*
+ * We may have stale swap cache pages in memory: notice
+ * them here and get rid of the unnecessary final write.
+ */
 static int swap_writepage(struct page *page)
 {
+	/* One for the page cache, one for this user, one for page->buffers */
+	if (page_count(page) > 2 + !!page->buffers)
+		goto in_use;
+	if (swap_count(page) > 1)
+		goto in_use;
+
+	/* We could remove it here, but page_launder will do it anyway */
+	return 0;
+
+in_use:
 	rw_swap_page(WRITE, page, 0);
 	return 0;
 }
@@ -129,26 +143,6 @@
 	delete_from_swap_cache_nolock(page);
 	UnlockPage(page);
 }
-
-/*
- * Perform a free_page(), also freeing any swap cache associated with
- * this page if it is the last user of the page. Can not do a lock_page,
- * as we are holding the page_table_lock spinlock.
- */
-void free_page_and_swap_cache(struct page *page)
-{
-	/*
-	 * If we are the only user, then try to free up the swap cache.
-	 */
-	if (PageSwapCache(page) && !TryLockPage(page)) {
-		if (!is_page_shared(page)) {
-			delete_from_swap_cache_nolock(page);
-		}
-		UnlockPage(page);
-	}
-	page_cache_release(page);
-}
-

 /*
  * Lookup a swap entry in the swap cache. A found page will be returned

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06  0:32     ` Linus Torvalds
@ 2001-04-06 16:31       ` Hugh Dickins
  2001-04-06 17:21         ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Hugh Dickins @ 2001-04-06 16:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Rik van Riel, Richard Jerrrell, Stephen Tweedie,
	arjanv, alan, linux-mm

On Thu, 5 Apr 2001, Linus Torvalds wrote:
> 
> I'd prefer something more along these lines: it gets rid of
> free_page_and_swap_cache() altogether, along with "is_page_shared()",
> realizing that "is_page_shared()" was only validly used on swap-cache
> pages anyway and thus getting rid of the generic tests it had for other
> kinds of pages.

I like this direction, but (if I understand the issues better today
than I did yesterday) the patch you posted looks seriously incomplete
to me.  While it deals with one of the issues raised by Rich Jerrell
(writing dead swap pages), doesn't it exacerbate his other issue?

That after a process exits (unmaps), if a page was in the swap cache,
its swap slot will remain allocated until vmscan.c gets to deal with it;
which would be okay, except that vm_enough_memory() may give false
negatives meanwhile, because there's no count of such pages (and Rich
gave the nice example that immediately starting up the same program
again was liable to fail because of this).  Exacerbates, because
that problem was just on the !pte_present entries, now with your
patch it would be on the pte_present entries too.

But I don't agree with the way Rich adds his nr_swap_cache_pages to
"free" in his vm_enough_memory(), because cached pages are all already
counted into "free" from page_cache_size - so I believe he's double-
accounting all the swap cache pages as free, when it should just be
those which (could) have been freed on exit/unmap.  And to count those,
I think you'd have to reinstate code like free_page_and_swap_cache().

Instead, perhaps vm_enough_memory() should force a scan to free
before failing?  And would need to register its own memory pressure,
so the scan tries hard enough to provide what will be needed?

Sorry, we've moved well away from Ben's "swap_state.c thinko".

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 16:31       ` Hugh Dickins
@ 2001-04-06 17:21         ` Linus Torvalds
  2001-04-06 18:23           ` Hugh Dickins
  2001-04-06 18:47           ` Andrea Arcangeli
  0 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2001-04-06 17:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ben LaHaise, Rik van Riel, Richard Jerrrell, Stephen Tweedie,
	arjanv, alan, linux-mm

On Fri, 6 Apr 2001, Hugh Dickins wrote:
>
> I like this direction, but (if I understand the issues better today
> than I did yesterday) the patch you posted looks seriously incomplete
> to me.  While it deals with one of the issues raised by Rich Jerrell
> (writing dead swap pages), doesn't it exacerbate his other issue?

Yes. However, I'm assuming most of that is just "statistics get buggered",
in that swap pages tend to stay around for longer than they really are
used. It was true before, but it would be _consistently_ true now.

I don't agree with your vm_enough_memory() worry - it should be correct
already, because it shows up as page cache pages (and that, in turn, is
already taken care of). In fact, the swap cache pages shouldn't even
create any new special cases: they are exactly equivalent to already-
existing page cache pages.

So if this patch generates any problems, then those problems were
pre-existing, and the patch just triggers them more easily. For example,
it may shift the load to look a bit more like there are lots of dirty
shared pages, and maybe we haven't handled that load well before. In that
sense this is one of those "can trigger cases we haven't tested well
enough" patches, but I don't think it should introduce _new_ problems.

> Instead, perhaps vm_enough_memory() should force a scan to free
> before failing?  And would need to register its own memory pressure,
> so the scan tries hard enough to provide what will be needed?

I don't think so. It _already_ considers every single page cache page to
be completely droppable. So I don't think vm_enough_memory() can fail any
more due to my change - the only thing that happened was that pages that
were counted as "free" got moved to "page cache", but the math ends up
giving the exact same answer anyway:

	...
        free += atomic_read(&page_cache_size);
        free += nr_free_pages();
	...

Does anybody actually see failures with this path? Maybe they are due to
"page_launder()" not getting rid of the page swap pages aggressively
enough..

(I considered moving the swap-cache page earlier in page_launder(), but
I'd just be happier if we could have this all in swap_writepage() and not
pollute any of the rest of the VM at all. Pipe-dream, maybe).

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 17:21         ` Linus Torvalds
@ 2001-04-06 18:23           ` Hugh Dickins
  2001-04-06 18:57             ` Linus Torvalds
  2001-04-06 18:47           ` Andrea Arcangeli
  1 sibling, 1 reply; 33+ messages in thread
From: Hugh Dickins @ 2001-04-06 18:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Rik van Riel, Richard Jerrrell, Stephen Tweedie,
	arjanv, alan, linux-mm

On Fri, 6 Apr 2001, Linus Torvalds wrote:
> 
> On Fri, 6 Apr 2001, Hugh Dickins wrote:
> >
> > I like this direction, but (if I understand the issues better today
> > than I did yesterday) the patch you posted looks seriously incomplete
> > to me.  While it deals with one of the issues raised by Rich Jerrell
> > (writing dead swap pages), doesn't it exacerbate his other issue?
> 
> Yes. However, I'm assuming most of that is just "statistics get buggered",
> in that swap pages tend to stay around for longer than they really are
> used. It was true before, but it would be _consistently_ true now.

Yes, I like it that the pte_present route becomes consistent with
the !pte_present route, and I share your belief that any problems
won't be _new_ ones.  But I supposed that Rich was describing a
practical problem, not just temporarily buggered statistics.

> I don't agree with your vm_enough_memory() worry - it should be correct
> already, because it shows up as page cache pages (and that, in turn, is
> already taken care of). In fact, the swap cache pages shouldn't even
> create any new special cases: they are exactly equivalent to already-
> existing page cache pages.

It is, of course, remotely conceivable that I'm confused, but...
I realize that the page cache pages (including those of swap)
are already added into "free" by vm_enough_memory().  But it's also
adding in nr_swap_pages, and that number is significantly less than
what it should be, because freeable swap slots have not been freed.
Therefore I think we need to add in that (sadly unknown) number of
should-have-been-freed slots - not because the memory hasn't been
properly counted, but because the swap hasn't been properly counted.

If this is not the case, then I (again) don't understand Rich's
difficulty in running the program just after it exited.

> (I considered moving the swap-cache page earlier in page_launder(), but
> I'd just be happier if we could have this all in swap_writepage() and not
> pollute any of the rest of the VM at all. Pipe-dream, maybe).

Aside from the vm_enough_memory() issue, if you leave page_launder()
to clean up, then some reordering there might well be good: isn't it
liable to clean and free some aged but potentially useful pages (e.g.
cached pages of live data on swap) before the entirely useless cached
pages of dead process data?

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 18:47           ` Andrea Arcangeli
@ 2001-04-06 18:37             ` Hugh Dickins
  2001-04-06 19:09               ` Andrea Arcangeli
                                 ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Hugh Dickins @ 2001-04-06 18:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Ben LaHaise, Rik van Riel, Richard Jerrrell,
	Stephen Tweedie, arjanv, alan, linux-mm

On Fri, 6 Apr 2001, Andrea Arcangeli wrote:
> 
> swap cache also decrease the amount free-swap-space, that will be reclaimed as
> soon as we collect the swap cache. so we must add the swap cache size to the
> amount of virtual memory available (in addition to the in-core pagecachesize)
> to take care of the swap side. I suggested that as the fix for the failed
> malloc issue to the missioncritical guys when they asked me about that.

swapper_space.nrpages, that's neat, but I insist it's not right.
You're then double counting into "free" all the swap cache pages
(already included in page_cache_size) which correspond to pages
of swap for running processes - erring in the opposite direction
to the present code.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 17:21         ` Linus Torvalds
  2001-04-06 18:23           ` Hugh Dickins
@ 2001-04-06 18:47           ` Andrea Arcangeli
  2001-04-06 18:37             ` Hugh Dickins
  1 sibling, 1 reply; 33+ messages in thread
From: Andrea Arcangeli @ 2001-04-06 18:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Ben LaHaise, Rik van Riel, Richard Jerrrell,
	Stephen Tweedie, arjanv, alan, linux-mm

On Fri, Apr 06, 2001 at 10:21:38AM -0700, Linus Torvalds wrote:
> I don't agree with your vm_enough_memory() worry - it should be correct
> already, because it shows up as page cache pages (and that, in turn, is
> already taken care of). In fact, the swap cache pages shouldn't even
> create any new special cases: they are exactly equivalent to already-
> existing page cache pages.

swap cache also decrease the amount free-swap-space, that will be reclaimed as
soon as we collect the swap cache. so we must add the swap cache size to the
amount of virtual memory available (in addition to the in-core pagecachesize)
to take care of the swap side. I suggested that as the fix for the failed
malloc issue to the missioncritical guys when they asked me about that.
However I think I seen some overkill patch floating around, the fix is just a
one liner:

--- 2.4.3aa/mm/mmap.c.~1~	Fri Apr  6 05:10:16 2001
+++ 2.4.3aa/mm/mmap.c	Fri Apr  6 20:44:18 2001
@@ -64,6 +64,7 @@
 	free += atomic_read(&page_cache_size);
 	free += nr_free_pages();
 	free += nr_swap_pages;
+	free += swapper_space.nrpages;
 	/*
 	 * The code below doesn't account for free space in the inode
 	 * and dentry slab cache, slab cache fragmentation, inodes and

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 19:09               ` Andrea Arcangeli
@ 2001-04-06 18:53                 ` Hugh Dickins
  2001-04-06 19:14                 ` Andrea Arcangeli
  1 sibling, 0 replies; 33+ messages in thread
From: Hugh Dickins @ 2001-04-06 18:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Ben LaHaise, Rik van Riel, Richard Jerrrell,
	Stephen Tweedie, arjanv, alan, linux-mm

On Fri, 6 Apr 2001, Andrea Arcangeli wrote:
> On Fri, Apr 06, 2001 at 07:37:55PM +0100, Hugh Dickins wrote:
> > swapper_space.nrpages, that's neat, but I insist it's not right.
> > You're then double counting into "free" all the swap cache pages
> > (already included in page_cache_size) which correspond to pages
> > of swap for running processes - erring in the opposite direction
> > to the present code.
> 
> The whole point is that errirng in the opposite direction is perfectly fine,
> that's expected.  Understimating is a bug instead. Period.
> 
> We always overstimate anyways, we have to because we don't have information
> about the really freeable memory (think at the buffer cache pinned in the
> superblock metadata of ext2, do you expect to be able to reclaim it somehow?).

Well, that's an interesting point of view... but if it's so okay
to overestimate, couldn't we simplify vm_enough_pages() somewhat?

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 18:23           ` Hugh Dickins
@ 2001-04-06 18:57             ` Linus Torvalds
  2001-04-06 19:06               ` Rik van Riel
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2001-04-06 18:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ben LaHaise, Rik van Riel, Richard Jerrrell, Stephen Tweedie,
	arjanv, alan, linux-mm


On Fri, 6 Apr 2001, Hugh Dickins wrote:
>
> It is, of course, remotely conceivable that I'm confused, but...
> I realize that the page cache pages (including those of swap)
> are already added into "free" by vm_enough_memory().  But it's also
> adding in nr_swap_pages

.. Ahh. Right you are. We did not just move it from the "free page" to the
"swap cache", we also didn't release the space in the actual swap space
bitmaps, and you're right, that certainly changes the accounting.

Mea culpa. Ideas?

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 19:14                 ` Andrea Arcangeli
@ 2001-04-06 19:03                   ` Hugh Dickins
  2001-04-06 20:03                     ` Andrea Arcangeli
  0 siblings, 1 reply; 33+ messages in thread
From: Hugh Dickins @ 2001-04-06 19:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Ben LaHaise, Rik van Riel, Richard Jerrrell,
	Stephen Tweedie, arjanv, alan, linux-mm

On Fri, 6 Apr 2001, Andrea Arcangeli wrote:
> On Fri, Apr 06, 2001 at 09:09:08PM +0200, Andrea Arcangeli wrote:
> > We always overstimate anyways, we have to because we don't have information
> > about the really freeable memory (think at the buffer cache pinned in the
> 
> ah, and btw, even if we would have information about the really freeable memory
> in the cache and swap cache that would still useless in real life because we
> don't reserve memory for the previous malloc calls (endless overcommit
> discussion), so allocation could still fail during page fault and process will
> have to be killed the linux way.

How indelicate you are!  I've been careful to avoid all mention of that...
Seriously, there's good debate to have there, but it's another issue
(and I don't think the drawbacks of the overcommit strategy excuse
sloppy accounting on its own terms).

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 18:57             ` Linus Torvalds
@ 2001-04-06 19:06               ` Rik van Riel
  0 siblings, 0 replies; 33+ messages in thread
From: Rik van Riel @ 2001-04-06 19:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Ben LaHaise, Richard Jerrrell, Stephen Tweedie,
	arjanv, alan, linux-mm

On Fri, 6 Apr 2001, Linus Torvalds wrote:
> On Fri, 6 Apr 2001, Hugh Dickins wrote:
> >
> > It is, of course, remotely conceivable that I'm confused, but...
> > I realize that the page cache pages (including those of swap)
> > are already added into "free" by vm_enough_memory().  But it's also
> > adding in nr_swap_pages
>
> .. Ahh. Right you are. We did not just move it from the "free
> page" to the "swap cache", we also didn't release the space in
> the actual swap space bitmaps, and you're right, that certainly
> changes the accounting.

>From all swap space, the following is available (in principle):
  - free swap
  - swapcached space  (except that we currently cannot reclaim it)

The only problem with this calculation is that it is also too
optimistic, since we've already counted all swapcached space as
free memory as well ...

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 18:37             ` Hugh Dickins
@ 2001-04-06 19:09               ` Andrea Arcangeli
  2001-04-06 18:53                 ` Hugh Dickins
  2001-04-06 19:14                 ` Andrea Arcangeli
  2001-04-06 19:12               ` Richard Jerrell
  2001-04-06 19:52               ` Linus Torvalds
  2 siblings, 2 replies; 33+ messages in thread
From: Andrea Arcangeli @ 2001-04-06 19:09 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Ben LaHaise, Rik van Riel, Richard Jerrrell,
	Stephen Tweedie, arjanv, alan, linux-mm

On Fri, Apr 06, 2001 at 07:37:55PM +0100, Hugh Dickins wrote:
> swapper_space.nrpages, that's neat, but I insist it's not right.
> You're then double counting into "free" all the swap cache pages
> (already included in page_cache_size) which correspond to pages
> of swap for running processes - erring in the opposite direction
> to the present code.

The whole point is that errirng in the opposite direction is perfectly fine,
that's expected.  Understimating is a bug instead. Period.

We always overstimate anyways, we have to because we don't have information
about the really freeable memory (think at the buffer cache pinned in the
superblock metadata of ext2, do you expect to be able to reclaim it somehow?).

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 18:37             ` Hugh Dickins
  2001-04-06 19:09               ` Andrea Arcangeli
@ 2001-04-06 19:12               ` Richard Jerrell
  2001-04-06 19:52               ` Linus Torvalds
  2 siblings, 0 replies; 33+ messages in thread
From: Richard Jerrell @ 2001-04-06 19:12 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Linus Torvalds, Ben LaHaise, Rik van Riel,
	Stephen Tweedie, arjanv, alan, linux-mm

> swapper_space.nrpages, that's neat, but I insist it's not right.
> You're then double counting into "free" all the swap cache pages
> (already included in page_cache_size) which correspond to pages
> of swap for running processes - erring in the opposite direction
> to the present code.

Right.  We still have the same problem.  If you count the swap cache pages
once, you are underestimating the free memory.  If you count them twice,
you are overcommitting memory.  What we would need is to check in
swap_free for swap_map[i] count of 1 and a page->count of 1.  The cell
references the page, and the page references the cell.  These pages are
freeable _twice_ because they are sitting on twice as much memory as they
should be.  I'd say that overestimating your memory is better than denying
allocation when it's possible.

Rich

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 19:09               ` Andrea Arcangeli
  2001-04-06 18:53                 ` Hugh Dickins
@ 2001-04-06 19:14                 ` Andrea Arcangeli
  2001-04-06 19:03                   ` Hugh Dickins
  1 sibling, 1 reply; 33+ messages in thread
From: Andrea Arcangeli @ 2001-04-06 19:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Ben LaHaise, Rik van Riel, Richard Jerrrell,
	Stephen Tweedie, arjanv, alan, linux-mm

On Fri, Apr 06, 2001 at 09:09:08PM +0200, Andrea Arcangeli wrote:
> We always overstimate anyways, we have to because we don't have information
> about the really freeable memory (think at the buffer cache pinned in the

ah, and btw, even if we would have information about the really freeable memory
in the cache and swap cache that would still useless in real life because we
don't reserve memory for the previous malloc calls (endless overcommit
discussion), so allocation could still fail during page fault and process will
have to be killed the linux way.

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 18:37             ` Hugh Dickins
  2001-04-06 19:09               ` Andrea Arcangeli
  2001-04-06 19:12               ` Richard Jerrell
@ 2001-04-06 19:52               ` Linus Torvalds
  2001-04-06 20:22                 ` Andrea Arcangeli
  2001-04-06 20:48                 ` Hugh Dickins
  2 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2001-04-06 19:52 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Ben LaHaise, Rik van Riel, Richard Jerrrell,
	Stephen Tweedie, arjanv, alan, linux-mm

On Fri, 6 Apr 2001, Hugh Dickins wrote:
>
> swapper_space.nrpages, that's neat, but I insist it's not right.

It's not "right", but I suspect it's actually good enough.

Also, note that when if get _really_ low on memory, the swap cache effect
should be going away: if we still have the swap cache pages in memory,
we've obviously not paged everything out yet. So the double accounting
should have a limit error of zero as we approach being truly low on
memory. And that, I suspect, is the most important thing - making sure
that we allow programs to run when they can, but at least having _some_
concept of "enough is enough".

Note that we should probably also have a small "negative" count: it might
not be a bad idea to say "we always want to have X MB free in _some_ form,
be it swap or RAM. So I don't think it would necessarily be wrong to say
something like

	free -= num_physpages >> 6;

to approximate the notion of "keep 1 percent slop" (remember, the 1% may
well be on the swap device, not actually kept as free memory).

vm_enough_memory() is a heuristic, nothing more. We want it to reflect
_some_ view of reality, but the Linux VM is _fundamentally_ based on the
notion of over-commit, and that won't change. vm_enough_memory() is only
meant to give a first-order appearance of not overcommitting wildly. It
has never been anything more than that.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 19:03                   ` Hugh Dickins
@ 2001-04-06 20:03                     ` Andrea Arcangeli
  0 siblings, 0 replies; 33+ messages in thread
From: Andrea Arcangeli @ 2001-04-06 20:03 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Ben LaHaise, Rik van Riel, Richard Jerrrell,
	Stephen Tweedie, arjanv, alan, linux-mm

On Fri, Apr 06, 2001 at 08:03:08PM +0100, Hugh Dickins wrote:
> On Fri, 6 Apr 2001, Andrea Arcangeli wrote:
> > On Fri, Apr 06, 2001 at 09:09:08PM +0200, Andrea Arcangeli wrote:
> > > We always overstimate anyways, we have to because we don't have information
> > > about the really freeable memory (think at the buffer cache pinned in the
> > 
> > ah, and btw, even if we would have information about the really freeable memory
> > in the cache and swap cache that would still useless in real life because we
> > don't reserve memory for the previous malloc calls (endless overcommit
> > discussion), so allocation could still fail during page fault and process will
> > have to be killed the linux way.
> 
> How indelicate you are!  I've been careful to avoid all mention of that...
> Seriously, there's good debate to have there, but it's another issue

This is not another issue. It is the same issue. If you don't do reservation
for the userspace memory you can only overstimate, if you understimate you are
wrong because you are not even able to use all the resources in your machine
and that is the current bug (while overstimating and overcommit may allow you
to better optimize the memory usage with the downside that you can have to kill
a task during a page fault and that is why it should be optional behaviour).

So I repeat: if you don't reserve the memory to make sure you will always be
able to allocate the malloced memory you just overstimate, and so the
vm_enough_memory automatically become a very imprecise guess, it's an heuristic
that is only meant to avoid ridicolous big allocations and that must _never_
understimate.

This is my last email about this issue.

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 19:52               ` Linus Torvalds
@ 2001-04-06 20:22                 ` Andrea Arcangeli
  2001-04-06 21:04                   ` Rik van Riel
  2001-04-09 18:16                   ` Alan Cox
  2001-04-06 20:48                 ` Hugh Dickins
  1 sibling, 2 replies; 33+ messages in thread
From: Andrea Arcangeli @ 2001-04-06 20:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Ben LaHaise, Rik van Riel, Richard Jerrrell,
	Stephen Tweedie, arjanv, alan, linux-mm

On Fri, Apr 06, 2001 at 12:52:26PM -0700, Linus Torvalds wrote:
> vm_enough_memory() is a heuristic, nothing more. We want it to reflect
> _some_ view of reality, but the Linux VM is _fundamentally_ based on the
> notion of over-commit, and that won't change. vm_enough_memory() is only
> meant to give a first-order appearance of not overcommitting wildly. It
> has never been anything more than that.

200% agreed.

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 19:52               ` Linus Torvalds
  2001-04-06 20:22                 ` Andrea Arcangeli
@ 2001-04-06 20:48                 ` Hugh Dickins
  1 sibling, 0 replies; 33+ messages in thread
From: Hugh Dickins @ 2001-04-06 20:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Ben LaHaise, Rik van Riel, Richard Jerrrell,
	Stephen Tweedie, arjanv, alan, linux-mm

On Fri, 6 Apr 2001, Linus Torvalds wrote:
> On Fri, 6 Apr 2001, Hugh Dickins wrote:
> >
> > swapper_space.nrpages, that's neat, but I insist it's not right.
> 
> It's not "right", but I suspect it's actually good enough.

Yes, even before Andrea's final email, I found myself warming to
his point of view.  As you both point out, vm_enough_pages() is
merely a heuristic for rejecting the impossible, and overcommit
laughs in the face of strict calculation here.  And it does a
better job than the comparison with infinity I was implying.

> Also, note that when if get _really_ low on memory, the swap cache effect
> should be going away: if we still have the swap cache pages in memory,
> we've obviously not paged everything out yet. So the double accounting
> should have a limit error of zero as we approach being truly low on
> memory. And that, I suspect, is the most important thing - making sure
> that we allow programs to run when they can, but at least having _some_
> concept of "enough is enough".

I've been rehearsing this same "limit error of zero" argument to
myself here, since your call for "Ideas?" which shut us up.  It's
plausible, I bet it's not strictly true, but it feels good enough:
we all know there are cases where it will go wrong, nothing new there.

But maybe a comment in vm_enough_pages() to make the false accounting
explicit?  And pace those who hate multiple returns, why add all those
pages every time, including call to nr_free_pages(), when most often
it can succeed right away?  I haven't got your "num_physpages >> 6"
in there: sounds very reasonable - but there's a 23/11/98 NJC comment
(omitted from mine below, since no such code recently) to suggest it
was tried once before, anyone remember why that was abandoned?

Hugh

int vm_enough_memory(long pages)
{
	/* Stupid algorithm to decide if we have enough memory: while
	 * simple, it hopefully works in most obvious cases.. Easy to
	 * fool it, but this should catch most mistakes.
	 */
	long free;
	
        /* Sometimes we want to use more memory than we have. */
	if (sysctl_overcommit_memory)
		return 1;

	free = nr_swap_pages;
	if (free > pages)
		return 1;
	free += atomic_read(&page_cache_size);
	if (free > pages)
		return 1;
	free += atomic_read(&buffermem_pages);
	if (free > pages)
		return 1;
	/*
	 * swapper_space.nrpages is the number of swap pages cached:
	 * they have already been included in page_cache_size, but
	 * this compensates for recently unmapped and freeable pages
	 * of swap not yet included in nr_swap_pages: when in doubt,
	 * let vm_enough_memory() err towards success.
	 */
	free += swapper_space.nrpages;
	if (free > pages)
		return 1;
	free += nr_free_pages();
	if (free > pages)
		return 1;
	/*
	 * The code below doesn't account for free space in the inode
	 * and dentry slab cache, slab cache fragmentation, inodes and
	 * dentries which will become freeable under VM load, etc.
	 * Lets just hope all these (complex) factors balance out...
	 */
	free += (dentry_stat.nr_unused * sizeof(struct dentry)) >> PAGE_SHIFT;
	free += (inodes_stat.nr_unused * sizeof(struct inode)) >> PAGE_SHIFT;

	return free > pages;
}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 20:22                 ` Andrea Arcangeli
@ 2001-04-06 21:04                   ` Rik van Riel
  2001-04-07  1:27                     ` Andrea Arcangeli
  2001-04-09 18:16                   ` Alan Cox
  1 sibling, 1 reply; 33+ messages in thread
From: Rik van Riel @ 2001-04-06 21:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Hugh Dickins, Ben LaHaise, Richard Jerrrell,
	Stephen Tweedie, arjanv, alan, linux-mm

On Fri, 6 Apr 2001, Andrea Arcangeli wrote:
> On Fri, Apr 06, 2001 at 12:52:26PM -0700, Linus Torvalds wrote:
> > vm_enough_memory() is a heuristic, nothing more. We want it to reflect
> > _some_ view of reality, but the Linux VM is _fundamentally_ based on the
> > notion of over-commit, and that won't change. vm_enough_memory() is only
> > meant to give a first-order appearance of not overcommitting wildly. It
> > has never been anything more than that.
>
> 200% agreed.

I don't think we should approximate THAT roughly ;))

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 21:04                   ` Rik van Riel
@ 2001-04-07  1:27                     ` Andrea Arcangeli
  0 siblings, 0 replies; 33+ messages in thread
From: Andrea Arcangeli @ 2001-04-07  1:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Hugh Dickins, Ben LaHaise, Richard Jerrrell,
	Stephen Tweedie, arjanv, alan, linux-mm

On Fri, Apr 06, 2001 at 06:04:58PM -0300, Rik van Riel wrote:
> On Fri, 6 Apr 2001, Andrea Arcangeli wrote:
> > On Fri, Apr 06, 2001 at 12:52:26PM -0700, Linus Torvalds wrote:
> > > vm_enough_memory() is a heuristic, nothing more. We want it to reflect
> > > _some_ view of reality, but the Linux VM is _fundamentally_ based on the
> > > notion of over-commit, and that won't change. vm_enough_memory() is only
> > > meant to give a first-order appearance of not overcommitting wildly. It
> > > has never been anything more than that.
> >
> > 200% agreed.
> 
> I don't think we should approximate THAT roughly ;))

;);)

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 20:22                 ` Andrea Arcangeli
  2001-04-06 21:04                   ` Rik van Riel
@ 2001-04-09 18:16                   ` Alan Cox
  2001-04-09 18:45                     ` Andrea Arcangeli
  2001-04-09 20:32                     ` Linus Torvalds
  1 sibling, 2 replies; 33+ messages in thread
From: Alan Cox @ 2001-04-09 18:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Hugh Dickins, Ben LaHaise, Rik van Riel,
	Richard Jerrrell, Stephen Tweedie, arjanv, alan, linux-mm

> On Fri, Apr 06, 2001 at 12:52:26PM -0700, Linus Torvalds wrote:
> > vm_enough_memory() is a heuristic, nothing more. We want it to reflect
> > _some_ view of reality, but the Linux VM is _fundamentally_ based on the
> > notion of over-commit, and that won't change. vm_enough_memory() is only
> > meant to give a first-order appearance of not overcommitting wildly. It
> > has never been anything more than that.
> 
> 200% agreed.

Given that strict address space management is not that hard would you 
accept patches to allow optional non-overcommit in 2.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-09 18:16                   ` Alan Cox
@ 2001-04-09 18:45                     ` Andrea Arcangeli
  2001-04-09 20:32                     ` Linus Torvalds
  1 sibling, 0 replies; 33+ messages in thread
From: Andrea Arcangeli @ 2001-04-09 18:45 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Hugh Dickins, Ben LaHaise, Rik van Riel,
	Richard Jerrrell, Stephen Tweedie, arjanv, linux-mm

On Mon, Apr 09, 2001 at 02:16:59PM -0400, Alan Cox wrote:
> > On Fri, Apr 06, 2001 at 12:52:26PM -0700, Linus Torvalds wrote:
> > > vm_enough_memory() is a heuristic, nothing more. We want it to reflect
> > > _some_ view of reality, but the Linux VM is _fundamentally_ based on the
> > > notion of over-commit, and that won't change. vm_enough_memory() is only
> > > meant to give a first-order appearance of not overcommitting wildly. It
> > > has never been anything more than that.
> > 
> > 200% agreed.
> 
> Given that strict address space management is not that hard would you 
> accept patches to allow optional non-overcommit in 2.5

Since we are not able to estimate how much cache is really freeable a simple
early implementation will have to shrink the cache at mmap time, instead of
page fault time and that should be acceptable to the people who needs it.

I suggest three modes:

1)	non overcommit (optional)
2)	2.4 default (default)
3)	vm_enough_memory always returns 1, equivalent to 2.4 with overcommit
	set to 1 (optional)

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-09 18:16                   ` Alan Cox
  2001-04-09 18:45                     ` Andrea Arcangeli
@ 2001-04-09 20:32                     ` Linus Torvalds
  2001-04-09 20:54                       ` David L. Parsley
  2001-04-10 21:07                       ` James Antill
  1 sibling, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2001-04-09 20:32 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andrea Arcangeli, Hugh Dickins, Ben LaHaise, Rik van Riel,
	Richard Jerrrell, Stephen Tweedie, arjanv, linux-mm

On Mon, 9 Apr 2001, Alan Cox wrote:
>
> Given that strict address space management is not that hard would you
> accept patches to allow optional non-overcommit in 2.5

I really doubt anybody wants to use a truly non-overcommit system.

It would basically imply counting every single vma that is privately
writable, and assuming it becomes totally non-shared.

Try this on your system as root:

	cat /proc/*/maps | grep ' .w.p '

and see how much memory that is.

On my machine, running X, that's about 53M with just a few windows open if
I did my script right. It grew to 159M when starting StarOffice.

(I'm oldfashioned, and not a perl person, so:

	cat /proc/*/maps |
		grep 'w.p ' |
		cut -d' ' -f1 |
		tr '-' ' ' |
		while read i  j; do export k=$(($k + 0x$j-0x$i)) ; echo $k; done

I haven't verified that it gets it right. And that's not counting the
really hardwired pages at all, only th epages that might be pageable).

It would disallow a lot of stuff that actually _does_ work in practice.

But maybe some people do want this. I agree that it shouldn't be
fundamentally hard to do accounting at the vma level.

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-09 20:32                     ` Linus Torvalds
@ 2001-04-09 20:54                       ` David L. Parsley
  2001-04-10 21:07                       ` James Antill
  1 sibling, 0 replies; 33+ messages in thread
From: David L. Parsley @ 2001-04-09 20:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Andrea Arcangeli, Hugh Dickins, Ben LaHaise,
	Rik van Riel, Richard Jerrrell, Stephen Tweedie, arjanv,
	linux-mm

Linus Torvalds wrote:
> 
> On Mon, 9 Apr 2001, Alan Cox wrote:
> >
> > Given that strict address space management is not that hard would you
> > accept patches to allow optional non-overcommit in 2.5
> 
> I really doubt anybody wants to use a truly non-overcommit system.

Eh, how about embedded developers.  Like, say, Transmeta. ;-)  Things
get real ugly when my X terminal runs out of RAM - I gotta think it
would be better for mallocs to just fail in userspace.

regards,
	David

-- 
David L. Parsley
Network Administrator
Roanoke College
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-09 20:32                     ` Linus Torvalds
  2001-04-09 20:54                       ` David L. Parsley
@ 2001-04-10 21:07                       ` James Antill
  2001-04-10 22:20                         ` Jeff Garzik
  1 sibling, 1 reply; 33+ messages in thread
From: James Antill @ 2001-04-10 21:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Andrea Arcangeli, Hugh Dickins, Ben LaHaise,
	Rik van Riel, Richard Jerrrell, Stephen Tweedie, arjanv,
	linux-mm

> On Mon, 9 Apr 2001, Alan Cox wrote:
> >
> > Given that strict address space management is not that hard would you
> > accept patches to allow optional non-overcommit in 2.5
> 
> I really doubt anybody wants to use a truly non-overcommit system.
> 
> It would basically imply counting every single vma that is privately
> writable, and assuming it becomes totally non-shared.
> 
> Try this on your system as root:
> 
> 	cat /proc/*/maps | grep ' .w.p '
> 
> and see how much memory that is.
> 
> On my machine, running X, that's about 53M with just a few windows open if
> I did my script right. It grew to 159M when starting StarOffice.

 Disk space is cheap(tm), in comparison to a couple of years ago I
have more disk space than $DIETY.

# cat /proc/swaps 
Filename			Type		Size	Used	Priority
/dev/hda3                       partition	979956	0	1
/dev/sda2                       partition	976888	21524	4
/dev/sdb1                       partition	976872	21452	4

 If I could have a sysctl for non-overcommit[1], I'd be pretty
happy. I'd imagine that a _lot_ of people in the server space would
prefer non-overcommit.

[1] Assuming that it doesn't kill performance by allocating non shared
mappings, or chunks of swap etc. Ie. it just knows that it can
allocate swap when it needs it later on.

-- 
# James Antill -- james@and.org
:0:
* ^From: .*james@and\.org
/dev/null
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-10 21:07                       ` James Antill
@ 2001-04-10 22:20                         ` Jeff Garzik
  0 siblings, 0 replies; 33+ messages in thread
From: Jeff Garzik @ 2001-04-10 22:20 UTC (permalink / raw)
  To: James Antill
  Cc: Linus Torvalds, Alan Cox, Andrea Arcangeli, Hugh Dickins,
	Ben LaHaise, Rik van Riel, Richard Jerrrell, Stephen Tweedie,
	arjanv, linux-mm

James Antill wrote:
> [1] Assuming that it doesn't kill performance by allocating non shared
> mappings, or chunks of swap etc. Ie. it just knows that it can
> allocate swap when it needs it later on.

Just FWIW...   from my VM-ignorant standpoint, it seems like for the
no-overcommit case "reserving" swap space is a much cheaper operation
than unconditionally allocating swap space....

-- 
Jeff Garzik       | Sam: "Mind if I drive?"
Building 1024     | Max: "Not if you don't mind me clawing at the dash
MandrakeSoft      |       and shrieking like a cheerleader."
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
  2001-04-06 20:20 Bulent Abali
@ 2001-04-06 20:33 ` Jeff Garzik
  0 siblings, 0 replies; 33+ messages in thread
From: Jeff Garzik @ 2001-04-06 20:33 UTC (permalink / raw)
  To: Bulent Abali; +Cc: Linus Torvalds, Rik van Riel, linux-mm

Bulent Abali wrote:
> By the way, disk space is cheap why not give more than 1 percent slop?
> This is really accounted in the swap space and not the memory.
> It will also help system out of oom_killer's radar.

Dumb question...  is the OOM killer accounting for icache and dcache
memory usage?

-- 
Jeff Garzik       | Sam: "Mind if I drive?"
Building 1024     | Max: "Not if you don't mind me clawing at the dash
MandrakeSoft      |       and shrieking like a cheerleader."
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] swap_state.c thinko
@ 2001-04-06 20:20 Bulent Abali
  2001-04-06 20:33 ` Jeff Garzik
  0 siblings, 1 reply; 33+ messages in thread
From: Bulent Abali @ 2001-04-06 20:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, linux-mm

>So I don't think it would necessarily be wrong to say
>something like
>
>    free -= num_physpages >> 6;
>
>to approximate the notion of "keep 1 percent slop" (remember, the 1% may
>well be on the swap device, not actually kept as free memory).

Hi,

I suggested the same thing to Rik but he rightfully said that it would
not work well for diskless (or swap-less) machines.  You may want to
consider the following instead.

     free -= (nr_swap_pages)? num_physpages >> 6 : 0;

By the way, disk space is cheap why not give more than 1 percent slop?
This is really accounted in the swap space and not the memory.
It will also help system out of oom_killer's radar.

Bulent Abali  (abali@us.ibm.com)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2001-04-10 22:20 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-04-05 15:56 [PATCH] swap_state.c thinko Ben LaHaise
2001-04-05 16:05 ` Rik van Riel
2001-04-05 17:11   ` Ben LaHaise
2001-04-05 23:40     ` Andrea Arcangeli
2001-04-06  0:32     ` Linus Torvalds
2001-04-06 16:31       ` Hugh Dickins
2001-04-06 17:21         ` Linus Torvalds
2001-04-06 18:23           ` Hugh Dickins
2001-04-06 18:57             ` Linus Torvalds
2001-04-06 19:06               ` Rik van Riel
2001-04-06 18:47           ` Andrea Arcangeli
2001-04-06 18:37             ` Hugh Dickins
2001-04-06 19:09               ` Andrea Arcangeli
2001-04-06 18:53                 ` Hugh Dickins
2001-04-06 19:14                 ` Andrea Arcangeli
2001-04-06 19:03                   ` Hugh Dickins
2001-04-06 20:03                     ` Andrea Arcangeli
2001-04-06 19:12               ` Richard Jerrell
2001-04-06 19:52               ` Linus Torvalds
2001-04-06 20:22                 ` Andrea Arcangeli
2001-04-06 21:04                   ` Rik van Riel
2001-04-07  1:27                     ` Andrea Arcangeli
2001-04-09 18:16                   ` Alan Cox
2001-04-09 18:45                     ` Andrea Arcangeli
2001-04-09 20:32                     ` Linus Torvalds
2001-04-09 20:54                       ` David L. Parsley
2001-04-10 21:07                       ` James Antill
2001-04-10 22:20                         ` Jeff Garzik
2001-04-06 20:48                 ` Hugh Dickins
2001-04-05 17:21 ` Hugh Dickins
2001-04-05 21:39   ` Richard Jerrell
2001-04-06 20:20 Bulent Abali
2001-04-06 20:33 ` Jeff Garzik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox