linux-mm.kvack.org archive mirror
* [PATCH 0/8] mm: from gup to vmscan
@ 2008-11-23 21:53 Hugh Dickins
  2008-11-23 21:55 ` [PATCH 1/8] mm: gup persist for write permission Hugh Dickins
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Hugh Dickins @ 2008-11-23 21:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, linux-mm

Here's a batch of 8 mm patches, intended for 2.6.29: a narrative
beginning in __get_user_pages and do_wp_page, centering on swap
cache freeing, and ending in shrink_page_list and add_to_swap.

Though most of the testing has been against 2.6.28-rc5 and its
precursors, these patches are diffed to slot into the mmotm series
after my 8 cleanups, just before "mmend".  No patch clashes with anything
later in the series, but 3/8 needs an accompanying fix to
memcg-memswap-controller-core.patch.

 Documentation/vm/unevictable-lru.txt |   63 ++++------------
 include/linux/swap.h                 |   21 ++---
 mm/memory.c                          |   31 ++++++--
 mm/page_io.c                         |    2 
 mm/swap.c                            |    3 
 mm/swap_state.c                      |   12 +--
 mm/swapfile.c                        |   96 ++++++++-----------------
 mm/vmscan.c                          |   17 +---
 8 files changed, 96 insertions(+), 149 deletions(-)


* [PATCH 1/8] mm: gup persist for write permission
  2008-11-23 21:53 [PATCH 0/8] mm: from gup to vmscan Hugh Dickins
@ 2008-11-23 21:55 ` Hugh Dickins
  2008-11-23 21:56 ` [PATCH 2/8] mm: wp lock page before deciding cow Hugh Dickins
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Hugh Dickins @ 2008-11-23 21:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, Robin Holt, linux-mm

do_wp_page()'s VM_FAULT_WRITE return value tells __get_user_pages() that
COW has been done if necessary, though it may be leaving the pte without
write permission - for the odd case of forced writing to a readonly vma
for ptrace.  At present GUP then retries the follow_page() without asking
for write permission, to escape an endless loop when forced.

But an application may be relying on GUP to guarantee a writable page
which won't be COWed again when written from userspace, whereas a race
here might leave a readonly pte in place.  Change the VM_FAULT_WRITE
handling to keep asking follow_page() for write permission, except in
that odd case of forced writing to a readonly vma.
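
For readers without the surrounding code to hand, here is a much-simplified
sketch of the 2.6.28-era loop in __get_user_pages() that the hunk below
changes (error handling and fault accounting elided):

	while (!(page = follow_page(vma, start, foll_flags))) {
		int ret = handle_mm_fault(mm, vma, start,
					  foll_flags & FOLL_WRITE);
		/* VM_FAULT_ERROR and min_flt/maj_flt handling elided */

		/*
		 * do_wp_page has broken COW if it had to; but only stop
		 * demanding pte_write when looping for it is futile,
		 * i.e. forced write to a readonly vma (ptrace).
		 */
		if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
			foll_flags &= ~FOLL_WRITE;

		cond_resched();
	}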

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 mm/memory.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

--- swapfree0/mm/memory.c	2008-11-19 15:26:28.000000000 +0000
+++ swapfree1/mm/memory.c	2008-11-21 18:50:41.000000000 +0000
@@ -1251,9 +1251,15 @@ int __get_user_pages(struct task_struct 
 				 * do_wp_page has broken COW when necessary,
 				 * even if maybe_mkwrite decided not to set
 				 * pte_write. We can thus safely do subsequent
-				 * page lookups as if they were reads.
+				 * page lookups as if they were reads. But only
+				 * do so when looping for pte_write is futile:
+				 * in some cases userspace may also be wanting
+				 * to write to the gotten user page, which a
+				 * read fault here might prevent (a readonly
+				 * page might get reCOWed by userspace write).
 				 */
-				if (ret & VM_FAULT_WRITE)
+				if ((ret & VM_FAULT_WRITE) &&
+				    !(vma->vm_flags & VM_WRITE))
 					foll_flags &= ~FOLL_WRITE;
 
 				cond_resched();


* [PATCH 2/8] mm: wp lock page before deciding cow
  2008-11-23 21:53 [PATCH 0/8] mm: from gup to vmscan Hugh Dickins
  2008-11-23 21:55 ` [PATCH 1/8] mm: gup persist for write permission Hugh Dickins
@ 2008-11-23 21:56 ` Hugh Dickins
  2008-11-23 21:58 ` [PATCH 3/8] mm: reuse_swap_page replaces can_share_swap_page Hugh Dickins
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Hugh Dickins @ 2008-11-23 21:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, Robin Holt, linux-mm

An application may rely on get_user_pages() to give it pages writable
from userspace and shared with a driver, GUP breaking COW if necessary.
It may mprotect() the pages' writability, off and on, from time to time.

Normally this works fine (so long as the app does not fork); but just
occasionally, under memory pressure, a readonly pte in a newly writable
area is COWed unnecessarily, breaking the link with the driver: because
do_wp_page() does trylock_page, and falls back to COW whenever that fails.

For reliable behaviour in the unshared case, when the trylock_page() fails,
now unlock the page table, lock the page and relock the page table, before
deciding whether Copy-On-Write is really necessary.
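
For illustration, a rough user-space sketch of the kind of usage described
above (the device, ioctl and length are hypothetical; only the pattern of
pinning then toggling mprotect() matters):

	int fd = open("/dev/mydrv", O_RDWR);		/* hypothetical driver */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	ioctl(fd, MYDRV_PIN_BUF, buf);	/* driver does get_user_pages(write=1) */

	mprotect(buf, len, PROT_READ);			/* writability off... */
	mprotect(buf, len, PROT_READ | PROT_WRITE);	/* ...and on again */
	buf[0] = 1;	/* must hit the page the driver pinned, not a COW copy */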

Reported-by: Zhou Yingchao
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
This is not the patch I posted in June: in the end I decided it better just
to relock the page as Nick suggested, rather than impose subtle ordering
constraints elsewhere; and I also realized that page migration spoilt my
optimizations.

 mm/memory.c |   17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

--- swapfree1/mm/memory.c	2008-11-21 18:50:41.000000000 +0000
+++ swapfree2/mm/memory.c	2008-11-21 18:50:43.000000000 +0000
@@ -1819,10 +1819,21 @@ static int do_wp_page(struct mm_struct *
 	 * not dirty accountable.
 	 */
 	if (PageAnon(old_page)) {
-		if (trylock_page(old_page)) {
-			reuse = can_share_swap_page(old_page);
-			unlock_page(old_page);
+		if (!trylock_page(old_page)) {
+			page_cache_get(old_page);
+			pte_unmap_unlock(page_table, ptl);
+			lock_page(old_page);
+			page_table = pte_offset_map_lock(mm, pmd, address,
+							 &ptl);
+			if (!pte_same(*page_table, orig_pte)) {
+				unlock_page(old_page);
+				page_cache_release(old_page);
+				goto unlock;
+			}
+			page_cache_release(old_page);
 		}
+		reuse = can_share_swap_page(old_page);
+		unlock_page(old_page);
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 					(VM_WRITE|VM_SHARED))) {
 		/*


* [PATCH 3/8] mm: reuse_swap_page replaces can_share_swap_page
  2008-11-23 21:53 [PATCH 0/8] mm: from gup to vmscan Hugh Dickins
  2008-11-23 21:55 ` [PATCH 1/8] mm: gup persist for write permission Hugh Dickins
  2008-11-23 21:56 ` [PATCH 2/8] mm: wp lock page before deciding cow Hugh Dickins
@ 2008-11-23 21:58 ` Hugh Dickins
  2008-11-23 22:11   ` [PATCH] memcg: memswap controller core swapcache fixes Hugh Dickins
  2008-11-23 22:43   ` [PATCH 3/8] mm: reuse_swap_page replaces can_share_swap_page Rik van Riel
  2008-11-23 22:00 ` [PATCH 4/8] mm: try_to_free_swap replaces remove_exclusive_swap_page Hugh Dickins
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 24+ messages in thread
From: Hugh Dickins @ 2008-11-23 21:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Rik van Riel, Lee Schermerhorn, KAMEZAWA Hiroyuki, linux-mm

A good place to free up old swap is where do_wp_page(), or do_swap_page(),
is about to redirty the page: the data on disk is then stale and won't be
read again; and if we do decide to write the page out later, using the
previous swap location makes an unnecessary disk seek very likely.

So give can_share_swap_page() the side-effect of delete_from_swap_cache()
when it safely can.  And can_share_swap_page() was always a misleading
name, the more so if it has a side-effect: rename it reuse_swap_page().

Irrelevant cleanup nearby: remove swap_token_default_timeout definition
from swap.h: it's used nowhere.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
I'm going to have to ask for your indulgence on this one: please would
you put it in -mm on a trial basis?  There's nothing incorrect about it
that I know of, but I'm having difficulty understanding its performance,
and need to get on with sending in the rest of my patches.

What I'd intended to report is that it cuts 30% off the elapsed time of
a test which does random writes into 600MB on a machine with 512MB RAM and
plenty of swap.  That's roughly consistent across different machines
with mem=512M.  (It does nothing much for my swapping load tests, which
may be too close to thrashing to notice.)  I took it for granted that it
would do nothing much for a similar test which writes sequentially into
600MB given 512MB RAM: but I was wrong, it adds 30% and I don't get why.

I also took it for granted that it would do nothing much for swapping to
SD card: wrong again, it adds 30% to the sequential test, but speeds up
the random test by a factor of... 25 times at first, but then later
measurements showed "only" a factor 7.

It's usually like this when I try to come up with numbers: I get lost in
them!  If the patch is in -mm for a while, maybe someone else can make
better sense of it: I'll come back to study what's going on later.

 include/linux/swap.h |    6 ++----
 mm/memory.c          |    4 ++--
 mm/swapfile.c        |   15 +++++++++++----
 3 files changed, 15 insertions(+), 10 deletions(-)

--- swapfree2/include/linux/swap.h	2008-11-19 15:26:28.000000000 +0000
+++ swapfree3/include/linux/swap.h	2008-11-21 18:50:47.000000000 +0000
@@ -307,7 +307,7 @@ extern unsigned int count_swap_pages(int
 extern sector_t map_swap_page(struct swap_info_struct *, pgoff_t);
 extern sector_t swapdev_block(int, pgoff_t);
 extern struct swap_info_struct *get_swap_info_struct(unsigned);
-extern int can_share_swap_page(struct page *);
+extern int reuse_swap_page(struct page *);
 extern int remove_exclusive_swap_page(struct page *);
 extern int remove_exclusive_swap_page_ref(struct page *);
 struct backing_dev_info;
@@ -375,8 +375,6 @@ static inline struct page *lookup_swap_c
 	return NULL;
 }
 
-#define can_share_swap_page(p)			(page_mapcount(p) == 1)
-
 static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
 							gfp_t gfp_mask)
 {
@@ -391,7 +389,7 @@ static inline void delete_from_swap_cach
 {
 }
 
-#define swap_token_default_timeout		0
+#define reuse_swap_page(page)	(page_mapcount(page) == 1)
 
 static inline int remove_exclusive_swap_page(struct page *p)
 {
--- swapfree2/mm/memory.c	2008-11-21 18:50:43.000000000 +0000
+++ swapfree3/mm/memory.c	2008-11-21 18:50:48.000000000 +0000
@@ -1832,7 +1832,7 @@ static int do_wp_page(struct mm_struct *
 			}
 			page_cache_release(old_page);
 		}
-		reuse = can_share_swap_page(old_page);
+		reuse = reuse_swap_page(old_page);
 		unlock_page(old_page);
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 					(VM_WRITE|VM_SHARED))) {
@@ -2363,7 +2363,7 @@ static int do_swap_page(struct mm_struct
 
 	inc_mm_counter(mm, anon_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if (write_access && can_share_swap_page(page)) {
+	if (write_access && reuse_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		write_access = 0;
 	}
--- swapfree2/mm/swapfile.c	2008-11-19 15:26:26.000000000 +0000
+++ swapfree3/mm/swapfile.c	2008-11-21 18:50:48.000000000 +0000
@@ -326,17 +326,24 @@ static inline int page_swapcount(struct 
 }
 
 /*
- * We can use this swap cache entry directly
- * if there are no other references to it.
+ * We can write to an anon page without COW if there are no other references
+ * to it.  And as a side-effect, free up its swap: because the old content
+ * on disk will never be read, and seeking back there to write new content
+ * later would only waste time away from clustering.
  */
-int can_share_swap_page(struct page *page)
+int reuse_swap_page(struct page *page)
 {
 	int count;
 
 	VM_BUG_ON(!PageLocked(page));
 	count = page_mapcount(page);
-	if (count <= 1 && PageSwapCache(page))
+	if (count <= 1 && PageSwapCache(page)) {
 		count += page_swapcount(page);
+		if (count == 1 && !PageWriteback(page)) {
+			delete_from_swap_cache(page);
+			SetPageDirty(page);
+		}
+	}
 	return count == 1;
 }
 


* [PATCH 4/8] mm: try_to_free_swap replaces remove_exclusive_swap_page
  2008-11-23 21:53 [PATCH 0/8] mm: from gup to vmscan Hugh Dickins
                   ` (2 preceding siblings ...)
  2008-11-23 21:58 ` [PATCH 3/8] mm: reuse_swap_page replaces can_share_swap_page Hugh Dickins
@ 2008-11-23 22:00 ` Hugh Dickins
  2008-11-23 22:01 ` [PATCH 5/8] mm: try_to_unuse check removing right swap Hugh Dickins
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Hugh Dickins @ 2008-11-23 22:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, Rik van Riel, Lee Schermerhorn, linux-mm

It doesn't matter if someone else has a reference to the page (raised
page_count); it doesn't matter if the page is mapped into userspace
(raised page_mapcount - though that hints it may be worth keeping the
swap): all that matters is that there be no more references to the swap
(and no writeback in progress).

swapoff (try_to_unuse) has been removing pages from swapcache for years,
with no concern for page count or page mapcount, and we used to have a
comment in lookup_swap_cache() recognizing that: if you go for a page
of swapcache, you'll get the right page, but it could have been removed
from swapcache by the time you get page lock.

So, give up asking for exclusivity: get rid of remove_exclusive_swap_page(),
and remove_exclusive_swap_page_ref() and remove_exclusive_swap_page_count()
which were spawned for the recent LRU work: replace them by the simpler
try_to_free_swap() which just checks page_swapcount().

Similarly, remove the page_count limitation from free_swap_and_cache(),
but assume that it's worth holding on to the swap if the page is mapped and
swap is nowhere near full.  Add a vm_swap_full() test in free_swap_cache()?
It would be consistent, but I think we probably have enough for now.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 include/linux/swap.h |   10 +----
 mm/memory.c          |    2 -
 mm/page_io.c         |    2 -
 mm/swap.c            |    3 -
 mm/swap_state.c      |    8 ++--
 mm/swapfile.c        |   70 +++++++----------------------------------
 mm/vmscan.c          |    2 -
 7 files changed, 22 insertions(+), 75 deletions(-)

--- swapfree3/include/linux/swap.h	2008-11-21 18:50:47.000000000 +0000
+++ swapfree4/include/linux/swap.h	2008-11-21 18:50:50.000000000 +0000
@@ -308,8 +308,7 @@ extern sector_t map_swap_page(struct swa
 extern sector_t swapdev_block(int, pgoff_t);
 extern struct swap_info_struct *get_swap_info_struct(unsigned);
 extern int reuse_swap_page(struct page *);
-extern int remove_exclusive_swap_page(struct page *);
-extern int remove_exclusive_swap_page_ref(struct page *);
+extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 
 /* linux/mm/thrash.c */
@@ -391,12 +390,7 @@ static inline void delete_from_swap_cach
 
 #define reuse_swap_page(page)	(page_mapcount(page) == 1)
 
-static inline int remove_exclusive_swap_page(struct page *p)
-{
-	return 0;
-}
-
-static inline int remove_exclusive_swap_page_ref(struct page *page)
+static inline int try_to_free_swap(struct page *page)
 {
 	return 0;
 }
--- swapfree3/mm/memory.c	2008-11-21 18:50:48.000000000 +0000
+++ swapfree4/mm/memory.c	2008-11-21 18:50:50.000000000 +0000
@@ -2374,7 +2374,7 @@ static int do_swap_page(struct mm_struct
 
 	swap_free(entry);
 	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
-		remove_exclusive_swap_page(page);
+		try_to_free_swap(page);
 	unlock_page(page);
 
 	if (write_access) {
--- swapfree3/mm/page_io.c	2008-11-19 15:26:26.000000000 +0000
+++ swapfree4/mm/page_io.c	2008-11-21 18:50:50.000000000 +0000
@@ -98,7 +98,7 @@ int swap_writepage(struct page *page, st
 	struct bio *bio;
 	int ret = 0, rw = WRITE;
 
-	if (remove_exclusive_swap_page(page)) {
+	if (try_to_free_swap(page)) {
 		unlock_page(page);
 		goto out;
 	}
--- swapfree3/mm/swap.c	2008-11-19 15:26:28.000000000 +0000
+++ swapfree4/mm/swap.c	2008-11-21 18:50:50.000000000 +0000
@@ -467,8 +467,7 @@ void pagevec_swap_free(struct pagevec *p
 		struct page *page = pvec->pages[i];
 
 		if (PageSwapCache(page) && trylock_page(page)) {
-			if (PageSwapCache(page))
-				remove_exclusive_swap_page_ref(page);
+			try_to_free_swap(page);
 			unlock_page(page);
 		}
 	}
--- swapfree3/mm/swap_state.c	2008-11-19 15:26:26.000000000 +0000
+++ swapfree4/mm/swap_state.c	2008-11-21 18:50:50.000000000 +0000
@@ -195,14 +195,14 @@ void delete_from_swap_cache(struct page 
  * If we are the only user, then try to free up the swap cache. 
  * 
  * Its ok to check for PageSwapCache without the page lock
- * here because we are going to recheck again inside 
- * exclusive_swap_page() _with_ the lock. 
+ * here because we are going to recheck again inside
+ * try_to_free_swap() _with_ the lock.
  * 					- Marcelo
  */
 static inline void free_swap_cache(struct page *page)
 {
-	if (PageSwapCache(page) && trylock_page(page)) {
-		remove_exclusive_swap_page(page);
+	if (PageSwapCache(page) && !page_mapped(page) && trylock_page(page)) {
+		try_to_free_swap(page);
 		unlock_page(page);
 	}
 }
--- swapfree3/mm/swapfile.c	2008-11-21 18:50:48.000000000 +0000
+++ swapfree4/mm/swapfile.c	2008-11-21 18:50:50.000000000 +0000
@@ -348,68 +348,23 @@ int reuse_swap_page(struct page *page)
 }
 
 /*
- * Work out if there are any other processes sharing this
- * swap cache page. Free it if you can. Return success.
+ * If swap is getting full, or if there are no more mappings of this page,
+ * then try_to_free_swap is called to free its swap space.
  */
-static int remove_exclusive_swap_page_count(struct page *page, int count)
+int try_to_free_swap(struct page *page)
 {
-	int retval;
-	struct swap_info_struct * p;
-	swp_entry_t entry;
-
 	VM_BUG_ON(!PageLocked(page));
 
 	if (!PageSwapCache(page))
 		return 0;
 	if (PageWriteback(page))
 		return 0;
-	if (page_count(page) != count) /* us + cache + ptes */
-		return 0;
-
-	entry.val = page_private(page);
-	p = swap_info_get(entry);
-	if (!p)
+	if (page_swapcount(page))
 		return 0;
 
-	/* Is the only swap cache user the cache itself? */
-	retval = 0;
-	if (p->swap_map[swp_offset(entry)] == 1) {
-		/* Recheck the page count with the swapcache lock held.. */
-		spin_lock_irq(&swapper_space.tree_lock);
-		if ((page_count(page) == count) && !PageWriteback(page)) {
-			__delete_from_swap_cache(page);
-			SetPageDirty(page);
-			retval = 1;
-		}
-		spin_unlock_irq(&swapper_space.tree_lock);
-	}
-	spin_unlock(&swap_lock);
-
-	if (retval) {
-		swap_free(entry);
-		page_cache_release(page);
-	}
-
-	return retval;
-}
-
-/*
- * Most of the time the page should have two references: one for the
- * process and one for the swap cache.
- */
-int remove_exclusive_swap_page(struct page *page)
-{
-	return remove_exclusive_swap_page_count(page, 2);
-}
-
-/*
- * The pageout code holds an extra reference to the page.  That raises
- * the reference count to test for to 2 for a page that is only in the
- * swap cache plus 1 for each process that maps the page.
- */
-int remove_exclusive_swap_page_ref(struct page *page)
-{
-	return remove_exclusive_swap_page_count(page, 2 + page_mapcount(page));
+	delete_from_swap_cache(page);
+	SetPageDirty(page);
+	return 1;
 }
 
 /*
@@ -436,13 +391,12 @@ void free_swap_and_cache(swp_entry_t ent
 		spin_unlock(&swap_lock);
 	}
 	if (page) {
-		int one_user;
-
-		one_user = (page_count(page) == 2);
-		/* Only cache user (+us), or swap space full? Free it! */
-		/* Also recheck PageSwapCache after page is locked (above) */
+		/*
+		 * Not mapped elsewhere, or swap space full? Free it!
+		 * Also recheck PageSwapCache now page is locked (above).
+		 */
 		if (PageSwapCache(page) && !PageWriteback(page) &&
-					(one_user || vm_swap_full())) {
+				(!page_mapped(page) || vm_swap_full())) {
 			delete_from_swap_cache(page);
 			SetPageDirty(page);
 		}
--- swapfree3/mm/vmscan.c	2008-11-19 15:26:13.000000000 +0000
+++ swapfree4/mm/vmscan.c	2008-11-21 18:50:50.000000000 +0000
@@ -805,7 +805,7 @@ cull_mlocked:
 activate_locked:
 		/* Not a candidate for swapping, so reclaim swap space. */
 		if (PageSwapCache(page) && vm_swap_full())
-			remove_exclusive_swap_page_ref(page);
+			try_to_free_swap(page);
 		VM_BUG_ON(PageActive(page));
 		SetPageActive(page);
 keep_locked:


* [PATCH 5/8] mm: try_to_unuse check removing right swap
  2008-11-23 21:53 [PATCH 0/8] mm: from gup to vmscan Hugh Dickins
                   ` (3 preceding siblings ...)
  2008-11-23 22:00 ` [PATCH 4/8] mm: try_to_free_swap replaces remove_exclusive_swap_page Hugh Dickins
@ 2008-11-23 22:01 ` Hugh Dickins
  2008-11-23 22:03 ` [PATCH 6/8] mm: remove try_to_munlock from vmscan Hugh Dickins
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Hugh Dickins @ 2008-11-23 22:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, linux-mm

There's a possible race in try_to_unuse() which Nick Piggin led me to two
years ago.  Where it does lock_page() after read_swap_cache_async(), what
if another task removed that page from swapcache just before we locked it?

It would sail through the (*swap_map > 1) tests doing nothing (because it
could not have been removed from swapcache before its swap references were
gone), until it reaches the delete_from_swap_cache(page) near the bottom.

Now imagine that this page has been allocated to swap on a different swap
area while we dropped page lock (perhaps at the top, perhaps in unuse_mm):
we could wrongly remove from swap cache before the page has been written
to swap, so a subsequent do_swap_page() would read in stale data from swap.

I think this case could not happen before: remove_exclusive_swap_page()
refused while page count was raised.  But now with reuse_swap_page() and
try_to_free_swap() removing from swap cache without minding page count,
I think it could happen - the previous patch argued that it was safe
because try_to_unuse() already ignored page count, but overlooked that
it might be breaking the assumptions in try_to_unuse() itself.
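
To spell out one possible interleaving (a sketch only, following the
description above; the page lock could equally be dropped and retaken
later, in unuse_mm):

	/*
	 *   try_to_unuse (swap area A)      another task
	 *   --------------------------      ------------------------------
	 *   read_swap_cache_async(entry)
	 *                                   faults the page in; last swap
	 *                                   reference on A goes, page is
	 *                                   removed from swap cache
	 *                                   page reclaimed again, added to
	 *                                   swap cache on swap area B,
	 *                                   not yet written out
	 *   lock_page(page)
	 *   (*swap_map > 1) tests do nothing
	 *   delete_from_swap_cache(page) <- without the new check, B's copy
	 *                                   is discarded before the data
	 *                                   ever reached the disk
	 */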

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 mm/swapfile.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

--- swapfree4/mm/swapfile.c	2008-11-21 18:50:50.000000000 +0000
+++ swapfree5/mm/swapfile.c	2008-11-21 18:50:59.000000000 +0000
@@ -889,7 +889,16 @@ static int try_to_unuse(unsigned int typ
 			lock_page(page);
 			wait_on_page_writeback(page);
 		}
-		if (PageSwapCache(page))
+
+		/*
+		 * It is conceivable that a racing task removed this page from
+		 * swap cache just before we acquired the page lock at the top,
+		 * or while we dropped it in unuse_mm().  The page might even
+		 * be back in swap cache on another swap area: that we must not
+		 * delete, since it may not have been written out to swap yet.
+		 */
+		if (PageSwapCache(page) &&
+		    likely(page_private(page) == entry.val))
 			delete_from_swap_cache(page);
 
 		/*


* [PATCH 6/8] mm: remove try_to_munlock from vmscan
  2008-11-23 21:53 [PATCH 0/8] mm: from gup to vmscan Hugh Dickins
                   ` (4 preceding siblings ...)
  2008-11-23 22:01 ` [PATCH 5/8] mm: try_to_unuse check removing right swap Hugh Dickins
@ 2008-11-23 22:03 ` Hugh Dickins
  2008-11-23 22:53   ` Rik van Riel
  2008-11-24 17:34   ` Lee Schermerhorn
  2008-11-23 22:05 ` [PATCH 7/8] mm: remove gfp_mask from add_to_swap Hugh Dickins
  2008-11-23 22:07 ` [PATCH 8/8] mm: add add_to_swap stub Hugh Dickins
  7 siblings, 2 replies; 24+ messages in thread
From: Hugh Dickins @ 2008-11-23 22:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, Rik van Riel, Lee Schermerhorn, linux-mm

An unfortunate feature of the Unevictable LRU work was that reclaiming an
anonymous page involved an extra scan through the anon_vma: to check that
the page is evictable before allocating swap, because the swap could not
be freed reliably soon afterwards.

Now try_to_free_swap() has replaced remove_exclusive_swap_page(), that's
not an issue any more: remove try_to_munlock() call from shrink_page_list(),
leaving it to try_to_unmap() to discover if the page is one to be culled
to the unevictable list - in which case then try_to_free_swap().

Update unevictable-lru.txt to remove comments on the try_to_munlock()
in shrink_page_list(), and shorten some lines over 80 columns.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
I've not tested this against whatever test showed the need for that
try_to_munlock() in shrink_page_list() in the first place.  Rik or Lee,
please, would you have the time to run that test on the next -mm that has
this patch in, to check that I've not messed things up?  Alternatively,
please point me to such a test - but I think you've been targeting
larger machines than I have access to - thanks.

 Documentation/vm/unevictable-lru.txt |   63 +++++++------------------
 mm/vmscan.c                          |   11 ----
 2 files changed, 20 insertions(+), 54 deletions(-)

--- swapfree5/Documentation/vm/unevictable-lru.txt	2008-10-24 09:27:37.000000000 +0100
+++ swapfree6/Documentation/vm/unevictable-lru.txt	2008-11-21 18:51:01.000000000 +0000
@@ -137,13 +137,6 @@ shrink_page_list() where they will be de
 map in try_to_unmap().  If try_to_unmap() returns SWAP_MLOCK, shrink_page_list()
 will cull the page at that point.
 
-Note that for anonymous pages, shrink_page_list() attempts to add the page to
-the swap cache before it tries to unmap the page.  To avoid this unnecessary
-consumption of swap space, shrink_page_list() calls try_to_munlock() to check
-whether any VM_LOCKED vmas map the page without attempting to unmap the page.
-If try_to_munlock() returns SWAP_MLOCK, shrink_page_list() will cull the page
-without consuming swap space.  try_to_munlock() will be described below.
-
 To "cull" an unevictable page, vmscan simply puts the page back on the lru
 list using putback_lru_page()--the inverse operation to isolate_lru_page()--
 after dropping the page lock.  Because the condition which makes the page
@@ -190,8 +183,8 @@ several places:
    in the VM_LOCKED flag being set for the vma.
 3) in the fault path, if mlocked pages are "culled" in the fault path,
    and when a VM_LOCKED stack segment is expanded.
-4) as mentioned above, in vmscan:shrink_page_list() with attempting to
-   reclaim a page in a VM_LOCKED vma--via try_to_unmap() or try_to_munlock().
+4) as mentioned above, in vmscan:shrink_page_list() when attempting to
+   reclaim a page in a VM_LOCKED vma via try_to_unmap().
 
 Mlocked pages become unlocked and rescued from the unevictable list when:
 
@@ -260,9 +253,9 @@ mlock_fixup() filters several classes of
 
 2) vmas mapping hugetlbfs page are already effectively pinned into memory.
    We don't need nor want to mlock() these pages.  However, to preserve the
-   prior behavior of mlock()--before the unevictable/mlock changes--mlock_fixup()
-   will call make_pages_present() in the hugetlbfs vma range to allocate the
-   huge pages and populate the ptes.
+   prior behavior of mlock()--before the unevictable/mlock changes--
+   mlock_fixup() will call make_pages_present() in the hugetlbfs vma range
+   to allocate the huge pages and populate the ptes.
 
 3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of
    kernel pages, such as the vdso page, relay channel pages, etc.  These pages
@@ -322,7 +315,7 @@ __mlock_vma_pages_range()--the same func
 passing a flag to indicate that munlock() is being performed.
 
 Because the vma access protections could have been changed to PROT_NONE after
-faulting in and mlocking some pages, get_user_pages() was unreliable for visiting
+faulting in and mlocking pages, get_user_pages() was unreliable for visiting
 these pages for munlocking.  Because we don't want to leave pages mlocked(),
 get_user_pages() was enhanced to accept a flag to ignore the permissions when
 fetching the pages--all of which should be resident as a result of previous
@@ -416,8 +409,8 @@ Mlocked Pages:  munmap()/exit()/exec() S
 When unmapping an mlocked region of memory, whether by an explicit call to
 munmap() or via an internal unmap from exit() or exec() processing, we must
 munlock the pages if we're removing the last VM_LOCKED vma that maps the pages.
-Before the unevictable/mlock changes, mlocking did not mark the pages in any way,
-so unmapping them required no processing.
+Before the unevictable/mlock changes, mlocking did not mark the pages in any
+way, so unmapping them required no processing.
 
 To munlock a range of memory under the unevictable/mlock infrastructure, the
 munmap() hander and task address space tear down function call
@@ -517,12 +510,10 @@ couldn't be mlocked.
 Mlocked pages:  try_to_munlock() Reverse Map Scan
 
 TODO/FIXME:  a better name might be page_mlocked()--analogous to the
-page_referenced() reverse map walker--especially if we continue to call this
-from shrink_page_list().  See related TODO/FIXME below.
+page_referenced() reverse map walker.
 
-When munlock_vma_page()--see "Mlocked Pages:  munlock()/munlockall() System
-Call Handling" above--tries to munlock a page, or when shrink_page_list()
-encounters an anonymous page that is not yet in the swap cache, they need to
+When munlock_vma_page()--see "Mlocked Pages:  munlock()/munlockall()
+System Call Handling" above--tries to munlock a page, it needs to
 determine whether or not the page is mapped by any VM_LOCKED vma, without
 actually attempting to unmap all ptes from the page.  For this purpose, the
 unevictable/mlock infrastructure introduced a variant of try_to_unmap() called
@@ -535,10 +526,7 @@ for VM_LOCKED vmas.  When such a vma is 
 pages mapped in linear VMAs, as in the try_to_unmap() case, the functions
 attempt to acquire the associated mmap semphore, mlock the page via
 mlock_vma_page() and return SWAP_MLOCK.  This effectively undoes the
-pre-clearing of the page's PG_mlocked done by munlock_vma_page() and informs
-shrink_page_list() that the anonymous page should be culled rather than added
-to the swap cache in preparation for a try_to_unmap() that will almost
-certainly fail.
+pre-clearing of the page's PG_mlocked done by munlock_vma_page.
 
 If try_to_unmap() is unable to acquire a VM_LOCKED vma's associated mmap
 semaphore, it will return SWAP_AGAIN.  This will allow shrink_page_list()
@@ -557,10 +545,7 @@ However, the scan can terminate when it 
 successfully acquire the vma's mmap semphore for read and mlock the page.
 Although try_to_munlock() can be called many [very many!] times when
 munlock()ing a large region or tearing down a large address space that has been
-mlocked via mlockall(), overall this is a fairly rare event.  In addition,
-although shrink_page_list() calls try_to_munlock() for every anonymous page that
-it handles that is not yet in the swap cache, on average anonymous pages will
-have very short reverse map lists.
+mlocked via mlockall(), overall this is a fairly rare event.
 
 Mlocked Page:  Page Reclaim in shrink_*_list()
 
@@ -588,8 +573,8 @@ Some examples of these unevictable pages
    munlock_vma_page() was forced to let the page back on to the normal
    LRU list for vmscan to handle.
 
-shrink_inactive_list() also culls any unevictable pages that it finds
-on the inactive lists, again diverting them to the appropriate zone's unevictable
+shrink_inactive_list() also culls any unevictable pages that it finds on
+the inactive lists, again diverting them to the appropriate zone's unevictable
 lru list.  shrink_inactive_list() should only see SHM_LOCKed pages that became
 SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or
 pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from
@@ -597,19 +582,7 @@ the lru to recheck via try_to_munlock().
 the latter, but will pass on to shrink_page_list().
 
 shrink_page_list() again culls obviously unevictable pages that it could
-encounter for similar reason to shrink_inactive_list().  As already discussed,
-shrink_page_list() proactively looks for anonymous pages that should have
-PG_mlocked set but don't--these would not be detected by page_evictable()--to
-avoid adding them to the swap cache unnecessarily.  File pages mapped into
+encounter for similar reason to shrink_inactive_list().  Pages mapped into
 VM_LOCKED vmas but without PG_mlocked set will make it all the way to
-try_to_unmap().  shrink_page_list() will divert them to the unevictable list when
-try_to_unmap() returns SWAP_MLOCK, as discussed above.
-
-TODO/FIXME:  If we can enhance the swap cache to reliably remove entries
-with page_count(page) > 2, as long as all ptes are mapped to the page and
-not the swap entry, we can probably remove the call to try_to_munlock() in
-shrink_page_list() and just remove the page from the swap cache when
-try_to_unmap() returns SWAP_MLOCK.   Currently, remove_exclusive_swap_page()
-doesn't seem to allow that.
-
-
+try_to_unmap().  shrink_page_list() will divert them to the unevictable list
+when try_to_unmap() returns SWAP_MLOCK, as discussed above.
--- swapfree5/mm/vmscan.c	2008-11-21 18:50:50.000000000 +0000
+++ swapfree6/mm/vmscan.c	2008-11-21 18:51:01.000000000 +0000
@@ -673,15 +673,6 @@ static unsigned long shrink_page_list(st
 		if (PageAnon(page) && !PageSwapCache(page)) {
 			if (!(sc->gfp_mask & __GFP_IO))
 				goto keep_locked;
-			switch (try_to_munlock(page)) {
-			case SWAP_FAIL:		/* shouldn't happen */
-			case SWAP_AGAIN:
-				goto keep_locked;
-			case SWAP_MLOCK:
-				goto cull_mlocked;
-			case SWAP_SUCCESS:
-				; /* fall thru'; add to swap cache */
-			}
 			if (!add_to_swap(page, GFP_ATOMIC))
 				goto activate_locked;
 			may_enter_fs = 1;
@@ -798,6 +789,8 @@ free_it:
 		continue;
 
 cull_mlocked:
+		if (PageSwapCache(page))
+			try_to_free_swap(page);
 		unlock_page(page);
 		putback_lru_page(page);
 		continue;


* [PATCH 7/8] mm: remove gfp_mask from add_to_swap
  2008-11-23 21:53 [PATCH 0/8] mm: from gup to vmscan Hugh Dickins
                   ` (5 preceding siblings ...)
  2008-11-23 22:03 ` [PATCH 6/8] mm: remove try_to_munlock from vmscan Hugh Dickins
@ 2008-11-23 22:05 ` Hugh Dickins
  2008-11-23 22:07 ` [PATCH 8/8] mm: add add_to_swap stub Hugh Dickins
  7 siblings, 0 replies; 24+ messages in thread
From: Hugh Dickins @ 2008-11-23 22:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, linux-mm

Remove gfp_mask argument from add_to_swap(): it's misleading because its
only caller, shrink_page_list(), is not atomic at that point; and in due
course (implementing discard) we'll sometimes want to allocate some memory
with GFP_NOIO (as is used in swap_writepage) when allocating swap.

No change to the gfp_mask passed down to add_to_swap_cache(): still use
__GFP_HIGH without __GFP_WAIT (with nomemalloc and nowarn as before):
though it's not obvious if that's the best combination to ask for here.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 include/linux/swap.h |    2 +-
 mm/swap_state.c      |    4 ++--
 mm/vmscan.c          |    2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

--- swapfree6/include/linux/swap.h	2008-11-21 18:50:50.000000000 +0000
+++ swapfree7/include/linux/swap.h	2008-11-21 18:51:05.000000000 +0000
@@ -281,7 +281,7 @@ extern void end_swap_bio_read(struct bio
 extern struct address_space swapper_space;
 #define total_swapcache_pages  swapper_space.nrpages
 extern void show_swap_cache_info(void);
-extern int add_to_swap(struct page *, gfp_t);
+extern int add_to_swap(struct page *);
 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
 extern void __delete_from_swap_cache(struct page *);
 extern void delete_from_swap_cache(struct page *);
--- swapfree6/mm/swap_state.c	2008-11-21 18:50:50.000000000 +0000
+++ swapfree7/mm/swap_state.c	2008-11-21 18:51:05.000000000 +0000
@@ -128,7 +128,7 @@ void __delete_from_swap_cache(struct pag
  * Allocate swap space for the page and add the page to the
  * swap cache.  Caller needs to hold the page lock. 
  */
-int add_to_swap(struct page * page, gfp_t gfp_mask)
+int add_to_swap(struct page *page)
 {
 	swp_entry_t entry;
 	int err;
@@ -153,7 +153,7 @@ int add_to_swap(struct page * page, gfp_
 		 * Add it to the swap cache and mark it dirty
 		 */
 		err = add_to_swap_cache(page, entry,
-				gfp_mask|__GFP_NOMEMALLOC|__GFP_NOWARN);
+				__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
 
 		switch (err) {
 		case 0:				/* Success */
--- swapfree6/mm/vmscan.c	2008-11-21 18:51:01.000000000 +0000
+++ swapfree7/mm/vmscan.c	2008-11-21 18:51:05.000000000 +0000
@@ -673,7 +673,7 @@ static unsigned long shrink_page_list(st
 		if (PageAnon(page) && !PageSwapCache(page)) {
 			if (!(sc->gfp_mask & __GFP_IO))
 				goto keep_locked;
-			if (!add_to_swap(page, GFP_ATOMIC))
+			if (!add_to_swap(page))
 				goto activate_locked;
 			may_enter_fs = 1;
 		}


* [PATCH 8/8] mm: add add_to_swap stub
  2008-11-23 21:53 [PATCH 0/8] mm: from gup to vmscan Hugh Dickins
                   ` (6 preceding siblings ...)
  2008-11-23 22:05 ` [PATCH 7/8] mm: remove gfp_mask from add_to_swap Hugh Dickins
@ 2008-11-23 22:07 ` Hugh Dickins
  2008-11-23 22:55   ` Rik van Riel
  7 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2008-11-23 22:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, Rik van Riel, Lee Schermerhorn, linux-mm

If we add a failing stub for add_to_swap(),
then we can remove the #ifdef CONFIG_SWAP from mm/vmscan.c.

This was intended as a source cleanup, but looking more closely, it turns
out that the !CONFIG_SWAP case was going to keep_locked for an anonymous
page, whereas now it goes to the more suitable activate_locked, like the
CONFIG_SWAP nr_swap_pages 0 case.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
Shouldn't PageSwapBacked pages be made PageUnevictable
when CONFIG_UNEVICTABLE_LRU but !CONFIG_SWAP?

 include/linux/swap.h |    5 +++++
 mm/vmscan.c          |    2 --
 2 files changed, 5 insertions(+), 2 deletions(-)

--- swapfree7/include/linux/swap.h	2008-11-21 18:51:05.000000000 +0000
+++ swapfree8/include/linux/swap.h	2008-11-21 18:51:08.000000000 +0000
@@ -374,6 +374,11 @@ static inline struct page *lookup_swap_c
 	return NULL;
 }
 
+static inline int add_to_swap(struct page *page)
+{
+	return 0;
+}
+
 static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
 							gfp_t gfp_mask)
 {
--- swapfree7/mm/vmscan.c	2008-11-21 18:51:05.000000000 +0000
+++ swapfree8/mm/vmscan.c	2008-11-21 18:51:08.000000000 +0000
@@ -665,7 +665,6 @@ static unsigned long shrink_page_list(st
 					referenced && page_mapping_inuse(page))
 			goto activate_locked;
 
-#ifdef CONFIG_SWAP
 		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
@@ -677,7 +676,6 @@ static unsigned long shrink_page_list(st
 				goto activate_locked;
 			may_enter_fs = 1;
 		}
-#endif /* CONFIG_SWAP */
 
 		mapping = page_mapping(page);
 


* [PATCH] memcg: memswap controller core swapcache fixes
  2008-11-23 21:58 ` [PATCH 3/8] mm: reuse_swap_page replaces can_share_swap_page Hugh Dickins
@ 2008-11-23 22:11   ` Hugh Dickins
  2008-11-24  5:43     ` KAMEZAWA Hiroyuki
  2008-11-23 22:43   ` [PATCH 3/8] mm: reuse_swap_page replaces can_share_swap_page Rik van Riel
  1 sibling, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2008-11-23 22:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

Two SwapCache bug fixes to mmotm's memcg-memswap-controller-core.patch:

One bug is independent of my current changes: there is no guarantee that
the page passed to mem_cgroup_try_charge_swapin() is still in SwapCache.

The other bug is a consequence of my changes, but the fix is okay without
them: mem_cgroup_commit_charge_swapin() expects swp_entry in page->private,
but now reuse_swap_page() (formerly can_share_swap_page()) might already
have done delete_from_swap_cache(): move commit_charge_swapin() earlier.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 mm/memcontrol.c |    8 ++++++++
 mm/memory.c     |   15 +++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

--- mmotm.orig/mm/memcontrol.c	2008-11-23 21:03:48.000000000 +0000
+++ mmotm/mm/memcontrol.c	2008-11-23 21:06:12.000000000 +0000
@@ -847,6 +847,14 @@ int mem_cgroup_try_charge_swapin(struct 
 	if (!do_swap_account)
 		goto charge_cur_mm;
 
+	/*
+	 * A racing thread's fault, or swapoff, may have already updated
+	 * the pte, and even removed page from swap cache: return success
+	 * to go on to do_swap_page()'s pte_same() test, which should fail.
+	 */
+	if (!PageSwapCache(page))
+		return 0;
+
 	ent.val = page_private(page);
 
 	mem = lookup_swap_cgroup(ent);
--- mmotm.orig/mm/memory.c	2008-11-23 21:03:48.000000000 +0000
+++ mmotm/mm/memory.c	2008-11-23 21:06:12.000000000 +0000
@@ -2361,8 +2361,20 @@ static int do_swap_page(struct mm_struct
 		goto out_nomap;
 	}
 
-	/* The page isn't present yet, go ahead with the fault. */
+	/*
+	 * The page isn't present yet, go ahead with the fault.
+	 *
+	 * Be careful about the sequence of operations here.
+	 * To get its accounting right, reuse_swap_page() must be called
+	 * while the page is counted on swap but not yet in mapcount i.e.
+	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
+	 * must be called after the swap_free(), or it will never succeed.
+	 * And mem_cgroup_commit_charge_swapin(), which uses the swp_entry
+	 * in page->private, must be called before reuse_swap_page(),
+	 * which may delete_from_swap_cache().
+	 */
 
+	mem_cgroup_commit_charge_swapin(page, ptr);
 	inc_mm_counter(mm, anon_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (write_access && reuse_swap_page(page)) {
@@ -2373,7 +2385,6 @@ static int do_swap_page(struct mm_struct
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
-	mem_cgroup_commit_charge_swapin(page, ptr);
 
 	swap_free(entry);
 	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))


* Re: [PATCH 3/8] mm: reuse_swap_page replaces can_share_swap_page
  2008-11-23 21:58 ` [PATCH 3/8] mm: reuse_swap_page replaces can_share_swap_page Hugh Dickins
  2008-11-23 22:11   ` [PATCH] memcg: memswap controller core swapcache fixes Hugh Dickins
@ 2008-11-23 22:43   ` Rik van Riel
  1 sibling, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2008-11-23 22:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Nick Piggin, Lee Schermerhorn, KAMEZAWA Hiroyuki,
	linux-mm

Hugh Dickins wrote:
> A good place to free up old swap is where do_wp_page(), or do_swap_page(),
> is about to redirty the page: the data on disk is then stale and won't be
> read again; and if we do decide to write the page out later, using the
> previous swap location makes an unnecessary disk seek very likely.

Better still, it frees up swap space that will never be read
anyway.  This helps with workloads that push the system closer
to the edge and helps leave read-only pages on swap, avoiding
IO.

> Signed-off-by: Hugh Dickins <hugh@veritas.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.


* Re: [PATCH 6/8] mm: remove try_to_munlock from vmscan
  2008-11-23 22:03 ` [PATCH 6/8] mm: remove try_to_munlock from vmscan Hugh Dickins
@ 2008-11-23 22:53   ` Rik van Riel
  2008-11-24 17:34   ` Lee Schermerhorn
  1 sibling, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2008-11-23 22:53 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, Lee Schermerhorn, linux-mm

Hugh Dickins wrote:

> Now try_to_free_swap() has replaced remove_exclusive_swap_page(), that's
> not an issue any more: remove try_to_munlock() call from shrink_page_list(),
> leaving it to try_to_unmap() to discover if the page is one to be culled
> to the unevictable list - in which case then try_to_free_swap().

Nice simplification.

> Update unevictable-lru.txt to remove comments on the try_to_munlock()
> in shrink_page_list(), and shorten some lines over 80 columns.
> 
> Signed-off-by: Hugh Dickins <hugh@veritas.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.


* Re: [PATCH 8/8] mm: add add_to_swap stub
  2008-11-23 22:07 ` [PATCH 8/8] mm: add add_to_swap stub Hugh Dickins
@ 2008-11-23 22:55   ` Rik van Riel
  2008-11-24 13:49     ` Hugh Dickins
  0 siblings, 1 reply; 24+ messages in thread
From: Rik van Riel @ 2008-11-23 22:55 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, Lee Schermerhorn, linux-mm

Hugh Dickins wrote:
> If we add a failing stub for add_to_swap(),
> then we can remove the #ifdef CONFIG_SWAP from mm/vmscan.c.
> 
> This was intended as a source cleanup, but looking more closely, it turns
> out that the !CONFIG_SWAP case was going to keep_locked for an anonymous
> page, whereas now it goes to the more suitable activate_locked, like the
> CONFIG_SWAP nr_swap_pages 0 case.

If there is no swap space available, we will not scan the
anon pages at all.

Hmm, maybe we need a special simplified get_scan_ratio()
for !CONFIG_SWAP?
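
(Purely as an illustration of that thought, not a tested patch - assuming
the current get_scan_ratio() signature and the global nr_swap_pages, which
is defined 0L when !CONFIG_SWAP - it might look something like:)

	static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
				   unsigned long *percent)
	{
		/* With no swap at all, there is no point scanning anon */
		if (nr_swap_pages <= 0) {
			percent[0] = 0;		/* anon */
			percent[1] = 100;	/* file */
			return;
		}
		/* ... existing recent_rotated/recent_scanned arithmetic ... */
	}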

> Signed-off-by: Hugh Dickins <hugh@veritas.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.


* Re: [PATCH] memcg: memswap controller core swapcache fixes
  2008-11-23 22:11   ` [PATCH] memcg: memswap controller core swapcache fixes Hugh Dickins
@ 2008-11-24  5:43     ` KAMEZAWA Hiroyuki
  2008-11-24  6:15       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-24  5:43 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Balbir Singh, linux-mm

On Sun, 23 Nov 2008 22:11:07 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:

> Two SwapCache bug fixes to mmotm's memcg-memswap-controller-core.patch:
> 
> One bug is independent of my current changes: there is no guarantee that
> the page passed to mem_cgroup_try_charge_swapin() is still in SwapCache.
> 

Ah, yes. I was wrong: the page may no longer be SwapCache by the time
lock_page() is called...

Thanks!
-Kame

> The other bug is a consequence of my changes, but the fix is okay without
> them: mem_cgroup_commit_charge_swapin() expects swp_entry in page->private,
> but now reuse_swap_page() (formerly can_share_swap_page()) might already
> have done delete_from_swap_cache(): move commit_charge_swapin() earlier.
> 
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
> 
>  mm/memcontrol.c |    8 ++++++++
>  mm/memory.c     |   15 +++++++++++++--
>  2 files changed, 21 insertions(+), 2 deletions(-)
> 
> --- mmotm.orig/mm/memcontrol.c	2008-11-23 21:03:48.000000000 +0000
> +++ mmotm/mm/memcontrol.c	2008-11-23 21:06:12.000000000 +0000
> @@ -847,6 +847,14 @@ int mem_cgroup_try_charge_swapin(struct 
>  	if (!do_swap_account)
>  		goto charge_cur_mm;
>  
> +	/*
> +	 * A racing thread's fault, or swapoff, may have already updated
> +	 * the pte, and even removed page from swap cache: return success
> +	 * to go on to do_swap_page()'s pte_same() test, which should fail.
> +	 */
> +	if (!PageSwapCache(page))
> +		return 0;
> +
>  	ent.val = page_private(page);
>  
>  	mem = lookup_swap_cgroup(ent);
> --- mmotm.orig/mm/memory.c	2008-11-23 21:03:48.000000000 +0000
> +++ mmotm/mm/memory.c	2008-11-23 21:06:12.000000000 +0000
> @@ -2361,8 +2361,20 @@ static int do_swap_page(struct mm_struct
>  		goto out_nomap;
>  	}
>  
> -	/* The page isn't present yet, go ahead with the fault. */
> +	/*
> +	 * The page isn't present yet, go ahead with the fault.
> +	 *
> +	 * Be careful about the sequence of operations here.
> +	 * To get its accounting right, reuse_swap_page() must be called
> +	 * while the page is counted on swap but not yet in mapcount i.e.
> +	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
> +	 * must be called after the swap_free(), or it will never succeed.
> +	 * And mem_cgroup_commit_charge_swapin(), which uses the swp_entry
> +	 * in page->private, must be called before reuse_swap_page(),
> +	 * which may delete_from_swap_cache().
> +	 */
>  
> +	mem_cgroup_commit_charge_swapin(page, ptr);
>  	inc_mm_counter(mm, anon_rss);
>  	pte = mk_pte(page, vma->vm_page_prot);
>  	if (write_access && reuse_swap_page(page)) {
> @@ -2373,7 +2385,6 @@ static int do_swap_page(struct mm_struct
>  	flush_icache_page(vma, page);
>  	set_pte_at(mm, address, page_table, pte);
>  	page_add_anon_rmap(page, vma, address);
> -	mem_cgroup_commit_charge_swapin(page, ptr);
>  
>  	swap_free(entry);
>  	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> 


* Re: [PATCH] memcg: memswap controller core swapcache fixes
  2008-11-24  5:43     ` KAMEZAWA Hiroyuki
@ 2008-11-24  6:15       ` KAMEZAWA Hiroyuki
  2008-11-24 12:29         ` Hugh Dickins
  0 siblings, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-24  6:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Hugh Dickins, Andrew Morton, Balbir Singh, linux-mm

On Mon, 24 Nov 2008 14:43:44 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Sun, 23 Nov 2008 22:11:07 +0000 (GMT)
> Hugh Dickins <hugh@veritas.com> wrote:
> 
> > Two SwapCache bug fixes to mmotm's memcg-memswap-controller-core.patch:
> > 
> > One bug is independent of my current changes: there is no guarantee that
> > the page passed to mem_cgroup_try_charge_swapin() is still in SwapCache.
> > 
> 
> Ah, yes. I was wrong: the page may no longer be SwapCache by the time
> lock_page() is called...

Ah...sorry again,

> > +	/*
> > +	 * A racing thread's fault, or swapoff, may have already updated
> > +	 * the pte, and even removed page from swap cache: return success
> > +	 * to go on to do_swap_page()'s pte_same() test, which should fail.
> > +	 */
> > +	if (!PageSwapCache(page))
> > +		return 0;
> > +
> >  	ent.val = page_private(page);

I think 
==
	if (!PageSwapCache(page))
		goto charge_cur_mm;
==
is better.

In the following case,
==
	CPUA				CPUB
				    remove_from_swapcache
	lock_page()                 <==========================(*)
	try_charge_swapin()          
	....
	commit_charge()
==
At (*), the page may be fully unmapped and not charged
(and the PCG_USED bit is cleared.)
If so, returning without any charge here means a leak of charge.

Even if *charged* here,
  - mem_cgroup_cancel_charge_swapin (handles !pte_same() case.)
  - mem_cgroup_commit_charge_swapin (handles page is doubly charged case.)

try-commit-cancel was introduced to handle this kind of case, and the bug in
my code is accessing page->private without checking PageSwapCache().

Thanks,
-Kame



* Re: [PATCH] memcg: memswap controller core swapcache fixes
  2008-11-24  6:15       ` KAMEZAWA Hiroyuki
@ 2008-11-24 12:29         ` Hugh Dickins
  2008-11-24 12:57           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2008-11-24 12:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Andrew Morton, Balbir Singh, linux-mm

On Mon, 24 Nov 2008, KAMEZAWA Hiroyuki wrote:
> On Mon, 24 Nov 2008 14:43:44 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Sun, 23 Nov 2008 22:11:07 +0000 (GMT)
> > Hugh Dickins <hugh@veritas.com> wrote:
> > > +	/*
> > > +	 * A racing thread's fault, or swapoff, may have already updated
> > > +	 * the pte, and even removed page from swap cache: return success
> > > +	 * to go on to do_swap_page()'s pte_same() test, which should fail.
> > > +	 */
> > > +	if (!PageSwapCache(page))
> > > +		return 0;
> > > +
> > >  	ent.val = page_private(page);
> 
> I think 
> ==
> 	if (!PageSwapCache(page))
> 		goto charge_cur_mm;
> ==
> is better.
> 
> In the following case,
> ==
> 	CPUA				CPUB
> 				    remove_from_swapcache
> 	lock_page()                 <==========================(*)
> 	try_charge_swapin()          
> 	....
> 	commit_charge()
> ==
> At (*), the page may be fully unmapped and not charged
> (and the PCG_USED bit is cleared.)
> If so, returning without any charge here means a leak of charge.
> 
> Even if *charged* here,
>   - mem_cgroup_cancel_charge_swapin (handles !pte_same() case.)
>   - mem_cgroup_commit_charge_swapin (handles page is doubly charged case.)
> 
> try-commit-cancel was introduced to handle this kind of case, and the bug in
> my code is accessing page->private without checking PageSwapCache().

I've not studied your charging regime at all, but I think either
my way or yours should work.

There shouldn't be a leak of charge with my patch, because CPUB cannot
remove the page from swapcache until all references to that swap have
been removed: so do_swap_page's (second) pte_same test will fail, and
it'll goto out_nomap.

With my patch, no charge was made, ptr was left NULL and no uncharge
will be made: it was easier for me to see that way.  Doing it your
way, ptr will be set and charged and there will be uncharging to do.

But your way does look better, given that above we've already done
	if (!do_swap_account)
		goto charge_cur_mm;
It looks rather suspicious to "return 0" in some cases after that.

Which of us should update the patch to Andrew?  I'd prefer you
to do it, since you understand the charging and uncharging,
but I can send it if you're busy.

Hugh


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] memcg: memswap controller core swapcache fixes
  2008-11-24 12:29         ` Hugh Dickins
@ 2008-11-24 12:57           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-24 12:57 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Balbir Singh, linux-mm

On Mon, 24 Nov 2008 12:29:54 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:

> On Mon, 24 Nov 2008, KAMEZAWA Hiroyuki wrote:
> > On Mon, 24 Nov 2008 14:43:44 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Sun, 23 Nov 2008 22:11:07 +0000 (GMT)
> > > Hugh Dickins <hugh@veritas.com> wrote:
> > > > +	/*
> > > > +	 * A racing thread's fault, or swapoff, may have already updated
> > > > +	 * the pte, and even removed page from swap cache: return success
> > > > +	 * to go on to do_swap_page()'s pte_same() test, which should fail.
> > > > +	 */
> > > > +	if (!PageSwapCache(page))
> > > > +		return 0;
> > > > +
> > > >  	ent.val = page_private(page);
> > 
> > I think 
> > ==
> > 	if (!PageSwapCache(page))
> > 		goto charge_cur_mm;
> > ==
> > is better.
> > 
> > In the following case,
> > ==
> > 	CPUA				CPUB
> > 				    remove_from_swapcache
> > 	lock_page()                 <==========================(*)
> > 	try_charge_swapin()          
> > 	....
> > 	commit_charge()
> > ==
> > At (*), the page may be fully unmapped and not charged
> > (and the PCG_USED bit is cleared).
> > If so, returning without any charge here means a leak of charge.
> > 
> > Even if we *charge* here,
> >   - mem_cgroup_cancel_charge_swapin() handles the !pte_same() case.
> >   - mem_cgroup_commit_charge_swapin() handles the case where the page is doubly charged.
> > 
> > try-commit-cancel was introduced to handle this kind of case; the bug in my code
> > is accessing page->private without checking PageSwapCache().
> 
> I've not studied your charging regime at all, but I think either
> my way or yours should work.
> 
> There shouldn't be a leak of charge with my patch, because CPUB cannot
> remove the page from swapcache until all references to that swap have
> been removed: so do_swap_page's (second) pte_same test will fail, and
> it'll goto out_nomap.
> 
> With my patch, no charge was made, ptr was left NULL and no uncharge
> will be made: it was easier for me to see that way.  Doing it your
> way, ptr will be set and charged and there will be uncharging to do.
> 
Thank you for the confirmation.
I have no objection to your way.
I'd just like some time to review it all again.

-Kame

> But your way does look better, given that above we've already done
> 	if (!do_swap_account)
> 		goto charge_cur_mm;
> It looks rather suspicious to "return 0" in some cases after that.
> 
> Which of us should update the patch to Andrew?  I'd prefer you
> to do it, since you understand the charging and uncharging,
> but I can send it if you're busy.
> 
> Hugh
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 8/8] mm: add add_to_swap stub
  2008-11-23 22:55   ` Rik van Riel
@ 2008-11-24 13:49     ` Hugh Dickins
  2008-11-24 13:53       ` [PATCH 9/8] mm: optimize get_scan_ratio for no swap Hugh Dickins
  0 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2008-11-24 13:49 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, Nick Piggin, Lee Schermerhorn, linux-mm

On Sun, 23 Nov 2008, Rik van Riel wrote:
> Hugh Dickins wrote:
> > 
> > This was intended as a source cleanup, but looking more closely, it turns
> > out that the !CONFIG_SWAP case was going to keep_locked for an anonymous
> > page, whereas now it goes to the more suitable activate_locked, like the
> > CONFIG_SWAP nr_swap_pages 0 case.
> 
> If there is no swap space available, we will not scan the
> anon pages at all.

Ah, yes, you explained that to me already a few months ago: sorry.

I thought it might be the case, but didn't spot where, so actually
ran !CONFIG_SWAP, counting how many add_to_swap()s occurred, before
sending in the patch - I wasn't sure how big a deal to make of the
keep_locked issue in the comment.

On one run I _did_ see a flurry of add_to_swap()s, but wasn't able
to reproduce it - found it hard to get the balance right between
trying to swap and OOMing, and my test wasn't very inventive.

It didn't seem worth pursuing further at the time, but now you say
"we will not scan the anon pages at all", I wonder if I ought to
try to reproduce it, to see how we came to be trying add_to_swap()
in that case?  Or is there a corner case you know of, and it's not
worth worrying further?

> 
> Hmm, maybe we need a special simplified get_scan_ratio()
> for !CONFIG_SWAP?

But without adding #ifdef CONFIG_SWAPs back in: patch follows.

> 
> > Signed-off-by: Hugh Dickins <hugh@veritas.com>
> 
> Acked-by: Rik van Riel <riel@redhat.com>

Thanks a lot for looking at these.

Hugh


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 9/8] mm: optimize get_scan_ratio for no swap
  2008-11-24 13:49     ` Hugh Dickins
@ 2008-11-24 13:53       ` Hugh Dickins
  2008-11-24 14:11         ` Rik van Riel
  0 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2008-11-24 13:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, Nick Piggin, Lee Schermerhorn, linux-mm

Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP.  Yes,
the gcc optimizer gives us that, when nr_swap_pages is #defined as 0L.
Move usual declaration to swapfile.c: it never belonged in page_alloc.c.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 include/linux/swap.h |    5 +++--
 mm/page_alloc.c      |    1 -
 mm/swapfile.c        |    1 +
 mm/vmscan.c          |   12 ++++++------
 4 files changed, 10 insertions(+), 9 deletions(-)

--- swapfree8/include/linux/swap.h	2008-11-21 18:51:08.000000000 +0000
+++ swapfree9/include/linux/swap.h	2008-11-24 13:27:00.000000000 +0000
@@ -163,7 +163,6 @@ struct swap_list_t {
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
-extern long nr_swap_pages;
 extern unsigned int nr_free_buffer_pages(void);
 extern unsigned int nr_free_pagecache_pages(void);
 
@@ -294,6 +293,7 @@ extern struct page *swapin_readahead(swp
 			struct vm_area_struct *vma, unsigned long addr);
 
 /* linux/mm/swapfile.c */
+extern long nr_swap_pages;
 extern long total_swap_pages;
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(void);
@@ -334,7 +334,8 @@ static inline void disable_swap_token(vo
 
 #else /* CONFIG_SWAP */
 
-#define total_swap_pages			0
+#define nr_swap_pages				0L
+#define total_swap_pages			0L
 #define total_swapcache_pages			0UL
 
 #define si_swapinfo(val) \
--- swapfree8/mm/page_alloc.c	2008-11-19 15:25:12.000000000 +0000
+++ swapfree9/mm/page_alloc.c	2008-11-24 13:27:00.000000000 +0000
@@ -69,7 +69,6 @@ EXPORT_SYMBOL(node_states);
 
 unsigned long totalram_pages __read_mostly;
 unsigned long totalreserve_pages __read_mostly;
-long nr_swap_pages;
 int percpu_pagelist_fraction;
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
--- swapfree8/mm/swapfile.c	2008-11-21 18:50:59.000000000 +0000
+++ swapfree9/mm/swapfile.c	2008-11-24 13:27:00.000000000 +0000
@@ -35,6 +35,7 @@
 
 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
+long nr_swap_pages;
 long total_swap_pages;
 static int swap_overflow;
 static int least_priority;
--- swapfree8/mm/vmscan.c	2008-11-21 18:51:08.000000000 +0000
+++ swapfree9/mm/vmscan.c	2008-11-24 13:27:00.000000000 +0000
@@ -1374,12 +1374,6 @@ static void get_scan_ratio(struct zone *
 	unsigned long anon_prio, file_prio;
 	unsigned long ap, fp;
 
-	anon  = zone_page_state(zone, NR_ACTIVE_ANON) +
-		zone_page_state(zone, NR_INACTIVE_ANON);
-	file  = zone_page_state(zone, NR_ACTIVE_FILE) +
-		zone_page_state(zone, NR_INACTIVE_FILE);
-	free  = zone_page_state(zone, NR_FREE_PAGES);
-
 	/* If we have no swap space, do not bother scanning anon pages. */
 	if (nr_swap_pages <= 0) {
 		percent[0] = 0;
@@ -1387,6 +1381,12 @@ static void get_scan_ratio(struct zone *
 		return;
 	}
 
+	anon  = zone_page_state(zone, NR_ACTIVE_ANON) +
+		zone_page_state(zone, NR_INACTIVE_ANON);
+	file  = zone_page_state(zone, NR_ACTIVE_FILE) +
+		zone_page_state(zone, NR_INACTIVE_FILE);
+	free  = zone_page_state(zone, NR_FREE_PAGES);
+
 	/* If we have very few page cache pages, force-scan anon pages. */
 	if (unlikely(file + free <= zone->pages_high)) {
 		percent[0] = 100;

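The mechanism, spelled out: with the !CONFIG_SWAP #define above, the guard in
get_scan_ratio() is a compile-time constant, so gcc discards everything after
it as dead code.  An illustrative sketch only, not kernel code:

#define nr_swap_pages	0L	/* the !CONFIG_SWAP definition from the patch */

static void get_scan_ratio_shape(unsigned long percent[2])
{
	if (nr_swap_pages <= 0) {	/* constant-true without swap */
		percent[0] = 0;		/* never scan anon */
		percent[1] = 100;	/* scan only file */
		return;
	}
	/* ... the swap-dependent ratio arithmetic, all eliminated by gcc ... */
}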

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 9/8] mm: optimize get_scan_ratio for no swap
  2008-11-24 13:53       ` [PATCH 9/8] mm: optimize get_scan_ratio for no swap Hugh Dickins
@ 2008-11-24 14:11         ` Rik van Riel
  0 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2008-11-24 14:11 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, Lee Schermerhorn, linux-mm

Hugh Dickins wrote:
> Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP.  Yes,
> the gcc optimizer gives us that, when nr_swap_pages is #defined as 0L.
> Move usual declaration to swapfile.c: it never belonged in page_alloc.c.
> 
> Signed-off-by: Hugh Dickins <hugh@veritas.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/8] mm: remove try_to_munlock from vmscan
  2008-11-23 22:03 ` [PATCH 6/8] mm: remove try_to_munlock from vmscan Hugh Dickins
  2008-11-23 22:53   ` Rik van Riel
@ 2008-11-24 17:34   ` Lee Schermerhorn
  2008-11-24 19:29     ` Hugh Dickins
  1 sibling, 1 reply; 24+ messages in thread
From: Lee Schermerhorn @ 2008-11-24 17:34 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, Rik van Riel, linux-mm

On Sun, 2008-11-23 at 22:03 +0000, Hugh Dickins wrote:
> An unfortunate feature of the Unevictable LRU work was that reclaiming an
> anonymous page involved an extra scan through the anon_vma: to check that
> the page is evictable before allocating swap, because the swap could not
> be freed reliably soon afterwards.
> 
> Now try_to_free_swap() has replaced remove_exclusive_swap_page(), that's
> not an issue any more: remove try_to_munlock() call from shrink_page_list(),
> leaving it to try_to_unmap() to discover if the page is one to be culled
> to the unevictable list - in which case then try_to_free_swap().
> 
> Update unevictable-lru.txt to remove comments on the try_to_munlock()
> in shrink_page_list(), and shorten some lines over 80 columns.


Hugh:  Thanks for doing this.   Another item on my to-do list, as noted
in the document.

> 
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
> I've not tested this against whatever test showed the need for that
> try_to_munlock() in shrink_page_list() in the first place.  Rik or Lee,
> please, would you have the time to run that test on the next -mm that has
> this patch in, to check that I've not messed things up?  Alternatively,
> please point me to such a test - but I think you've been targeting
> larger machines than I have access to - thanks.

I will rerun my test workload when this shows up in mmotm.  

I added the extra try_to_munlock() [TODO:  maybe "page_mlocked()" is
better name?] to prevent using swap space for pages that were destined
for the unevictable list.  This is more likely, I think, now that we've
removed the lru_drain_all() calls from the mlock[all]() handlers.  Back
when I added this, I wasn't sure that we could reliably remove swap from
a page with an arbitrary number of mappers.  Rik had warned against
making that assumption.
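
The check in question sat just before the swap allocation in
shrink_page_list(); approximately this shape (paraphrased from the 2.6.28
sources, not the exact hunk; the cull_mlocked label and SWAP_* codes are
recalled, not quoted from the patch):

		if (PageAnon(page) && !PageSwapCache(page)) {
			switch (try_to_munlock(page)) {
			case SWAP_MLOCK:
				goto cull_mlocked;	/* mlocked: give it no swap */
			case SWAP_AGAIN:
			case SWAP_FAIL:
				goto keep_locked;
			case SWAP_SUCCESS:
				break;			/* not mlocked: fall through */
			}
			if (!add_to_swap(page, GFP_ATOMIC))
				goto activate_locked;
		}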

Lee

<snip>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/8] mm: remove try_to_munlock from vmscan
  2008-11-24 17:34   ` Lee Schermerhorn
@ 2008-11-24 19:29     ` Hugh Dickins
  2008-12-01 20:16       ` Lee Schermerhorn
  0 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2008-11-24 19:29 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Andrew Morton, Nick Piggin, Rik van Riel, linux-mm

On Mon, 24 Nov 2008, Lee Schermerhorn wrote:
> On Sun, 2008-11-23 at 22:03 +0000, Hugh Dickins wrote:
> 
> Hugh:  Thanks for doing this.   Another item on my to-do list, as noted
> in the document.

Taking the page_count() check out of remove_exclusive_swap_page()
has been on my to-do list for about four years, so I'd have been
extra ashamed if you got there before me.  Most of that time I'd
been thinking we needed a page_mapcount() check instead; it's only
recently that I've realized it was silly to be requiring "exclusive"
in the first place.

> > 
> > Signed-off-by: Hugh Dickins <hugh@veritas.com>
> > ---
> > I've not tested this against whatever test showed the need for that
> > try_to_munlock() in shrink_page_list() in the first place.  Rik or Lee,
> > please, would you have the time to run that test on the next -mm that has
> > this patch in, to check that I've not messed things up?  Alternatively,
> > please point me to such a test - but I think you've been targeting
> > larger machines than I have access to - thanks.
> 
> I will rerun my test workload when this shows up in mmotm.  

Great, thanks a lot.

> 
> I added the extra try_to_munlock() [TODO:  maybe "page_mlocked()" is
> better name?]

I think it's a much better name, so left in that part of the TODO;
but for some reason felt I'd better leave that change to you.

> to prevent using swap space for pages that were destined
> for the unevictable list.  This is more likely, I think, now that we've
> removed the lru_drain_all() calls from the mlock[all]() handlers.  Back
> when I added this, I wasn't sure that we could reliably remove swap from
> a page with an arbitrary number of mappers.  Rik had warned against
> making that assumption.

Yes, it's bitten us before.  I expect you could have handled it (in
all but racy cases) by use of your remove_exclusive_swap_page_count()
but it's a lot easier never having to worry about exclusivity at all.

Hugh


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/8] mm: remove try_to_munlock from vmscan
  2008-11-24 19:29     ` Hugh Dickins
@ 2008-12-01 20:16       ` Lee Schermerhorn
  2008-12-02  0:51         ` Hugh Dickins
  0 siblings, 1 reply; 24+ messages in thread
From: Lee Schermerhorn @ 2008-12-01 20:16 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, Rik van Riel, linux-mm

On Mon, 2008-11-24 at 19:29 +0000, Hugh Dickins wrote:
> On Mon, 24 Nov 2008, Lee Schermerhorn wrote:
> > On Sun, 2008-11-23 at 22:03 +0000, Hugh Dickins wrote:
> > 
<snip>
> 
> > > 
> > > Signed-off-by: Hugh Dickins <hugh@veritas.com>
> > > ---
> > > I've not tested this against whatever test showed the need for that
> > > try_to_munlock() in shrink_page_list() in the first place.  Rik or Lee,
> > > please, would you have the time to run that test on the next -mm that has
> > > this patch in, to check that I've not messed things up?  Alternatively,
> > > please point me to such a test - but I think you've been targeting
> > > larger machines than I have access to - thanks.
> > 
> > I will rerun my test workload when this shows up in mmotm.  
> 
> Great, thanks a lot.

Hugh:  I got a chance to start my test workload on 28-rc6-mmotm-081130
today [after finding the patch for the "BUG_ON(!dot)" boot-time panic].
I added a couple of temporary vmstat counters to count attempts to free
swap space in the vmscan "cull" path and successful frees, so I could
tell that we were exercising your changes.
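
The obvious way to wire up such temporary counters is something like the
sketch below; this is not the actual debug patch, just the usual shape, with
the names taken from the vmstat output further down:

/* include/linux/vmstat.h: add to enum vm_event_item (plus matching
 * "swap_try_free_mlocked_pgs" / "swap_freed_mlocked_pgs" strings in
 * vmstat_text[] in mm/vmstat.c): */
	SWAP_TRY_FREE_MLOCKED_PGS,
	SWAP_FREED_MLOCKED_PGS,

/* mm/vmscan.c: around the try_to_free_swap() call in the cull path: */
	count_vm_event(SWAP_TRY_FREE_MLOCKED_PGS);
	if (try_to_free_swap(page))
		count_vm_event(SWAP_FREED_MLOCKED_PGS);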

Unfortunately, both my x86_64 and ia64 platforms eventually [after an
hour and a half or so] hit a null pointer deref [Nat consumption on
ia64] in __get_user_pages().  In both cases, __get_user_pages was called
while ps(1) was trying to read the task's command line via /proc.

The ia64 platform eventually locked up.  I rebooted the x86_64 and hit a
"kernel BUG at fs/dcache.c:666".  I don't know that these were related
to your changes, so I'll report them separately.

I did manage to grab some selected vmstats from the run on the x86_64,
after a couple of hours of running [it stayed up longer than the ia64]:

egrep '^pgp|^pswp|^pgfau|^pgmaj|^unev|^swap_' /proc/vmstat
pgpgin 288501203
pgpgout 89224219
pswpin 1063928
pswpout 1637706
pgfault 1471335469
pgmajfault 517119
unevictable_pgs_culled 108794397
unevictable_pgs_scanned 28835840
unevictable_pgs_rescued 260444075
unevictable_pgs_mlocked 250282969
unevictable_pgs_munlocked 239272025
unevictable_pgs_cleared 6750236
unevictable_pgs_stranded 0
unevictable_pgs_mlockfreed 0
swap_try_free_mlocked_pgs 823799
swap_freed_mlocked_pgs 823799

the last two items are the temporary counters where we
try_to_free_swap() in the vmscan cull path.

I tested this by mmap()ing a largish [20G] anon segment, writing to each
page to populate it, then mlocking the segment.  Other tests kept the
system under memory pressure so that quite a few of the pages of the
anon segment got swapped out before I mlocked it.  The fact that I hit
these counters indicates that many of the mlocked pages were culled by
vmscan rather than by mlock itself--possibly because we recently removed
the lru_drain_all() from mlock, so we don't necessarily see all of the
pages on the lru on return from get_user_pages() in mlock.
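
A minimal userspace sketch of that kind of test (not the actual workload;
the 20G figure is just the size mentioned above):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t size = 20UL << 30;		/* ~20G anon segment */
	long pagesize = sysconf(_SC_PAGESIZE);
	size_t off;
	char *p;

	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* write to each page so it is populated and later swappable */
	for (off = 0; off < size; off += pagesize)
		p[off] = 1;

	/* other tests apply memory pressure here, swapping some of it out */

	if (mlock(p, size) != 0) {		/* then mlock the segment */
		perror("mlock");
		return 1;
	}
	pause();				/* hold the locked memory */
	return 0;
}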

Later,
Lee


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/8] mm: remove try_to_munlock from vmscan
  2008-12-01 20:16       ` Lee Schermerhorn
@ 2008-12-02  0:51         ` Hugh Dickins
  0 siblings, 0 replies; 24+ messages in thread
From: Hugh Dickins @ 2008-12-02  0:51 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Andrew Morton, Nick Piggin, Rik van Riel, linux-mm

On Mon, 1 Dec 2008, Lee Schermerhorn wrote:
> 
> Hugh:  I got a chance to start my test workload on 28-rc6-mmotm-081130

Thanks!

> today [after finding the patch for the "BUG_ON(!dot)" boot-time panic].
> I added a couple of temporary vmstat counters to count attempts to free
> swap space in the vmscan "cull" path and successful frees, so I could
> tell that we were exercising your changes.
> 
> Unfortunately, both my x86_64 and ia64 platforms eventually [after an
> hour and a half or so] hit a null pointer deref [NaT consumption on
> ia64] in __get_user_pages().  In both cases, __get_user_pages was called
> while ps(1) was trying to read the task's command line via /proc.
> 
> The ia64 platform eventually locked up.  I rebooted the x86_64 and hit a
> "kernel BUG at fs/dcache.c:666".

Beware the BUG of the Beast.

> I don't know that these were related
> to your changes, so I'll report them separately.

Please do.  They don't sound so close that I feel guilty yet,
but they don't sound so far away that I can rest all that easy.

> 
> I did manage to grab some selected vmstats from the run on the x86_64,
> after a couple of hours of running [it stayed up longer than the ia64]:
> 
> egrep '^pgp|^pswp|^pgfau|^pgmaj|^unev|^swap_' /proc/vmstat
> pgpgin 288501203
> pgpgout 89224219
> pswpin 1063928
> pswpout 1637706
> pgfault 1471335469
> pgmajfault 517119
> unevictable_pgs_culled 108794397
> unevictable_pgs_scanned 28835840
> unevictable_pgs_rescued 260444075
> unevictable_pgs_mlocked 250282969
> unevictable_pgs_munlocked 239272025
> unevictable_pgs_cleared 6750236
> unevictable_pgs_stranded 0
> unevictable_pgs_mlockfreed 0
> swap_try_free_mlocked_pgs 823799
> swap_freed_mlocked_pgs 823799
> 
> the last two items are the temporary counters where we
> try_to_free_swap() in the vmscan cull path.
> 
> I tested this by mmap()ing a largish [20G] anon segment, writing to each
> page to populate it, then mlocking the segment.  Other tests kept the
> system under memory pressure so that quite a few of the pages of the
> anon segment got swapped out before I mlocked it.  The fact that I hit
> these counters indicates that many of the mlocked pages were culled by
> vmscan rather than by mlock itself--possibly because we recently removed
> the lru_drain_all() from mlock, so we don't necessarily see all of the
> pages on the lru on return from get_user_pages() in mlock.

That does sound like you've really been testing it, and it looks to be
a satisfying result, that it actually swap_freed every one it tried.

Thanks for the reassurance (but let's see about those bugs),

Hugh


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2008-12-02  0:51 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
2008-11-23 21:53 [PATCH 0/8] mm: from gup to vmscan Hugh Dickins
2008-11-23 21:55 ` [PATCH 1/8] mm: gup persist for write permission Hugh Dickins
2008-11-23 21:56 ` [PATCH 2/8] mm: wp lock page before deciding cow Hugh Dickins
2008-11-23 21:58 ` [PATCH 3/8] mm: reuse_swap_page replaces can_share_swap_page Hugh Dickins
2008-11-23 22:11   ` [PATCH] memcg: memswap controller core swapcache fixes Hugh Dickins
2008-11-24  5:43     ` KAMEZAWA Hiroyuki
2008-11-24  6:15       ` KAMEZAWA Hiroyuki
2008-11-24 12:29         ` Hugh Dickins
2008-11-24 12:57           ` KAMEZAWA Hiroyuki
2008-11-23 22:43   ` [PATCH 3/8] mm: reuse_swap_page replaces can_share_swap_page Rik van Riel
2008-11-23 22:00 ` [PATCH 4/8] mm: try_to_free_swap replaces remove_exclusive_swap_page Hugh Dickins
2008-11-23 22:01 ` [PATCH 5/8] mm: try_to_unuse check removing right swap Hugh Dickins
2008-11-23 22:03 ` [PATCH 6/8] mm: remove try_to_munlock from vmscan Hugh Dickins
2008-11-23 22:53   ` Rik van Riel
2008-11-24 17:34   ` Lee Schermerhorn
2008-11-24 19:29     ` Hugh Dickins
2008-12-01 20:16       ` Lee Schermerhorn
2008-12-02  0:51         ` Hugh Dickins
2008-11-23 22:05 ` [PATCH 7/8] mm: remove gfp_mask from add_to_swap Hugh Dickins
2008-11-23 22:07 ` [PATCH 8/8] mm: add add_to_swap stub Hugh Dickins
2008-11-23 22:55   ` Rik van Riel
2008-11-24 13:49     ` Hugh Dickins
2008-11-24 13:53       ` [PATCH 9/8] mm: optimize get_scan_ratio for no swap Hugh Dickins
2008-11-24 14:11         ` Rik van Riel
