[RFC][PATCH] synchrouns swap freeing at zapping vmas

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC][PATCH] synchrouns swap freeing at zapping vmas
@ 2009-05-21  7:41 KAMEZAWA Hiroyuki
  2009-05-21  7:43 ` [RFC][PATCH 1/2] change swapcount handling KAMEZAWA Hiroyuki
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-21  7:41 UTC (permalink / raw)
  To: linux-mm; +Cc: nishimura, hugh, balbir, akpm

In these 6-7 weeks, we tried to fix memcg's swap-leak race by checking
swap is valid or not after I/O. But Andrew Morton pointed out that
"trylock in free_swap_and_cache() is not good"
Oh, yes. it's not good.

Then, this patch series is a trial to remove trylock for swapcache AMAP.
Patches are more complex and larger than expected but the behavior itself is
much appreciate than prevoius my posts for memcg...

This series contains 2 patches.
  1. change refcounting in swap_map.
     This is for allowing swap_map to indicate there is swap reference/cache.
  2. synchronous freeing of swap entries.
     For avoiding race, free swap_entries in appropriate way with lock_page().
     After this patch, race between swapin-readahead v.s. zap_page_range()
     will go away.
     Note: the whole code for zap_page_range() will not work until the system
     or cgroup is very swappy. So, no influence in typical case.

There are used trylocks more than this patch treats. But IIUC, they are not
racy with memcg and I don't care them.
(And....I have no idea to remove trylock() in free_pages_and_swapcache(),
 which is called via tlb_flush_mmu()....preemption disabled and using percpu.)

These patches + Nishimura-san's writeback fix will do complete work, I think.
But test is not enough.

Any comments are welcome. 

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC][PATCH 1/2] change swapcount handling
  2009-05-21  7:41 [RFC][PATCH] synchrouns swap freeing at zapping vmas KAMEZAWA Hiroyuki
@ 2009-05-21  7:43 ` KAMEZAWA Hiroyuki
  2009-05-21  7:43 ` [RFC][PATCH 2/2] synchrouns swap freeing without trylock KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-21  7:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura, hugh, balbir, akpm

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, there are 2types of reference to swap entry. One is reference from
page tables (and shmem, etc..) and another is SwapCache.

At freeing swap, we cannot know there is still reference or there is
just a swap cache. This changes swap entry refcnt to be
  - account by 2 at new reference (SWAP_MAP)
  - account by 1 at swap cache.   (SWAP_CACHE)

To do this, adds a new argument to swap alloc/free functions.

After this, if swap_entry_free() returns 1, it means "no reference but
swap cache" state. And this makes
  get_swap_page/swap_duplicate()->add_to_swap_cache()
to be an atomic operation. (means no confilcts in add_to_swap_cache())

Consideration:
This makes SWAP_MAX_MAP to be half. If this is bad, can't we
increase SWAP_MAX_MAP ? (makes SWAP_MAP_BAD to be 0xfff0)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/swap.h  |   20 +++++++++---
 kernel/power/swsusp.c |    6 +--
 mm/memory.c           |    4 +-
 mm/rmap.c             |    2 -
 mm/shmem.c            |   12 +++----
 mm/swap_state.c       |   14 ++++----
 mm/swapfile.c         |   81 +++++++++++++++++++++++++++++++++++---------------
 mm/vmscan.c           |    2 -
 8 files changed, 93 insertions(+), 48 deletions(-)

Index: mmotm-2.6.30-May17/include/linux/swap.h
===================================================================
--- mmotm-2.6.30-May17.orig/include/linux/swap.h
+++ mmotm-2.6.30-May17/include/linux/swap.h
@@ -129,7 +129,12 @@ enum {
 
 #define SWAP_CLUSTER_MAX 32
 
-#define SWAP_MAP_MAX	0x7fff
+/*
+ * Reference to swap is incremented by 2 when new reference comes.
+ * incremented by 1 when swap cache is newly added.
+ * This means the lowest bit of swap_map indicates there is swapcache or not.
+ */
+#define SWAP_MAP_MAX	0x7ffe
 #define SWAP_MAP_BAD	0x8000
 
 /*
@@ -298,11 +303,16 @@ extern struct page *swapin_readahead(swp
 extern long nr_swap_pages;
 extern long total_swap_pages;
 extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(void);
-extern swp_entry_t get_swap_page_of_type(int);
-extern int swap_duplicate(swp_entry_t);
+extern swp_entry_t get_swap_page(int);
+extern swp_entry_t get_swap_page_of_type(int, int);
+
+enum {
+	SWAP_MAP,
+	SWAP_CACHE,
+};
+extern int swap_duplicate(swp_entry_t, int);
 extern int valid_swaphandles(swp_entry_t, unsigned long *);
-extern void swap_free(swp_entry_t);
+extern void swap_free(swp_entry_t, int);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
Index: mmotm-2.6.30-May17/mm/memory.c
===================================================================
--- mmotm-2.6.30-May17.orig/mm/memory.c
+++ mmotm-2.6.30-May17/mm/memory.c
@@ -552,7 +552,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 		if (!pte_file(pte)) {
 			swp_entry_t entry = pte_to_swp_entry(pte);
 
-			swap_duplicate(entry);
+			swap_duplicate(entry, SWAP_MAP);
 			/* make sure dst_mm is on swapoff's mmlist. */
 			if (unlikely(list_empty(&dst_mm->mmlist))) {
 				spin_lock(&mmlist_lock);
@@ -2670,7 +2670,7 @@ static int do_swap_page(struct mm_struct
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
-	swap_free(entry);
+	swap_free(entry, SWAP_MAP);
 	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
 		try_to_free_swap(page);
 	unlock_page(page);
Index: mmotm-2.6.30-May17/mm/rmap.c
===================================================================
--- mmotm-2.6.30-May17.orig/mm/rmap.c
+++ mmotm-2.6.30-May17/mm/rmap.c
@@ -949,7 +949,7 @@ static int try_to_unmap_one(struct page 
 			 * Store the swap location in the pte.
 			 * See handle_pte_fault() ...
 			 */
-			swap_duplicate(entry);
+			swap_duplicate(entry, SWAP_MAP);
 			if (list_empty(&mm->mmlist)) {
 				spin_lock(&mmlist_lock);
 				if (list_empty(&mm->mmlist))
Index: mmotm-2.6.30-May17/mm/shmem.c
===================================================================
--- mmotm-2.6.30-May17.orig/mm/shmem.c
+++ mmotm-2.6.30-May17/mm/shmem.c
@@ -986,7 +986,7 @@ found:
 		set_page_dirty(page);
 		info->flags |= SHMEM_PAGEIN;
 		shmem_swp_set(info, ptr, 0);
-		swap_free(entry);
+		swap_free(entry, SWAP_MAP);
 		error = 1;	/* not an error, but entry was found */
 	}
 	if (ptr)
@@ -1051,7 +1051,7 @@ static int shmem_writepage(struct page *
 	 * want to check if there's a redundant swappage to be discarded.
 	 */
 	if (wbc->for_reclaim)
-		swap = get_swap_page();
+		swap = get_swap_page(SWAP_CACHE);
 	else
 		swap.val = 0;
 
@@ -1080,7 +1080,7 @@ static int shmem_writepage(struct page *
 		else
 			inode = NULL;
 		spin_unlock(&info->lock);
-		swap_duplicate(swap);
+		swap_duplicate(swap, SWAP_MAP);
 		BUG_ON(page_mapped(page));
 		page_cache_release(page);	/* pagecache ref */
 		swap_writepage(page, wbc);
@@ -1097,7 +1097,7 @@ static int shmem_writepage(struct page *
 	shmem_swp_unmap(entry);
 unlock:
 	spin_unlock(&info->lock);
-	swap_free(swap);
+	swap_free(swap, SWAP_CACHE);
 redirty:
 	set_page_dirty(page);
 	if (wbc->for_reclaim)
@@ -1325,7 +1325,7 @@ repeat:
 			flush_dcache_page(filepage);
 			SetPageUptodate(filepage);
 			set_page_dirty(filepage);
-			swap_free(swap);
+			swap_free(swap, SWAP_MAP);
 		} else if (!(error = add_to_page_cache_locked(swappage, mapping,
 					idx, GFP_NOWAIT))) {
 			info->flags |= SHMEM_PAGEIN;
@@ -1335,7 +1335,7 @@ repeat:
 			spin_unlock(&info->lock);
 			filepage = swappage;
 			set_page_dirty(filepage);
-			swap_free(swap);
+			swap_free(swap, SWAP_MAP);
 		} else {
 			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
Index: mmotm-2.6.30-May17/mm/swap_state.c
===================================================================
--- mmotm-2.6.30-May17.orig/mm/swap_state.c
+++ mmotm-2.6.30-May17/mm/swap_state.c
@@ -138,7 +138,7 @@ int add_to_swap(struct page *page)
 	VM_BUG_ON(!PageUptodate(page));
 
 	for (;;) {
-		entry = get_swap_page();
+		entry = get_swap_page(SWAP_CACHE);
 		if (!entry.val)
 			return 0;
 
@@ -161,12 +161,12 @@ int add_to_swap(struct page *page)
 			SetPageDirty(page);
 			return 1;
 		case -EEXIST:
-			/* Raced with "speculative" read_swap_cache_async */
-			swap_free(entry);
+			/* Raced with "speculative" read_swap_cache_async ? */
+			swap_free(entry, SWAP_CACHE);
 			continue;
 		default:
 			/* -ENOMEM radix-tree allocation failure */
-			swap_free(entry);
+			swap_free(entry, SWAP_CACHE);
 			return 0;
 		}
 	}
@@ -189,7 +189,7 @@ void delete_from_swap_cache(struct page 
 	spin_unlock_irq(&swapper_space.tree_lock);
 
 	mem_cgroup_uncharge_swapcache(page, entry);
-	swap_free(entry);
+	swap_free(entry, SWAP_CACHE);
 	page_cache_release(page);
 }
 
@@ -293,7 +293,7 @@ struct page *read_swap_cache_async(swp_e
 		/*
 		 * Swap entry may have been freed since our caller observed it.
 		 */
-		if (!swap_duplicate(entry))
+		if (!swap_duplicate(entry, SWAP_CACHE))
 			break;
 
 		/*
@@ -317,7 +317,7 @@ struct page *read_swap_cache_async(swp_e
 		}
 		ClearPageSwapBacked(new_page);
 		__clear_page_locked(new_page);
-		swap_free(entry);
+		swap_free(entry, SWAP_CACHE);
 	} while (err != -ENOMEM);
 
 	if (new_page)
Index: mmotm-2.6.30-May17/mm/swapfile.c
===================================================================
--- mmotm-2.6.30-May17.orig/mm/swapfile.c
+++ mmotm-2.6.30-May17/mm/swapfile.c
@@ -167,7 +167,7 @@ static int wait_for_discard(void *word)
 #define SWAPFILE_CLUSTER	256
 #define LATENCY_LIMIT		256
 
-static inline unsigned long scan_swap_map(struct swap_info_struct *si)
+static inline unsigned long scan_swap_map(struct swap_info_struct *si, int ops)
 {
 	unsigned long offset;
 	unsigned long scan_base;
@@ -285,7 +285,12 @@ checks:
 		si->lowest_bit = si->max;
 		si->highest_bit = 0;
 	}
-	si->swap_map[offset] = 1;
+
+	if (ops == SWAP_CACHE)
+		si->swap_map[offset] = 1; /* usually start from swap-cache */
+	else
+		si->swap_map[offset] = 2; /* swsusp does this. */
+
 	si->cluster_next = offset + 1;
 	si->flags -= SWP_SCANNING;
 
@@ -374,7 +379,7 @@ no_page:
 	return 0;
 }
 
-swp_entry_t get_swap_page(void)
+swp_entry_t get_swap_page(int ops)
 {
 	struct swap_info_struct *si;
 	pgoff_t offset;
@@ -401,7 +406,7 @@ swp_entry_t get_swap_page(void)
 			continue;
 
 		swap_list.next = next;
-		offset = scan_swap_map(si);
+		offset = scan_swap_map(si, ops);
 		if (offset) {
 			spin_unlock(&swap_lock);
 			return swp_entry(type, offset);
@@ -415,7 +420,7 @@ noswap:
 	return (swp_entry_t) {0};
 }
 
-swp_entry_t get_swap_page_of_type(int type)
+swp_entry_t get_swap_page_of_type(int type, int ops)
 {
 	struct swap_info_struct *si;
 	pgoff_t offset;
@@ -424,7 +429,7 @@ swp_entry_t get_swap_page_of_type(int ty
 	si = swap_info + type;
 	if (si->flags & SWP_WRITEOK) {
 		nr_swap_pages--;
-		offset = scan_swap_map(si);
+		offset = scan_swap_map(si, ops);
 		if (offset) {
 			spin_unlock(&swap_lock);
 			return swp_entry(type, offset);
@@ -471,13 +476,17 @@ out:
 	return NULL;
 }
 
-static int swap_entry_free(struct swap_info_struct *p, swp_entry_t ent)
+static int
+swap_entry_free(struct swap_info_struct *p, swp_entry_t ent, int ops)
 {
 	unsigned long offset = swp_offset(ent);
 	int count = p->swap_map[offset];
 
 	if (count < SWAP_MAP_MAX) {
-		count--;
+		if (ops == SWAP_CACHE)
+			count -= 1;
+		else
+			count -= 2;
 		p->swap_map[offset] = count;
 		if (!count) {
 			if (offset < p->lowest_bit)
@@ -498,13 +507,13 @@ static int swap_entry_free(struct swap_i
  * Caller has made sure that the swapdevice corresponding to entry
  * is still around or has not been recycled.
  */
-void swap_free(swp_entry_t entry)
+void swap_free(swp_entry_t entry, int ops)
 {
 	struct swap_info_struct * p;
 
 	p = swap_info_get(entry);
 	if (p) {
-		swap_entry_free(p, entry);
+		swap_entry_free(p, entry, ops);
 		spin_unlock(&swap_lock);
 	}
 }
@@ -584,7 +593,7 @@ int free_swap_and_cache(swp_entry_t entr
 
 	p = swap_info_get(entry);
 	if (p) {
-		if (swap_entry_free(p, entry) == 1) {
+		if (swap_entry_free(p, entry, SWAP_MAP) == 1) {
 			page = find_get_page(&swapper_space, entry.val);
 			if (page && !trylock_page(page)) {
 				page_cache_release(page);
@@ -717,7 +726,7 @@ static int unuse_pte(struct vm_area_stru
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	page_add_anon_rmap(page, vma, addr);
 	mem_cgroup_commit_charge_swapin(page, ptr);
-	swap_free(entry);
+	swap_free(entry, SWAP_MAP);
 	/*
 	 * Move the page to the active list so it is not
 	 * immediately swapped out again after swapon.
@@ -1069,9 +1078,13 @@ static int try_to_unuse(unsigned int typ
 		 * We know "Undead"s can happen, they're okay, so don't
 		 * report them; but do report if we reset SWAP_MAP_MAX.
 		 */
-		if (*swap_map == SWAP_MAP_MAX) {
+		if ((*swap_map == SWAP_MAP_MAX) ||
+		    (*swap_map == SWAP_MAP_MAX+1)) {
 			spin_lock(&swap_lock);
-			*swap_map = 1;
+			if (*swap_map == SWAP_MAP_MAX)
+				*swap_map = 2; /* there isn't a swap cache */
+			else
+				*swap_map = 3; /* there is a swap cache */
 			spin_unlock(&swap_lock);
 			reset_overflow = 1;
 		}
@@ -1939,11 +1952,13 @@ void si_swapinfo(struct sysinfo *val)
 
 /*
  * Verify that a swap entry is valid and increment its swap map count.
- *
+ * If new reference is for new map, increment by 2.(type=SWAP_MAP)
+ * If new reference is for swap cache, increment by 1 (type = SWAP_CACHE)
  * Note: if swap_map[] reaches SWAP_MAP_MAX the entries are treated as
  * "permanent", but will be reclaimed by the next swapoff.
+ *
  */
-int swap_duplicate(swp_entry_t entry)
+int swap_duplicate(swp_entry_t entry, int ops)
 {
 	struct swap_info_struct * p;
 	unsigned long offset, type;
@@ -1959,15 +1974,35 @@ int swap_duplicate(swp_entry_t entry)
 	offset = swp_offset(entry);
 
 	spin_lock(&swap_lock);
-	if (offset < p->max && p->swap_map[offset]) {
-		if (p->swap_map[offset] < SWAP_MAP_MAX - 1) {
-			p->swap_map[offset]++;
-			result = 1;
-		} else if (p->swap_map[offset] <= SWAP_MAP_MAX) {
+	/*
+	 * When we tries to create new SwapCache, increment count by 1.
+	 * When we adds new reference to swap entry, increment count by 2.
+	 * If type==SWAP_CACHE and swap_map[] shows there is a swap cache,
+	 * it means racy swapin. The caller should cancel his work.
+	 */
+	if (offset < p->max && (p->swap_map[offset])) {
+		if (p->swap_map[offset] < SWAP_MAP_MAX - 2) {
+			if (ops == SWAP_CACHE) {
+				if (!(p->swap_map[offset] & 0x1)) {
+					p->swap_map[offset] += 1;
+					result = 1;
+				}
+			} else {
+				p->swap_map[offset] += 2;
+				result = 1;
+			}
+		} else if (p->swap_map[offset] <= SWAP_MAP_MAX - 1) {
 			if (swap_overflow++ < 5)
 				printk(KERN_WARNING "swap_dup: swap entry overflow\n");
-			p->swap_map[offset] = SWAP_MAP_MAX;
-			result = 1;
+			if (ops == SWAP_CACHE) {
+				if (!(p->swap_map[offset] & 0x1)) {
+					p->swap_map[offset] += 1;
+					result = 1;
+				}
+			} else {
+				p->swap_map[offset] += 2;
+				result = 1;
+			}
 		}
 	}
 	spin_unlock(&swap_lock);
Index: mmotm-2.6.30-May17/mm/vmscan.c
===================================================================
--- mmotm-2.6.30-May17.orig/mm/vmscan.c
+++ mmotm-2.6.30-May17/mm/vmscan.c
@@ -478,7 +478,7 @@ static int __remove_mapping(struct addre
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_swapcache(page, swap);
-		swap_free(swap);
+		swap_free(swap, SWAP_CACHE);
 	} else {
 		__remove_from_page_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
Index: mmotm-2.6.30-May17/kernel/power/swsusp.c
===================================================================
--- mmotm-2.6.30-May17.orig/kernel/power/swsusp.c
+++ mmotm-2.6.30-May17/kernel/power/swsusp.c
@@ -120,10 +120,10 @@ sector_t alloc_swapdev_block(int swap)
 {
 	unsigned long offset;
 
-	offset = swp_offset(get_swap_page_of_type(swap));
+	offset = swp_offset(get_swap_page_of_type(swap, SWAP_MAP));
 	if (offset) {
 		if (swsusp_extents_insert(offset))
-			swap_free(swp_entry(swap, offset));
+			swap_free(swp_entry(swap, offset), SWAP_MAP);
 		else
 			return swapdev_block(swap, offset);
 	}
@@ -147,7 +147,7 @@ void free_all_swap_pages(int swap)
 		ext = container_of(node, struct swsusp_extent, node);
 		rb_erase(node, &swsusp_extents);
 		for (offset = ext->start; offset <= ext->end; offset++)
-			swap_free(swp_entry(swap, offset));
+			swap_free(swp_entry(swap, offset), SWAP_MAP);
 
 		kfree(ext);
 	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC][PATCH 2/2] synchrouns swap freeing without trylock.
  2009-05-21  7:41 [RFC][PATCH] synchrouns swap freeing at zapping vmas KAMEZAWA Hiroyuki
  2009-05-21  7:43 ` [RFC][PATCH 1/2] change swapcount handling KAMEZAWA Hiroyuki
@ 2009-05-21  7:43 ` KAMEZAWA Hiroyuki
  2009-05-21 12:44   ` Johannes Weiner
  2009-05-21 21:00 ` [RFC][PATCH] synchrouns swap freeing at zapping vmas Hugh Dickins
  2009-05-22  4:39 ` Daisuke Nishimura
  3 siblings, 1 reply; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-21  7:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura, hugh, balbir, akpm

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

While unmap/exiting, zap_page_range() is called and pages on page tables
and swp_entries on it are freed.

But At unmapping, all codes are under preepmt_disable and we can't call
functions which may sleep. (because of tlb_xxxx functions.)

By this limitation, free_swap_and_cache() called by zap_pte_range() uses
trylock() and this creates race-window between other swap ops. At last,
memcg has to handle this kind of "not used but exists as cache" swap entries.

This patch tries to remove trylock() for freeing SwapCache under
zap_page_range(). At freeing swap entry in page table,
"If there are no other refernce than swap cache", the function remember it
into stale_swap_buffer and free it later after exiting preempt disable state.


Maybe there are some more points to be cleaned up.
(And this patch is a little larger than I expected...)
Any comments are welcome.
Comments like "you need more explanation here." is helpful.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/swap.h |    8 ++
 mm/fremap.c          |   26 +++++++-
 mm/memory.c          |  126 ++++++++++++++++++++++++++++++++++++++---
 mm/shmem.c           |   27 +++++++-
 mm/swap_state.c      |   16 ++++-
 mm/swapfile.c        |  154 +++++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 336 insertions(+), 21 deletions(-)

Index: mmotm-2.6.30-May17/mm/fremap.c
===================================================================
--- mmotm-2.6.30-May17.orig/mm/fremap.c
+++ mmotm-2.6.30-May17/mm/fremap.c
@@ -24,7 +24,7 @@
 #include "internal.h"
 
 static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long addr, pte_t *ptep)
+		    unsigned long addr, pte_t *ptep, swp_entry_t *swp)
 {
 	pte_t pte = *ptep;
 
@@ -43,8 +43,15 @@ static void zap_pte(struct mm_struct *mm
 			dec_mm_counter(mm, file_rss);
 		}
 	} else {
-		if (!pte_file(pte))
-			free_swap_and_cache(pte_to_swp_entry(pte));
+		if (!pte_file(pte)) {
+			if (free_swap_and_check(pte_to_swp_entry(pte)) == 1) {
+				/*
+				 * This swap entry has a swap cache and it can
+				 * be freed.
+				 */
+				*swp = pte_to_swp_entry(pte);
+			}
+		}
 		pte_clear_not_present_full(mm, addr, ptep, 0);
 	}
 }
@@ -59,13 +66,15 @@ static int install_file_pte(struct mm_st
 	int err = -ENOMEM;
 	pte_t *pte;
 	spinlock_t *ptl;
+	swp_entry_t swp;
 
+	swp.val = ~0UL;
 	pte = get_locked_pte(mm, addr, &ptl);
 	if (!pte)
 		goto out;
 
 	if (!pte_none(*pte))
-		zap_pte(mm, vma, addr, pte);
+		zap_pte(mm, vma, addr, pte, &swp);
 
 	set_pte_at(mm, addr, pte, pgoff_to_pte(pgoff));
 	/*
@@ -77,6 +86,15 @@ static int install_file_pte(struct mm_st
 	 */
 	pte_unmap_unlock(pte, ptl);
 	err = 0;
+	if (swp.val != ~0UL) {
+		struct page *page;
+
+		page = find_get_page(&swapper_space, swp.val);
+		lock_page(page);
+		try_to_free_swap(page);
+		unlock_page(page);
+		page_cache_release(page);
+	}
 out:
 	return err;
 }
Index: mmotm-2.6.30-May17/include/linux/swap.h
===================================================================
--- mmotm-2.6.30-May17.orig/include/linux/swap.h
+++ mmotm-2.6.30-May17/include/linux/swap.h
@@ -291,6 +291,7 @@ extern int add_to_swap(struct page *);
 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
 extern void __delete_from_swap_cache(struct page *);
 extern void delete_from_swap_cache(struct page *);
+extern void delete_from_swap_cache_keep_swap(struct page *);
 extern void free_page_and_swap_cache(struct page *);
 extern void free_pages_and_swap_cache(struct page **, int);
 extern struct page *lookup_swap_cache(swp_entry_t);
@@ -314,6 +315,9 @@ extern int swap_duplicate(swp_entry_t, i
 extern int valid_swaphandles(swp_entry_t, unsigned long *);
 extern void swap_free(swp_entry_t, int);
 extern int free_swap_and_cache(swp_entry_t);
+extern int free_swap_and_check(swp_entry_t);
+extern void free_swap_batch(int, swp_entry_t *);
+extern int try_free_swap_and_cache_atomic(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct swap_info_struct *, pgoff_t);
@@ -382,6 +386,10 @@ static inline void show_swap_cache_info(
 #define free_swap_and_cache(swp)	is_migration_entry(swp)
 #define swap_duplicate(swp)		is_migration_entry(swp)
 
+static inline void swap_free_batch(int swaps, swp_entry_t *swaps)
+{
+}
+
 static inline void swap_free(swp_entry_t swp)
 {
 }
Index: mmotm-2.6.30-May17/mm/memory.c
===================================================================
--- mmotm-2.6.30-May17.orig/mm/memory.c
+++ mmotm-2.6.30-May17/mm/memory.c
@@ -758,10 +758,84 @@ int copy_page_range(struct mm_struct *ds
 	return ret;
 }
 
+
+/*
+ * Because we are under preempt_disable (see tlb_xxx functions), we can't call
+ * lcok_page() etc..which may sleep. At freeing swap, gatering swp_entry
+ * which seems of-no-use but has swap cache to this struct and remove them
+ * in batch. Because the condition to gather swp_entry to this bix is
+ * - There is no other swap reference. &&
+ * - There is a swap cache. &&
+ * - Page table entry was "Not Present"
+ * The number of entries which is caught in this is very small.
+ */
+#define NR_SWAP_FREE_BATCH		(63)
+struct stale_swap_buffer {
+	int nr;
+	swp_entry_t ents[NR_SWAP_FREE_BATCH];
+};
+
+#ifdef CONFIG_SWAP
+static inline void push_swap_ssb(struct stale_swap_buffer *ssb, swp_entry_t ent)
+{
+	if (!ssb)
+		return;
+	ssb->ents[ssb->nr++] = ent;
+}
+
+static inline int ssb_full(struct stale_swap_buffer *ssb)
+{
+	if (!ssb)
+		return 0;
+	return ssb->nr == NR_SWAP_FREE_BATCH;
+}
+
+static void free_stale_swaps(struct stale_swap_buffer *ssb)
+{
+	if (!ssb || !ssb->nr)
+		return;
+	free_swap_batch(ssb->nr, ssb->ents);
+	ssb->nr = 0;
+}
+
+static struct stale_swap_buffer *alloc_ssb(void)
+{
+	/*
+	 * Considering the case zap_xxx can be called as a result of OOM,
+	 * gfp_mask here should be GFP_ATOMIC. Even if we fails to allocate,
+	 * global LRU can find and remove stale swap caches in such case.
+	 */
+	return kzalloc(sizeof(struct stale_swap_buffer), GFP_ATOMIC);
+}
+static inline void free_ssb(struct stale_swap_buffer *ssb)
+{
+	kfree(ssb);
+}
+#else
+static inline void push_swap_ssb(struct stale_swap_buffer *ssb, swp_entry_t ent)
+{
+}
+static inline int ssb_full(struct stale_swap_buufer *ssb)
+{
+	return 0;
+}
+static inline void free_stale_swaps(struct stale_swap_buffer *ssb)
+{
+}
+static inline struct stale_swap_buffer *alloc_ssb(void)
+{
+	return NULL;
+}
+static inline void free_ssb(struct stale_swap_buffer *ssb)
+{
+}
+#endif
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
-				long *zap_work, struct zap_details *details)
+				long *zap_work, struct zap_details *details,
+				struct stale_swap_buffer *ssb)
 {
 	struct mm_struct *mm = tlb->mm;
 	pte_t *pte;
@@ -837,8 +911,17 @@ static unsigned long zap_pte_range(struc
 		if (pte_file(ptent)) {
 			if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
 				print_bad_pte(vma, addr, ptent, NULL);
-		} else if
-		  (unlikely(!free_swap_and_cache(pte_to_swp_entry(ptent))))
+		} else if (likely(ssb)) {
+			int ret = free_swap_and_check(pte_to_swp_entry(ptent));
+			if (unlikely(!ret))
+				print_bad_pte(vma, addr, ptent, NULL);
+			if (ret == 1) {
+				push_swap_ssb(ssb, pte_to_swp_entry(ptent));
+				/* need to free swaps ? */
+				if (ssb_full(ssb))
+					*zap_work = 0;
+			}
+		} else if (free_swap_and_cache(pte_to_swp_entry(ptent)))
 			print_bad_pte(vma, addr, ptent, NULL);
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 	} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
@@ -853,7 +936,8 @@ static unsigned long zap_pte_range(struc
 static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pud_t *pud,
 				unsigned long addr, unsigned long end,
-				long *zap_work, struct zap_details *details)
+				long *zap_work, struct zap_details *details,
+				struct stale_swap_buffer *ssb)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -866,7 +950,7 @@ static inline unsigned long zap_pmd_rang
 			continue;
 		}
 		next = zap_pte_range(tlb, vma, pmd, addr, next,
-						zap_work, details);
+						zap_work, details, ssb);
 	} while (pmd++, addr = next, (addr != end && *zap_work > 0));
 
 	return addr;
@@ -875,7 +959,8 @@ static inline unsigned long zap_pmd_rang
 static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pgd_t *pgd,
 				unsigned long addr, unsigned long end,
-				long *zap_work, struct zap_details *details)
+				long *zap_work, struct zap_details *details,
+				struct stale_swap_buffer *ssb)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -888,7 +973,7 @@ static inline unsigned long zap_pud_rang
 			continue;
 		}
 		next = zap_pmd_range(tlb, vma, pud, addr, next,
-						zap_work, details);
+						zap_work, details, ssb);
 	} while (pud++, addr = next, (addr != end && *zap_work > 0));
 
 	return addr;
@@ -897,7 +982,8 @@ static inline unsigned long zap_pud_rang
 static unsigned long unmap_page_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma,
 				unsigned long addr, unsigned long end,
-				long *zap_work, struct zap_details *details)
+				long *zap_work, struct zap_details *details,
+				struct stale_swap_buffer *ssb)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -915,7 +1001,7 @@ static unsigned long unmap_page_range(st
 			continue;
 		}
 		next = zap_pud_range(tlb, vma, pgd, addr, next,
-						zap_work, details);
+						zap_work, details, ssb);
 	} while (pgd++, addr = next, (addr != end && *zap_work > 0));
 	tlb_end_vma(tlb, vma);
 
@@ -967,6 +1053,15 @@ unsigned long unmap_vmas(struct mmu_gath
 	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
 	int fullmm = (*tlbp)->fullmm;
 	struct mm_struct *mm = vma->vm_mm;
+	struct stale_swap_buffer *ssb = NULL;
+
+	/*
+	 * At freeing gatherd stale swap, we may sleep.In that case, we can't
+	 * handle spinlock_break. But, If !details, we don't free swap entry.
+	 * (see zap_pte_range())
+	 */
+	if (!i_mmap_lock)
+		ssb = alloc_ssb();
 
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
@@ -1012,7 +1107,7 @@ unsigned long unmap_vmas(struct mmu_gath
 				start = end;
 			} else
 				start = unmap_page_range(*tlbp, vma,
-						start, end, &zap_work, details);
+					 start, end, &zap_work, details, ssb);
 
 			if (zap_work > 0) {
 				BUG_ON(start != end);
@@ -1021,13 +1116,15 @@ unsigned long unmap_vmas(struct mmu_gath
 
 			tlb_finish_mmu(*tlbp, tlb_start, start);
 
-			if (need_resched() ||
+			if (need_resched() || ssb_full(ssb) ||
 				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
 				if (i_mmap_lock) {
 					*tlbp = NULL;
 					goto out;
 				}
 				cond_resched();
+				/* This call may sleep */
+				free_stale_swaps(ssb);
 			}
 
 			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
@@ -1037,6 +1134,13 @@ unsigned long unmap_vmas(struct mmu_gath
 	}
 out:
 	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+	/* there is stale swap cache. We may sleep and release per-cpu.*/
+	if (ssb && ssb->nr) {
+		tlb_finish_mmu(*tlbp, tlb_start, start);
+		free_stale_swaps(ssb);
+		*tlbp = tlb_gather_mmu(mm, fullmm);
+	}
+	free_ssb(ssb);
 	return start;	/* which is now the end (or restart) address */
 }
 
Index: mmotm-2.6.30-May17/mm/shmem.c
===================================================================
--- mmotm-2.6.30-May17.orig/mm/shmem.c
+++ mmotm-2.6.30-May17/mm/shmem.c
@@ -466,14 +466,22 @@ static swp_entry_t *shmem_swp_alloc(stru
  * @edir:       pointer after last entry of the directory
  * @punch_lock: pointer to spinlock when needed for the holepunch case
  */
+#define SWAP_FREE_BATCH (16)
 static int shmem_free_swp(swp_entry_t *dir, swp_entry_t *edir,
 						spinlock_t *punch_lock)
 {
 	spinlock_t *punch_unlock = NULL;
+	spinlock_t *punch_lock_saved = punch_lock;
 	swp_entry_t *ptr;
+	swp_entry_t swp[SWAP_FREE_BATCH];
 	int freed = 0;
+	int swaps;
 
-	for (ptr = dir; ptr < edir; ptr++) {
+	ptr = dir;
+again:
+	swaps = 0;
+	punch_lock = punch_lock_saved;
+	for (; swaps < SWAP_FREE_BATCH && ptr < edir; ptr++) {
 		if (ptr->val) {
 			if (unlikely(punch_lock)) {
 				punch_unlock = punch_lock;
@@ -482,13 +490,21 @@ static int shmem_free_swp(swp_entry_t *d
 				if (!ptr->val)
 					continue;
 			}
-			free_swap_and_cache(*ptr);
+			if (free_swap_and_check(*ptr) == 1)
+				swp[swaps++] = *ptr;
 			*ptr = (swp_entry_t){0};
 			freed++;
 		}
 	}
 	if (punch_unlock)
 		spin_unlock(punch_unlock);
+
+	if (swaps) {
+		/* Drop swap caches if we can */
+		free_swap_batch(swaps, swp);
+		if (ptr < edir)
+			goto again;
+	}
 	return freed;
 }
 
@@ -1065,8 +1081,10 @@ static int shmem_writepage(struct page *
 		/*
 		 * The more uptodate page coming down from a stacked
 		 * writepage should replace our old swappage.
+		 * But we can do only trylock on this. so call try_free.
 		 */
-		free_swap_and_cache(*entry);
+		if (try_free_swap_and_cache_atomic(*entry))
+			goto unmap_unlock;
 		shmem_swp_set(info, entry, 0);
 	}
 	shmem_recalc_inode(inode);
@@ -1093,11 +1111,12 @@ static int shmem_writepage(struct page *
 		}
 		return 0;
 	}
-
+unmap_unlock:
 	shmem_swp_unmap(entry);
 unlock:
 	spin_unlock(&info->lock);
 	swap_free(swap, SWAP_CACHE);
+
 redirty:
 	set_page_dirty(page);
 	if (wbc->for_reclaim)
Index: mmotm-2.6.30-May17/mm/swapfile.c
===================================================================
--- mmotm-2.6.30-May17.orig/mm/swapfile.c
+++ mmotm-2.6.30-May17/mm/swapfile.c
@@ -582,6 +582,7 @@ int try_to_free_swap(struct page *page)
 /*
  * Free the swap entry like above, but also try to
  * free the page cache entry if it is the last user.
+ * Because this uses trylock, "entry" may not be freed.
  */
 int free_swap_and_cache(swp_entry_t entry)
 {
@@ -618,6 +619,159 @@ int free_swap_and_cache(swp_entry_t entr
 	return p != NULL;
 }
 
+/*
+ * Free the swap entry like above, but
+ * returns 1 if swap entry has swap cache and ready to be freed.
+ * returns 2 if swap has other references.
+ */
+int free_swap_and_check(swp_entry_t entry)
+{
+	struct swap_info_struct *p;
+	int ret = 0;
+
+	if (is_migration_entry(entry))
+		return 2;
+
+	p = swap_info_get(entry);
+	if (!p)
+		return ret;
+	if (swap_entry_free(p, entry, SWAP_MAP) == 1)
+		ret = 1;
+	else
+		ret = 2;
+	spin_unlock(&swap_lock);
+
+	return ret;
+}
+
+/*
+ * The caller must guarantee that no other one don:t increase SWAP_MAP
+ * reference at this call. This function frees a swap cache and a swap entry
+ * with guarantee that
+ *   - free swap cache and entry only when refcnt goes down to 0.
+ * returns 0 if success. returns 1 if busy.
+ */
+int try_free_swap_and_cache_atomic(swp_entry_t entry)
+{
+	struct swap_info_struct *p;
+	struct page *page;
+	int count, cache_released = 0;
+
+	page = find_get_page(&swapper_space, entry.val);
+	if (page) {
+		if (!trylock_page(page)) {
+			page_cache_release(page);
+			return 1;
+		}
+		/* Under contention ? */
+		if (!PageSwapCache(page) || PageWriteback(page)) {
+			unlock_page(page);
+			page_cache_release(page);
+			return 1;
+		}
+		count = page_swapcount(page);
+		if (count != 2) { /* SWAP_CACHE + SWAP_MAP */
+			/*
+			 * seems to have another reference. So, the caller
+			 * failed to guarantee "no extra refence" to swap.
+			 */
+			unlock_page(page);
+			page_cache_release(page);
+			return 1;
+		}
+		/* This delete_from_swap_cache doesn't drop SWAP_CACHE ref */
+		delete_from_swap_cache_keep_swap(page);
+		SetPageDirty(page);
+		unlock_page(page);
+		page_cache_release(page);
+		cache_released = 1;
+		p = swap_info_get(entry);
+	} else {
+		p = swap_info_get(entry);
+		count = p->swap_map[swp_offset(entry)];
+		if (count > 2) {
+			/*
+			 * seems to have another reference. So, the caller
+			 * failed to guarantee "no extra refence" to swap.
+			 */
+			spin_unlock(&swap_lock);
+			return 1;
+		}
+	}
+	/* Drop all refs at once */
+	swap_entry_free(p, entry, SWAP_MAP);
+	/*
+	 * Free SwapCache reference at last (this prevents to create new
+	 * swap cache to this entry).
+	 */
+	if (cache_released)
+		swap_entry_free(p, entry, SWAP_CACHE);
+	spin_unlock(&swap_lock);
+	return 0;
+}
+
+
+/*
+ * Free swap cache in syncronous way.
+ */
+#ifdef CONFIG_CGROUP_MEM_RES_CTRL
+static int check_and_wait_swap_free(swp_entry_t entry)
+{
+	int count = 0;
+	struct swap_info_struct *p;
+
+	p = swap_info_get(entry);
+	if (!p)
+		return 0;
+	count = p->swap_map[swp_offset(entry)];
+	spin_unlock(&swap_lock);
+	if (count == 1) {
+		/*
+		 * in the race window of readahead.(we'll wait in lock_page,
+		 * anyway. So, its ok to do congestion wait here.
+		 */
+		congestion_wait(READ, HZ/10);
+		return 1;
+	}
+	/*
+	 * This means there are another references to this swap.
+	 * or swap is already freed. Do nothing more.
+	 */
+	return 0;
+}
+#else
+static int check_and_wait_swap_free(swp_entry_t entry)
+{
+	return 0;
+}
+#endif
+
+/*
+ * This function is used with free_swap_and_check(). When free_swap_and_check()
+ * returns 1, there are no refence to the swap_entry and we only need to free
+ * swap cache. This function is for freeing SwapCache, not swap.
+ */
+void free_swap_batch(int swaps, swp_entry_t *ents)
+{
+	int i;
+	struct page *page;
+	swp_entry_t entry;
+
+	for (i = 0; i < swaps; i++) {
+		entry = ents[i];
+redo:
+		page = find_get_page(&swapper_space, entry.val);
+		if (likely(page)) {
+			lock_page(page);
+			/* try_to_free_swap does all necessary checks. */
+			try_to_free_swap(page);
+			unlock_page(page);
+			page_cache_release(page);
+		} else if (check_and_wait_swap_free(entry))
+				goto redo;
+	}
+}
+
 #ifdef CONFIG_HIBERNATION
 /*
  * Find the swap type that corresponds to given device (if any).
Index: mmotm-2.6.30-May17/mm/swap_state.c
===================================================================
--- mmotm-2.6.30-May17.orig/mm/swap_state.c
+++ mmotm-2.6.30-May17/mm/swap_state.c
@@ -178,7 +178,7 @@ int add_to_swap(struct page *page)
  * It will never put the page into the free list,
  * the caller has a reference on the page.
  */
-void delete_from_swap_cache(struct page *page)
+static void delete_from_swap_cache_internal(struct page *page, int freeswap)
 {
 	swp_entry_t entry;
 
@@ -189,10 +189,22 @@ void delete_from_swap_cache(struct page 
 	spin_unlock_irq(&swapper_space.tree_lock);
 
 	mem_cgroup_uncharge_swapcache(page, entry);
-	swap_free(entry, SWAP_CACHE);
+	if (freeswap)
+		swap_free(entry, SWAP_CACHE);
 	page_cache_release(page);
 }
 
+void delete_from_swap_cache(struct page *page)
+{
+	delete_from_swap_cache_internal(page, 1);
+}
+
+void delete_from_swap_cache_keep_swap(struct page *page)
+{
+	delete_from_swap_cache_internal(page, 0);
+}
+
+
 /* 
  * If we are the only user, then try to free up the swap cache. 
  * 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC][PATCH 2/2] synchrouns swap freeing without trylock.
  2009-05-21  7:43 ` [RFC][PATCH 2/2] synchrouns swap freeing without trylock KAMEZAWA Hiroyuki
@ 2009-05-21 12:44   ` Johannes Weiner
  2009-05-21 23:46     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2009-05-21 12:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura, hugh, balbir, akpm

On Thu, May 21, 2009 at 04:43:46PM +0900, KAMEZAWA Hiroyuki wrote:
> Index: mmotm-2.6.30-May17/mm/memory.c
> ===================================================================
> --- mmotm-2.6.30-May17.orig/mm/memory.c
> +++ mmotm-2.6.30-May17/mm/memory.c
> @@ -758,10 +758,84 @@ int copy_page_range(struct mm_struct *ds
>  	return ret;
>  }
>  
> +
> +/*
> + * Because we are under preempt_disable (see tlb_xxx functions), we can't call
> + * lcok_page() etc..which may sleep. At freeing swap, gatering swp_entry
> + * which seems of-no-use but has swap cache to this struct and remove them
> + * in batch. Because the condition to gather swp_entry to this bix is
> + * - There is no other swap reference. &&
> + * - There is a swap cache. &&
> + * - Page table entry was "Not Present"
> + * The number of entries which is caught in this is very small.
> + */
> +#define NR_SWAP_FREE_BATCH		(63)
> +struct stale_swap_buffer {
> +	int nr;
> +	swp_entry_t ents[NR_SWAP_FREE_BATCH];
> +};
> +
> +#ifdef CONFIG_SWAP
> +static inline void push_swap_ssb(struct stale_swap_buffer *ssb, swp_entry_t ent)
> +{
> +	if (!ssb)
> +		return;
> +	ssb->ents[ssb->nr++] = ent;
> +}
> +
> +static inline int ssb_full(struct stale_swap_buffer *ssb)
> +{
> +	if (!ssb)
> +		return 0;
> +	return ssb->nr == NR_SWAP_FREE_BATCH;
> +}
> +
> +static void free_stale_swaps(struct stale_swap_buffer *ssb)
> +{
> +	if (!ssb || !ssb->nr)
> +		return;
> +	free_swap_batch(ssb->nr, ssb->ents);
> +	ssb->nr = 0;
> +}

Could you name it swapvec analogous to pagevec and make the API
similar?

> +static struct stale_swap_buffer *alloc_ssb(void)
> +{
> +	/*
> +	 * Considering the case zap_xxx can be called as a result of OOM,
> +	 * gfp_mask here should be GFP_ATOMIC. Even if we fails to allocate,
> +	 * global LRU can find and remove stale swap caches in such case.
> +	 */
> +	return kzalloc(sizeof(struct stale_swap_buffer), GFP_ATOMIC);
> +}
> +static inline void free_ssb(struct stale_swap_buffer *ssb)
> +{
> +	kfree(ssb);
> +}
> +#else
> +static inline void push_swap_ssb(struct stale_swap_buffer *ssb, swp_entry_t ent)
> +{
> +}
> +static inline int ssb_full(struct stale_swap_buufer *ssb)
> +{
> +	return 0;
> +}
> +static inline void free_stale_swaps(struct stale_swap_buffer *ssb)
> +{
> +}
> +static inline struct stale_swap_buffer *alloc_ssb(void)
> +{
> +	return NULL;
> +}
> +static inline void free_ssb(struct stale_swap_buffer *ssb)
> +{
> +}
> +#endif
> +
>  static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  				struct vm_area_struct *vma, pmd_t *pmd,
>  				unsigned long addr, unsigned long end,
> -				long *zap_work, struct zap_details *details)
> +				long *zap_work, struct zap_details *details,
> +				struct stale_swap_buffer *ssb)
>  {
>  	struct mm_struct *mm = tlb->mm;
>  	pte_t *pte;
> @@ -837,8 +911,17 @@ static unsigned long zap_pte_range(struc
>  		if (pte_file(ptent)) {
>  			if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
>  				print_bad_pte(vma, addr, ptent, NULL);
> -		} else if
> -		  (unlikely(!free_swap_and_cache(pte_to_swp_entry(ptent))))
> +		} else if (likely(ssb)) {
> +			int ret = free_swap_and_check(pte_to_swp_entry(ptent));
> +			if (unlikely(!ret))
> +				print_bad_pte(vma, addr, ptent, NULL);
> +			if (ret == 1) {
> +				push_swap_ssb(ssb, pte_to_swp_entry(ptent));
> +				/* need to free swaps ? */
> +				if (ssb_full(ssb))
> +					*zap_work = 0;

if (!swapvec_add(swapvec, pte_to_swp_entry(ptent)))
	*zap_work = 0;

would look more familiar, I think.

> @@ -1021,13 +1116,15 @@ unsigned long unmap_vmas(struct mmu_gath
>  
>  			tlb_finish_mmu(*tlbp, tlb_start, start);
>  
> -			if (need_resched() ||
> +			if (need_resched() || ssb_full(ssb) ||
>  				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
>  				if (i_mmap_lock) {
>  					*tlbp = NULL;
>  					goto out;
>  				}
>  				cond_resched();
> +				/* This call may sleep */
> +				free_stale_swaps(ssb);

This checks both !!ssb and !!ssb->number in ssb_full() and in
free_stale_swaps().  It's not the only place, by the way.

I think it's better to swap two lines here, doing free_stale_swaps()
before cond_resched().  Because if we are going to sleep, we might as
well be waiting for a page lock meanwhile.

> @@ -1037,6 +1134,13 @@ unsigned long unmap_vmas(struct mmu_gath
>  	}
>  out:
>  	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
> +	/* there is stale swap cache. We may sleep and release per-cpu.*/
> +	if (ssb && ssb->nr) {
> +		tlb_finish_mmu(*tlbp, tlb_start, start);
> +		free_stale_swaps(ssb);
> +		*tlbp = tlb_gather_mmu(mm, fullmm);
> +	}
> +	free_ssb(ssb);
>  	return start;	/* which is now the end (or restart) address */
>  }
>  

> Index: mmotm-2.6.30-May17/mm/swapfile.c
> ===================================================================
> --- mmotm-2.6.30-May17.orig/mm/swapfile.c
> +++ mmotm-2.6.30-May17/mm/swapfile.c

> @@ -618,6 +619,159 @@ int free_swap_and_cache(swp_entry_t entr
>  	return p != NULL;
>  }
>  
> +/*
> + * Free the swap entry like above, but
> + * returns 1 if swap entry has swap cache and ready to be freed.
> + * returns 2 if swap has other references.
> + */
> +int free_swap_and_check(swp_entry_t entry)
> +{
> +	struct swap_info_struct *p;
> +	int ret = 0;
> +
> +	if (is_migration_entry(entry))
> +		return 2;
> +
> +	p = swap_info_get(entry);
> +	if (!p)
> +		return ret;
> +	if (swap_entry_free(p, entry, SWAP_MAP) == 1)
> +		ret = 1;
> +	else
> +		ret = 2;

Wouldn't it be possible to drop the previous patch and in case
swap_entry_free() returns 1, look up the entry in the page cache to
see whether the last user is the cache and not a pte?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC][PATCH 2/2] synchrouns swap freeing without trylock.
  2009-05-21 12:44   ` Johannes Weiner
@ 2009-05-21 23:46     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-21 23:46 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, nishimura, balbir, akpm, hugh.dickins

On Thu, 21 May 2009 14:44:20 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Thu, May 21, 2009 at 04:43:46PM +0900, KAMEZAWA Hiroyuki wrote:
> > Index: mmotm-2.6.30-May17/mm/memory.c
> > ===================================================================
> > --- mmotm-2.6.30-May17.orig/mm/memory.c
> > +++ mmotm-2.6.30-May17/mm/memory.c
> > @@ -758,10 +758,84 @@ int copy_page_range(struct mm_struct *ds
> >  	return ret;
> >  }
> >  
> > +
> > +/*
> > + * Because we are under preempt_disable (see tlb_xxx functions), we can't call
> > + * lcok_page() etc..which may sleep. At freeing swap, gatering swp_entry
> > + * which seems of-no-use but has swap cache to this struct and remove them
> > + * in batch. Because the condition to gather swp_entry to this bix is
> > + * - There is no other swap reference. &&
> > + * - There is a swap cache. &&
> > + * - Page table entry was "Not Present"
> > + * The number of entries which is caught in this is very small.
> > + */
> > +#define NR_SWAP_FREE_BATCH		(63)
> > +struct stale_swap_buffer {
> > +	int nr;
> > +	swp_entry_t ents[NR_SWAP_FREE_BATCH];
> > +};
> > +
> > +#ifdef CONFIG_SWAP
> > +static inline void push_swap_ssb(struct stale_swap_buffer *ssb, swp_entry_t ent)
> > +{
> > +	if (!ssb)
> > +		return;
> > +	ssb->ents[ssb->nr++] = ent;
> > +}
> > +
> > +static inline int ssb_full(struct stale_swap_buffer *ssb)
> > +{
> > +	if (!ssb)
> > +		return 0;
> > +	return ssb->nr == NR_SWAP_FREE_BATCH;
> > +}
> > +
> > +static void free_stale_swaps(struct stale_swap_buffer *ssb)
> > +{
> > +	if (!ssb || !ssb->nr)
> > +		return;
> > +	free_swap_batch(ssb->nr, ssb->ents);
> > +	ssb->nr = 0;
> > +}
> 
> Could you name it swapvec analogous to pagevec and make the API
> similar?
> 
sure.

> > +static struct stale_swap_buffer *alloc_ssb(void)
> > +{
> > +	/*
> > +	 * Considering the case zap_xxx can be called as a result of OOM,
> > +	 * gfp_mask here should be GFP_ATOMIC. Even if we fails to allocate,
> > +	 * global LRU can find and remove stale swap caches in such case.
> > +	 */
> > +	return kzalloc(sizeof(struct stale_swap_buffer), GFP_ATOMIC);
> > +}
> > +static inline void free_ssb(struct stale_swap_buffer *ssb)
> > +{
> > +	kfree(ssb);
> > +}
> > +#else
> > +static inline void push_swap_ssb(struct stale_swap_buffer *ssb, swp_entry_t ent)
> > +{
> > +}
> > +static inline int ssb_full(struct stale_swap_buufer *ssb)
> > +{
> > +	return 0;
> > +}
> > +static inline void free_stale_swaps(struct stale_swap_buffer *ssb)
> > +{
> > +}
> > +static inline struct stale_swap_buffer *alloc_ssb(void)
> > +{
> > +	return NULL;
> > +}
> > +static inline void free_ssb(struct stale_swap_buffer *ssb)
> > +{
> > +}
> > +#endif
> > +
> >  static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >  				struct vm_area_struct *vma, pmd_t *pmd,
> >  				unsigned long addr, unsigned long end,
> > -				long *zap_work, struct zap_details *details)
> > +				long *zap_work, struct zap_details *details,
> > +				struct stale_swap_buffer *ssb)
> >  {
> >  	struct mm_struct *mm = tlb->mm;
> >  	pte_t *pte;
> > @@ -837,8 +911,17 @@ static unsigned long zap_pte_range(struc
> >  		if (pte_file(ptent)) {
> >  			if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
> >  				print_bad_pte(vma, addr, ptent, NULL);
> > -		} else if
> > -		  (unlikely(!free_swap_and_cache(pte_to_swp_entry(ptent))))
> > +		} else if (likely(ssb)) {
> > +			int ret = free_swap_and_check(pte_to_swp_entry(ptent));
> > +			if (unlikely(!ret))
> > +				print_bad_pte(vma, addr, ptent, NULL);
> > +			if (ret == 1) {
> > +				push_swap_ssb(ssb, pte_to_swp_entry(ptent));
> > +				/* need to free swaps ? */
> > +				if (ssb_full(ssb))
> > +					*zap_work = 0;
> 
> if (!swapvec_add(swapvec, pte_to_swp_entry(ptent)))
> 	*zap_work = 0;
> 
> would look more familiar, I think.
> 
sure.

> > @@ -1021,13 +1116,15 @@ unsigned long unmap_vmas(struct mmu_gath
> >  
> >  			tlb_finish_mmu(*tlbp, tlb_start, start);
> >  
> > -			if (need_resched() ||
> > +			if (need_resched() || ssb_full(ssb) ||
> >  				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
> >  				if (i_mmap_lock) {
> >  					*tlbp = NULL;
> >  					goto out;
> >  				}
> >  				cond_resched();
> > +				/* This call may sleep */
> > +				free_stale_swaps(ssb);
> 
> This checks both !!ssb and !!ssb->number in ssb_full() and in
> free_stale_swaps().  It's not the only place, by the way.
> 
> I think it's better to swap two lines here, doing free_stale_swaps()
> before cond_resched().  Because if we are going to sleep, we might as
> well be waiting for a page lock meanwhile.
> 
ok.

> > @@ -1037,6 +1134,13 @@ unsigned long unmap_vmas(struct mmu_gath
> >  	}
> >  out:
> >  	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
> > +	/* there is stale swap cache. We may sleep and release per-cpu.*/
> > +	if (ssb && ssb->nr) {
> > +		tlb_finish_mmu(*tlbp, tlb_start, start);
> > +		free_stale_swaps(ssb);
> > +		*tlbp = tlb_gather_mmu(mm, fullmm);
> > +	}
> > +	free_ssb(ssb);
> >  	return start;	/* which is now the end (or restart) address */
> >  }
> >  
> 
> > Index: mmotm-2.6.30-May17/mm/swapfile.c
> > ===================================================================
> > --- mmotm-2.6.30-May17.orig/mm/swapfile.c
> > +++ mmotm-2.6.30-May17/mm/swapfile.c
> 
> > @@ -618,6 +619,159 @@ int free_swap_and_cache(swp_entry_t entr
> >  	return p != NULL;
> >  }
> >  
> > +/*
> > + * Free the swap entry like above, but
> > + * returns 1 if swap entry has swap cache and ready to be freed.
> > + * returns 2 if swap has other references.
> > + */
> > +int free_swap_and_check(swp_entry_t entry)
> > +{
> > +	struct swap_info_struct *p;
> > +	int ret = 0;
> > +
> > +	if (is_migration_entry(entry))
> > +		return 2;
> > +
> > +	p = swap_info_get(entry);
> > +	if (!p)
> > +		return ret;
> > +	if (swap_entry_free(p, entry, SWAP_MAP) == 1)
> > +		ret = 1;
> > +	else
> > +		ret = 2;
> 
> Wouldn't it be possible to drop the previous patch and in case
> swap_entry_free() returns 1, look up the entry in the page cache to
> see whether the last user is the cache and not a pte?

there is a race at swapin-readahead

   swap_duplicate()
   =>
   add_to_swap_cache().

So, I wrote 1/2.

After reading Hugh's comment, it seems I have to drop this all or rewrite all ;)
Anyway, Thank you for review.

-Kame





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC][PATCH] synchrouns swap freeing at zapping vmas
  2009-05-21  7:41 [RFC][PATCH] synchrouns swap freeing at zapping vmas KAMEZAWA Hiroyuki
  2009-05-21  7:43 ` [RFC][PATCH 1/2] change swapcount handling KAMEZAWA Hiroyuki
  2009-05-21  7:43 ` [RFC][PATCH 2/2] synchrouns swap freeing without trylock KAMEZAWA Hiroyuki
@ 2009-05-21 21:00 ` Hugh Dickins
  2009-05-22  0:26   ` KAMEZAWA Hiroyuki
  2009-05-22  4:39 ` Daisuke Nishimura
  3 siblings, 1 reply; 9+ messages in thread
From: Hugh Dickins @ 2009-05-21 21:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, balbir, akpm, linux-mm

On Thu, 21 May 2009, KAMEZAWA Hiroyuki wrote:
> 
> In these 6-7 weeks, we tried to fix memcg's swap-leak race by checking
> swap is valid or not after I/O.

I realize you've been working on different solutions for many weeks,
and would love a positive response.  Sorry, I'm not providing that:
these patches are not so beautiful that I'm eager to see them go in.

I ought to be attending to other priorities, but you've been clever
enough to propose intrusive mods that I can't really ignore, just to
force a response out of me!  And I'd better get a reply in with my
new address, before the old starts bouncing in a few days time.

> But Andrew Morton pointed out that
> "trylock in free_swap_and_cache() is not good"
> Oh, yes. it's not good.

Well, it has served non-memcg very well for years:
what's so bad about it now?

I've skimmed through the threads, starting from Nishimura-san's mail
on 17 March, was that the right one?  My head spins like Balbir's.

It seems like you have two leaks, but I may have missed the point.

One, that mem-swap accounting and mem+swap accounting have some
disagreement about when to (un)account to a memcg, with the result
that orphaned swapcache pages are liable to be accounted, but not
on the LRUs of the memcg.  I'd have thought that inconsistency is
something you should be sorting out at the memcg end, without
needing changes to the non-memcg code.

Other, that orphaned swapcache pages can build up until swap is
full, before reaching sufficient global memory pressure to run
through the global LRUs, which is what has traditionally dealt
with the issue.  And when swap is filled in this way, memcgs can
no longer put their pages out to swap, so OOM prematurely instead.

I can imagine (just imagining, haven't checked, may be quite wrong)
that split LRUs have interfered with that freeing of swapcache pages:
since vmscan.c is mainly targetted at freeing memory, I think it tries
to avoid the swapbacked LRUs once swap is full, so may now be missing
out on freeing such pages?

And it's probably an inefficient way to get at them anyway.
Why not have a global scan to target swapcache pages whenever swap is
approaching full (full in a real sense, not vm_swap_full's 50% sense)?
And run that before OOMing, memcg or not.

Sorry, you're probably going to have to explain for the umpteenth
time why these approaches do not work.

> 
> Then, this patch series is a trial to remove trylock for swapcache AMAP.
> Patches are more complex and larger than expected but the behavior itself is
> much appreciate than prevoius my posts for memcg...
>  
> This series contains 2 patches.
>   1. change refcounting in swap_map.
>      This is for allowing swap_map to indicate there is swap reference/cache.

You've gone to a lot of trouble to obscure what this patch is doing:
lots of changes that didn't need to be made, and an enum of 0 or 1
which keeps on being translated to a count of 2 or 1.

Using the 0x8000 bit in the swap_map to indicate if that swap entry
is in swapcache, yes, that may well be a good idea - and I don't know
why that bit isn't already used: might relate to when pids were limited
to 32000, but more likely was once used as a flag later abandoned.
But you don't need to change every single call to swap_free() etc,
they can mostly do just the same as they already do.

Whether it works correctly, I haven't tried to decide.  But was
puzzled when by the end of it, no real use was actually made of
the changes: the same trylock_page as before, it just wouldn't
get tried unsuccessfully so often.  Just preparatory work for
the second patch?

>   2. synchronous freeing of swap entries.
>      For avoiding race, free swap_entries in appropriate way with lock_page().
>      After this patch, race between swapin-readahead v.s. zap_page_range()
>      will go away.
>      Note: the whole code for zap_page_range() will not work until the system
>      or cgroup is very swappy. So, no influence in typical case.

This patch adds quite a lot of ugliness in a hot path which is already
uglier than we'd like.   Adding overhead to zap_pte_range, for the rare
swap and memcg case, isn't very welcome.

> 
> There are used trylocks more than this patch treats. But IIUC, they are not
> racy with memcg and I don't care them.
> (And....I have no idea to remove trylock() in free_pages_and_swapcache(),
>  which is called via tlb_flush_mmu()....preemption disabled and using percpu.)

I know well the difficulty, several of us have had patches to solve most
of the percpu mmu_gather problems, but the file truncation case (under
i_mmap_lock) has so far defeated us; and you can't ignore that case,
truncation has to remove even the anon (possibly swapcache) pages
from a private file mapping.

But I'm afraid, if you do nothing about free_pages_and_swapcache,
then I can't see much point in studying the rest of it, which
would only be addressing half of your problem.

> 
> These patches + Nishimura-san's writeback fix will do complete work, I think.
> But test is not enough.

Please provide a pointer to Nishimura-san's writeback fix,
I seem to have missed that.

There is indeed little point in attacking the trylock_page()s here,
unless you also attack all those PageWriteback backoffs.  I can imagine
a simple patch to do that (removing from swapcache while PageWriteback),
but it would be adding more atomic ops, and using spin_lock_irq on
swap_lock everywhere, probably not a good tradeoff.

> 
> Any comments are welcome. 

I sincerely wish I could be less discouraging!

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC][PATCH] synchrouns swap freeing at zapping vmas
  2009-05-21 21:00 ` [RFC][PATCH] synchrouns swap freeing at zapping vmas Hugh Dickins
@ 2009-05-22  0:26   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-22  0:26 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: nishimura, balbir, akpm, linux-mm

On Thu, 21 May 2009 22:00:20 +0100 (BST)
Hugh Dickins <hugh.dickins@tiscali.co.uk> wrote:

> On Thu, 21 May 2009, KAMEZAWA Hiroyuki wrote:
> > 
> > In these 6-7 weeks, we tried to fix memcg's swap-leak race by checking
> > swap is valid or not after I/O.
> 
> I realize you've been working on different solutions for many weeks,
> and would love a positive response.  Sorry, I'm not providing that:
> these patches are not so beautiful that I'm eager to see them go in.
> 
> I ought to be attending to other priorities, but you've been clever
> enough to propose intrusive mods that I can't really ignore, just to
> force a response out of me!

Sorry and thank you for your time and kindness.

>  And I'd better get a reply in with my new address, before the old
 > starts bouncing in a few days time.
> 
Updated my address book.


> > But Andrew Morton pointed out that
> > "trylock in free_swap_and_cache() is not good"
> > Oh, yes. it's not good.
> 
> Well, it has served non-memcg very well for years:
> what's so bad about it now?
> 
> I've skimmed through the threads, starting from Nishimura-san's mail
> on 17 March, was that the right one?  My head spins like Balbir's.
> 
Maybe right.

> It seems like you have two leaks, but I may have missed the point.
> 
> One, that mem-swap accounting and mem+swap accounting have some
> disagreement about when to (un)account to a memcg, with the result
> that orphaned swapcache pages are liable to be accounted, but not
> on the LRUs of the memcg.  I'd have thought that inconsistency is
> something you should be sorting out at the memcg end, without
> needing changes to the non-memcg code.
> 
I did these things in memcg. But finally,

-----------------------------------------------
                       |     free_swap_and_cache()
lock_page()            |        
 try_to_free_swap()    |
   check swap refcnt   |
                       |   swap refcnt goes to 1.
                             trylock failure
unlock_page()          | 
------------------------------------------------
This race was the last obstacle in front of me in previous patch.
This patch is a trial to remove trylock. (this patch teach me much ;)

> Other, that orphaned swapcache pages can build up until swap is
> full, before reaching sufficient global memory pressure to run
> through the global LRUs, which is what has traditionally dealt
> with the issue.  And when swap is filled in this way, memcgs can
> no longer put their pages out to swap, so OOM prematurely instead.
> 
yes.

> I can imagine (just imagining, haven't checked, may be quite wrong)
> that split LRUs have interfered with that freeing of swapcache pages:
> since vmscan.c is mainly targetted at freeing memory, I think it tries
> to avoid the swapbacked LRUs once swap is full, so may now be missing
> out on freeing such pages?
> 
Hmm, I feel it is possible. 

> And it's probably an inefficient way to get at them anyway.
> Why not have a global scan to target swapcache pages whenever swap is
> approaching full (full in a real sense, not vm_swap_full's 50% sense)?
> And run that before OOMing, memcg or not.
> 
It's one of points.
I or Nishimura have to modify vm_swap_full() to see memcg information.

But the problem in readahead case is
 - swap entry is used.
 - it's accoutned to a memcg by swap_cgroup
 - but not on memcg's LRU and we can't free it.

> Sorry, you're probably going to have to explain for the umpteenth
> time why these approaches do not work.
> 
IIRC, Nishimura and guys walks mainly for HPC and they tends to have tons of
memory. Then, I'd like to avoid scanning global LRU without any hints, as much as
possible.




> > 
> > Then, this patch series is a trial to remove trylock for swapcache AMAP.
> > Patches are more complex and larger than expected but the behavior itself is
> > much appreciate than prevoius my posts for memcg...
> >  
> > This series contains 2 patches.
> >   1. change refcounting in swap_map.
> >      This is for allowing swap_map to indicate there is swap reference/cache.
> 
> You've gone to a lot of trouble to obscure what this patch is doing:
> lots of changes that didn't need to be made, and an enum of 0 or 1
> which keeps on being translated to a count of 2 or 1.
> 
Ah, ok. it's not good.

> Using the 0x8000 bit in the swap_map to indicate if that swap entry
> is in swapcache, yes, that may well be a good idea - and I don't know
> why that bit isn't already used: might relate to when pids were limited
> to 32000, but more likely was once used as a flag later abandoned.
> But you don't need to change every single call to swap_free() etc,
> they can mostly do just the same as they already do.
> 
yes. Using 0x8000 as flag is the choice.

> Whether it works correctly, I haven't tried to decide.  But was
> puzzled when by the end of it, no real use was actually made of
> the changes: the same trylock_page as before, it just wouldn't
> get tried unsuccessfully so often.  Just preparatory work for
> the second patch?
> 
When swap count returns 1, there are 2 possibilities.
  - there is a swap cache
  - there is swap reference.

In second patch, I wanted to avoid unnecesasry call for
  find_get_page() -> lock_page() -> try_to_free_swap().
because I know I can't use large buffer for batched work.

Without second patch, I have a chace to fix this race
-----------------------------------------------
                       |     free_swap_and_cache()
lock_page()            |        
 try_to_free_swap()    |
   check swap refcnt   |
                       |   swap refcnt goes to 1.
                       |   trylock failure
unlock_page()          | 
------------------------------------------------

There will be no race between swap cache handling v.s. swap usage.



> >   2. synchronous freeing of swap entries.
> >      For avoiding race, free swap_entries in appropriate way with lock_page().
> >      After this patch, race between swapin-readahead v.s. zap_page_range()
> >      will go away.
> >      Note: the whole code for zap_page_range() will not work until the system
> >      or cgroup is very swappy. So, no influence in typical case.
> 
> This patch adds quite a lot of ugliness in a hot path which is already
> uglier than we'd like.   Adding overhead to zap_pte_range, for the rare
> swap and memcg case, isn't very welcome.
> 
Ok, I have to agree.

> > 
> > There are used trylocks more than this patch treats. But IIUC, they are not
> > racy with memcg and I don't care them.
> > (And....I have no idea to remove trylock() in free_pages_and_swapcache(),
> >  which is called via tlb_flush_mmu()....preemption disabled and using percpu.)
> 
> I know well the difficulty, several of us have had patches to solve most
> of the percpu mmu_gather problems, but the file truncation case (under
> i_mmap_lock) has so far defeated us; and you can't ignore that case,
> truncation has to remove even the anon (possibly swapcache) pages
> from a private file mapping.
> 
Ah, I may misunderstand following lines.
== zap_pte_range()
 832                  * If details->check_mapping, we leave swap entries;
 833                  * if details->nonlinear_vma, we leave file entries.
 834                  */
 835                 if (unlikely(details))
 836                         continue;
==
Then...this is bug ?

> But I'm afraid, if you do nothing about free_pages_and_swapcache,
> then I can't see much point in studying the rest of it, which
> would only be addressing half of your problem.
> 
> > 
> > These patches + Nishimura-san's writeback fix will do complete work, I think.
> > But test is not enough.
> 
> Please provide a pointer to Nishimura-san's writeback fix,
> I seem to have missed that.
> 
This one.
  http://marc.info/?l=linux-kernel&m=124236139502335&w=2

> There is indeed little point in attacking the trylock_page()s here,
> unless you also attack all those PageWriteback backoffs.  I can imagine
> a simple patch to do that (removing from swapcache while PageWriteback),
> but it would be adding more atomic ops, and using spin_lock_irq on
> swap_lock everywhere, probably not a good tradeoff.
> 
Ok, I should consider more.
> > 
> > Any comments are welcome. 
> 
> I sincerely wish I could be less discouraging!
> 
I've been feeling like to crash my head by hitting it against an edge ot a tofu
in these days. But if patch 1/2 is acceptable with the modification you suggested,
there will be a way to go. 

You encouraged me :) thanks.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC][PATCH] synchrouns swap freeing at zapping vmas
  2009-05-21  7:41 [RFC][PATCH] synchrouns swap freeing at zapping vmas KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2009-05-21 21:00 ` [RFC][PATCH] synchrouns swap freeing at zapping vmas Hugh Dickins
@ 2009-05-22  4:39 ` Daisuke Nishimura
  2009-05-22  5:05   ` KAMEZAWA Hiroyuki
  3 siblings, 1 reply; 9+ messages in thread
From: Daisuke Nishimura @ 2009-05-22  4:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, akpm, Hugh Dickins, Daisuke Nishimura

On Thu, 21 May 2009 16:41:00 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> In these 6-7 weeks, we tried to fix memcg's swap-leak race by checking
> swap is valid or not after I/O. But Andrew Morton pointed out that
> "trylock in free_swap_and_cache() is not good"
> Oh, yes. it's not good.
> 
> Then, this patch series is a trial to remove trylock for swapcache AMAP.
> Patches are more complex and larger than expected but the behavior itself is
> much appreciate than prevoius my posts for memcg...
>  
> This series contains 2 patches.
>   1. change refcounting in swap_map.
>      This is for allowing swap_map to indicate there is swap reference/cache.
>   2. synchronous freeing of swap entries.
>      For avoiding race, free swap_entries in appropriate way with lock_page().
>      After this patch, race between swapin-readahead v.s. zap_page_range()
>      will go away.
>      Note: the whole code for zap_page_range() will not work until the system
>      or cgroup is very swappy. So, no influence in typical case.
> 
> There are used trylocks more than this patch treats. But IIUC, they are not
> racy with memcg and I don't care them.
> (And....I have no idea to remove trylock() in free_pages_and_swapcache(),
>  which is called via tlb_flush_mmu()....preemption disabled and using percpu.)
> 
> These patches + Nishimura-san's writeback fix will do complete work, I think.
> But test is not enough.
> 
I've not reviewed those patches(especially 2/2) in detail, I run some tests
and saw some strange behaviors.

- system global oom was invoked after a few minites. I've never seen even memcg's oom
  in this test.

        page01 invoked oom-killer: gfp_mask=0x0, order=0, oomkilladj=0
        Pid: 20485, comm: page01 Not tainted 2.6.30-rc5-69e923d8 #2
        Call Trace:
         [<ffffffff804ee0ed>] ? _spin_unlock+0x17/0x20
         [<ffffffff8028f702>] ? oom_kill_process+0x96/0x265
         [<ffffffff8028fbf5>] ? __out_of_memory+0x31/0x81
         [<ffffffff80290062>] ? pagefault_out_of_memory+0x64/0x92
         [<ffffffff804eea9f>] ? page_fault+0x1f/0x30
        Node 0 DMA per-cpu:
        CPU    0: hi:    0, btch:   1 usd:   0
        CPU    1: hi:    0, btch:   1 usd:   0
        CPU    2: hi:    0, btch:   1 usd:   0
        CPU    3: hi:    0, btch:   1 usd:   0
        CPU    4: hi:    0, btch:   1 usd:   0
        CPU    5: hi:    0, btch:   1 usd:   0
        CPU    6: hi:    0, btch:   1 usd:   0
        CPU    7: hi:    0, btch:   1 usd:   0
        CPU    8: hi:    0, btch:   1 usd:   0
        CPU    9: hi:    0, btch:   1 usd:   0
        CPU   10: hi:    0, btch:   1 usd:   0
        CPU   11: hi:    0, btch:   1 usd:   0
        CPU   12: hi:    0, btch:   1 usd:   0
        CPU   13: hi:    0, btch:   1 usd:   0
        CPU   14: hi:    0, btch:   1 usd:   0
        CPU   15: hi:    0, btch:   1 usd:   0
        Node 0 DMA32 per-cpu:
        CPU    0: hi:  186, btch:  31 usd:  69
        CPU    1: hi:  186, btch:  31 usd:  77
        CPU    2: hi:  186, btch:  31 usd: 144
        CPU    3: hi:  186, btch:  31 usd:  19
        CPU    4: hi:  186, btch:  31 usd:  59
        CPU    5: hi:  186, btch:  31 usd:  41
        CPU    6: hi:  186, btch:  31 usd:   0
        CPU    7: hi:  186, btch:  31 usd:  38
        CPU    8: hi:  186, btch:  31 usd: 117
        CPU    9: hi:  186, btch:  31 usd:  75
        CPU   10: hi:  186, btch:  31 usd: 106
        CPU   11: hi:  186, btch:  31 usd: 117
        CPU   12: hi:  186, btch:  31 usd: 159
        CPU   13: hi:  186, btch:  31 usd: 142
        CPU   14: hi:  186, btch:  31 usd: 161
        CPU   15: hi:  186, btch:  31 usd: 160
        Node 0 Normal per-cpu:
        CPU    0: hi:   90, btch:  15 usd:  32
        CPU    1: hi:   90, btch:  15 usd:  49
        CPU    2: hi:   90, btch:  15 usd:  57
        CPU    3: hi:   90, btch:  15 usd:  94
        CPU    4: hi:   90, btch:  15 usd:  54
        CPU    5: hi:   90, btch:  15 usd:  80
        CPU    6: hi:   90, btch:  15 usd:  49
        CPU    7: hi:   90, btch:  15 usd:  89
        CPU    8: hi:   90, btch:  15 usd:  37
        CPU    9: hi:   90, btch:  15 usd:  76
        CPU   10: hi:   90, btch:  15 usd:  45
        CPU   11: hi:   90, btch:  15 usd:  57
        CPU   12: hi:   90, btch:  15 usd: 100
        CPU   13: hi:   90, btch:  15 usd:  74
        CPU   14: hi:   90, btch:  15 usd:  73
        CPU   15: hi:   90, btch:  15 usd:  47
        Node 1 Normal per-cpu:
        CPU    0: hi:  186, btch:  31 usd:   0
        CPU    1: hi:  186, btch:  31 usd:   0
        CPU    2: hi:  186, btch:  31 usd:   0
        CPU    3: hi:  186, btch:  31 usd:   0
        CPU    4: hi:  186, btch:  31 usd:   0
        CPU    5: hi:  186, btch:  31 usd:   0
        CPU    6: hi:  186, btch:  31 usd:   0
        CPU    7: hi:  186, btch:  31 usd:   0
        CPU    8: hi:  186, btch:  31 usd:   0
        CPU    9: hi:  186, btch:  31 usd:   0
        CPU   10: hi:  186, btch:  31 usd:   0
        CPU   11: hi:  186, btch:  31 usd:   0
        CPU   12: hi:  186, btch:  31 usd:   0
        CPU   13: hi:  186, btch:  31 usd:   0
        CPU   14: hi:  186, btch:  31 usd:   0
        CPU   15: hi:  186, btch:  31 usd:   0
        Node 2 Normal per-cpu:
        CPU    0: hi:  186, btch:  31 usd:   0
        CPU    1: hi:  186, btch:  31 usd:   0
        CPU    2: hi:  186, btch:  31 usd:   0
        CPU    3: hi:  186, btch:  31 usd:   0
        CPU    4: hi:  186, btch:  31 usd:   0
        CPU    5: hi:  186, btch:  31 usd:   0
        CPU    6: hi:  186, btch:  31 usd:   0
        CPU    7: hi:  186, btch:  31 usd:   0
        CPU    8: hi:  186, btch:  31 usd:   0
        CPU    9: hi:  186, btch:  31 usd:   0
        CPU   10: hi:  186, btch:  31 usd:   0
        CPU   11: hi:  186, btch:  31 usd:   0
        CPU   12: hi:  186, btch:  31 usd:   0
        CPU   13: hi:  186, btch:  31 usd:   0
        CPU   14: hi:  186, btch:  31 usd:   0
        CPU   15: hi:  186, btch:  31 usd:   0
        Node 3 Normal per-cpu:
        CPU    0: hi:  186, btch:  31 usd:   0
        CPU    1: hi:  186, btch:  31 usd:   0
        CPU    2: hi:  186, btch:  31 usd: 164
        CPU    3: hi:  186, btch:  31 usd:   0
        CPU    4: hi:  186, btch:  31 usd:   0
        CPU    5: hi:  186, btch:  31 usd:   0
        CPU    6: hi:  186, btch:  31 usd:   0
        CPU    7: hi:  186, btch:  31 usd:   0
        CPU    8: hi:  186, btch:  31 usd:   0
        CPU    9: hi:  186, btch:  31 usd:   0
        CPU   10: hi:  186, btch:  31 usd:   0
        CPU   11: hi:  186, btch:  31 usd:   0
        CPU   12: hi:  186, btch:  31 usd:  86
        CPU   13: hi:  186, btch:  31 usd:  36
        CPU   14: hi:  186, btch:  31 usd: 179
        CPU   15: hi:  186, btch:  31 usd: 120
        Active_anon:49386 active_file:7453 inactive_anon:4256
         inactive_file:62010 unevictable:0 dirty:0 writeback:10 unstable:0
         free:3319229 slab:12952 mapped:9282 pagetables:4893 bounce:0
        Node 0 DMA free:3784kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB present:15100kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 3204 3453 3453
        Node 0 DMA32 free:2938020kB min:3472kB low:4340kB high:5208kB active_anon:24280kB inactive_anon:17024kB active_file:1600kB inactive_file:44032kB unevictable:0kB present:3281248kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 249 249
        Node 0 Normal free:292kB min:268kB low:332kB high:400kB active_anon:29872kB inactive_anon:0kB active_file:23096kB inactive_file:152440kB unevictable:0kB present:255488kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 1 Normal free:3522552kB min:3784kB low:4728kB high:5676kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB present:3576832kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 2 Normal free:3520304kB min:3784kB low:4728kB high:5676kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB present:3576832kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 3 Normal free:3291964kB min:3784kB low:4728kB high:5676kB active_anon:143392kB inactive_anon:0kB active_file:5116kB inactive_file:51568kB unevictable:0kB present:3576832kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 0 DMA: 2*4kB 4*8kB 2*16kB 4*32kB 2*64kB 3*128kB 2*256kB 1*512kB 2*1024kB 0*2048kB 0*4096kB = 3784kB
        Node 0 DMA32: 59*4kB 29*8kB 17*16kB 40*32kB 29*64kB 3*128kB 4*256kB 2*512kB 1*1024kB 3*2048kB 714*4096kB = 2938020kB
        Node 0 Normal: 35*4kB 9*8kB 3*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 292kB
        Node 1 Normal: 8*4kB 9*8kB 9*16kB 6*32kB 5*64kB 6*128kB 4*256kB 3*512kB 6*1024kB 3*2048kB 856*4096kB = 3522552kB
        Node 2 Normal: 10*4kB 9*8kB 6*16kB 7*32kB 6*64kB 8*128kB 4*256kB 4*512kB 5*1024kB 2*2048kB 856*4096kB = 3520304kB
        Node 3 Normal: 70*4kB 23*8kB 9*16kB 8*32kB 7*64kB 2*128kB 3*256kB 3*512kB 3*1024kB 4*2048kB 800*4096kB = 3291936kB
        73041 total pagecache pages
        3220 pages in swap cache
        Swap cache stats: add 2206323, delete 2203103, find 1254789/1376833
        Free swap  = 1978488kB
        Total swap = 2000888kB

- Using shmem caused a BUG.

        BUG: sleeping function called from invalid context at include/linux/pagemap.h:327
        in_atomic(): 1, irqs_disabled(): 0, pid: 1113, name: shmem_test_02
        no locks held by shmem_test_02/1113.
        Pid: 1113, comm: shmem_test_02 Not tainted 2.6.30-rc5-69e923d8 #2
        Call Trace:
         [<ffffffff802ad004>] ? free_swap_batch+0x40/0x7f
         [<ffffffff80299b58>] ? shmem_free_swp+0xac/0xca
         [<ffffffff8029a0f1>] ? shmem_truncate_range+0x57b/0x7af
         [<ffffffff80378393>] ? __percpu_counter_add+0x3e/0x5c
         [<ffffffff8029c458>] ? shmem_delete_inode+0x77/0xd3
         [<ffffffff8029c3e1>] ? shmem_delete_inode+0x0/0xd3
         [<ffffffff802d3ab7>] ? generic_delete_inode+0xe0/0x178
         [<ffffffff802d0dda>] ? d_kill+0x24/0x46
         [<ffffffff802d2212>] ? dput+0x134/0x141
         [<ffffffff802c3504>] ? __fput+0x189/0x1ba
         [<ffffffff802a50e4>] ? remove_vma+0x4e/0x83
         [<ffffffff802a5224>] ? exit_mmap+0x10b/0x129
         [<ffffffff80238fbd>] ? mmput+0x41/0x9f
         [<ffffffff8023cf37>] ? exit_mm+0x101/0x10c
         [<ffffffff8023e439>] ? do_exit+0x1a0/0x61a
         [<ffffffff80259253>] ? trace_hardirqs_on_caller+0x113/0x13e
         [<ffffffff8023e926>] ? do_group_exit+0x73/0xa5
         [<ffffffff8023e96a>] ? sys_exit_group+0x12/0x16
         [<ffffffff8020b96b>] ? system_call_fastpath+0x16/0x1b

(include/linux/pagemap.h)
    325 static inline void lock_page(struct page *page)
    326 {
    327         might_sleep();
    328         if (!trylock_page(page))
    329                 __lock_page(page);
    330 }
    331


I hope they would be some help for you.

Thanks,
Daisuke Nishimura.

> Any comments are welcome. 
> 
> Thanks,
> -Kame
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC][PATCH] synchrouns swap freeing at zapping vmas
  2009-05-22  4:39 ` Daisuke Nishimura
@ 2009-05-22  5:05   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-22  5:05 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, balbir, akpm, Hugh Dickins

On Fri, 22 May 2009 13:39:06 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Thu, 21 May 2009 16:41:00 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> - Using shmem caused a BUG.
> 
>         BUG: sleeping function called from invalid context at include/linux/pagemap.h:327
>         in_atomic(): 1, irqs_disabled(): 0, pid: 1113, name: shmem_test_02
>         no locks held by shmem_test_02/1113.
>         Pid: 1113, comm: shmem_test_02 Not tainted 2.6.30-rc5-69e923d8 #2
>         Call Trace:
>          [<ffffffff802ad004>] ? free_swap_batch+0x40/0x7f
>          [<ffffffff80299b58>] ? shmem_free_swp+0xac/0xca
>          [<ffffffff8029a0f1>] ? shmem_truncate_range+0x57b/0x7af
>          [<ffffffff80378393>] ? __percpu_counter_add+0x3e/0x5c
>          [<ffffffff8029c458>] ? shmem_delete_inode+0x77/0xd3
>          [<ffffffff8029c3e1>] ? shmem_delete_inode+0x0/0xd3
>          [<ffffffff802d3ab7>] ? generic_delete_inode+0xe0/0x178
>          [<ffffffff802d0dda>] ? d_kill+0x24/0x46
>          [<ffffffff802d2212>] ? dput+0x134/0x141
>          [<ffffffff802c3504>] ? __fput+0x189/0x1ba
>          [<ffffffff802a50e4>] ? remove_vma+0x4e/0x83
>          [<ffffffff802a5224>] ? exit_mmap+0x10b/0x129
>          [<ffffffff80238fbd>] ? mmput+0x41/0x9f
>          [<ffffffff8023cf37>] ? exit_mm+0x101/0x10c
>          [<ffffffff8023e439>] ? do_exit+0x1a0/0x61a
>          [<ffffffff80259253>] ? trace_hardirqs_on_caller+0x113/0x13e
>          [<ffffffff8023e926>] ? do_group_exit+0x73/0xa5
>          [<ffffffff8023e96a>] ? sys_exit_group+0x12/0x16
>          [<ffffffff8020b96b>] ? system_call_fastpath+0x16/0x1b
> 
> (include/linux/pagemap.h)
>     325 static inline void lock_page(struct page *page)
>     326 {
>     327         might_sleep();
>     328         if (!trylock_page(page))
>     329                 __lock_page(page);
>     330 }
>     331
> 
> 
> I hope they would be some help for you.
> 
Thanks, I have to drop this patch ;)
Now, I found a very clean new way....I think (modify memcg's logic).
Thank you for your contribution and patience.

Thanks,
-Kame


> Thanks,
> Daisuke Nishimura.
> 
> > Any comments are welcome. 
> > 
> > Thanks,
> > -Kame
> > 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2009-05-22  5:07 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-21  7:41 [RFC][PATCH] synchrouns swap freeing at zapping vmas KAMEZAWA Hiroyuki
2009-05-21  7:43 ` [RFC][PATCH 1/2] change swapcount handling KAMEZAWA Hiroyuki
2009-05-21  7:43 ` [RFC][PATCH 2/2] synchrouns swap freeing without trylock KAMEZAWA Hiroyuki
2009-05-21 12:44   ` Johannes Weiner
2009-05-21 23:46     ` KAMEZAWA Hiroyuki
2009-05-21 21:00 ` [RFC][PATCH] synchrouns swap freeing at zapping vmas Hugh Dickins
2009-05-22  0:26   ` KAMEZAWA Hiroyuki
2009-05-22  4:39 ` Daisuke Nishimura
2009-05-22  5:05   ` KAMEZAWA Hiroyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox