* [PATCH 0/4] memcg fix swap accounting (28/May)
From: KAMEZAWA Hiroyuki @ 2009-05-28 4:54 UTC
To: linux-mm; +Cc: linux-kernel, nishimura, balbir, hugh.dickins, hannes, akpm
Removed "RFC"
This patch series is restructured as following. some bugs are fixed.
Thank you for all your helps.
[1/4] ....change the interface of swap_duplicate()/swap_free()
Adds the functions swapcache_prepare() and swapcache_free().
[2/4] ....add a SWAP_HAS_CACHE flag and modify reference counting in swap_map
Adds a SWAP_HAS_CACHE flag to the swap_map array so that we can tell that
"there is only a swap cache and the swap entry has no other reference"
without an extra call to find_get_page().
[3/4] ....reclaim swap-cache-only swap entries when get_swap_page() finds them.
Now that swap_map can tell us "there is no reference other than the cache",
we can reclaim such entries if necessary.
This code reclaims swap entries if
- vm_swap_full() == true
&& there is no free swap cluster
&& get_swap_page() finds an unused swap entry.
[4/4] .... fix memcg's swap accounting
This fixes memcg's swap accounting leak, seen in races like this:
==
processA | processB
-------------------------------------+-------------------------------------
(free_swap_and_cache()) | (read_swap_cache_async())
| swap_duplicate()
| __set_page_locked()
| add_to_swap_cache()
swap_entry_free() == 0 |
find_get_page() -> found |
try_lock_page() -> fail & return |
| lru_cache_add_anon()
| doesn't link this page to memcg's
| LRU, because of !PageCgroupUsed.
==
This patch fixes this by uncharging the swap account when the swap entry's
refcnt goes to 0, even if an unused swap cache remains.
Works quite well in my tests.
Thanks,
-Kame
* [PATCH 1/4] add swap cache interface for swap reference
From: KAMEZAWA Hiroyuki @ 2009-05-28 5:10 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, nishimura, balbir, hugh.dickins, hannes, akpm
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
In a following patch, usage of the swap cache will be recorded in swap_map.
This patch makes the interface changes necessary for that.
2 interfaces:
- swapcache_prepare()
- swapcache_free()
are added for taking/releasing a refcnt from the swap cache on existing
swap entries. The implementation itself is not changed by this patch.
While adding swapcache_free(), memcg's hook code is moved under
swapcache_free(). This is better than using scattered hooks.
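For illustration, a minimal user-space model of the intended call pairing
(this is not kernel code; a plain integer stands in for a swap_map slot,
and the *_model names are toy stand-ins for the new interfaces):
==
#include <assert.h>

/* Toy model: one swap_map slot, tracking references only. */
static int refcnt;

/* Model of swapcache_prepare(): take the swap cache's reference on an
 * existing entry; it must fail if the entry has already been freed. */
static int swapcache_prepare_model(void)
{
        if (!refcnt)
                return -1;      /* entry is gone; caller must not use it */
        refcnt++;
        return 0;
}

/* Model of swapcache_free(): drop the cache's reference.  In the real
 * patch, the memcg uncharge hook now lives here, in one place. */
static void swapcache_free_model(void)
{
        refcnt--;
}

int main(void)
{
        refcnt = 1;     /* one reference from a pte exists */
        assert(swapcache_prepare_model() == 0);
        swapcache_free_model();
        assert(refcnt == 1);    /* back to the pte's reference only */
        return 0;
}
==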
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/swap.h | 7 +++++++
mm/swap_state.c | 11 +++++------
mm/swapfile.c | 19 +++++++++++++++++++
mm/vmscan.c | 3 +--
4 files changed, 32 insertions(+), 8 deletions(-)
Index: new-trial-swapcount2/include/linux/swap.h
===================================================================
--- new-trial-swapcount2.orig/include/linux/swap.h
+++ new-trial-swapcount2/include/linux/swap.h
@@ -301,8 +301,10 @@ extern void si_swapinfo(struct sysinfo *
extern swp_entry_t get_swap_page(void);
extern swp_entry_t get_swap_page_of_type(int);
extern int swap_duplicate(swp_entry_t);
+extern int swapcache_prepare(swp_entry_t);
extern int valid_swaphandles(swp_entry_t, unsigned long *);
extern void swap_free(swp_entry_t);
+extern void swapcache_free(swp_entry_t, struct page *page);
extern int free_swap_and_cache(swp_entry_t);
extern int swap_type_of(dev_t, sector_t, struct block_device **);
extern unsigned int count_swap_pages(int, int);
@@ -371,11 +373,16 @@ static inline void show_swap_cache_info(
#define free_swap_and_cache(swp) is_migration_entry(swp)
#define swap_duplicate(swp) is_migration_entry(swp)
+#define swapcache_prepare(swp) is_migration_entry(swp)
static inline void swap_free(swp_entry_t swp)
{
}
+static inline void swapcache_free(swp_entry_t swp, struct page *page)
+{
+}
+
static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr)
{
Index: new-trial-swapcount2/mm/swap_state.c
===================================================================
--- new-trial-swapcount2.orig/mm/swap_state.c
+++ new-trial-swapcount2/mm/swap_state.c
@@ -162,11 +162,11 @@ int add_to_swap(struct page *page)
return 1;
case -EEXIST:
/* Raced with "speculative" read_swap_cache_async */
- swap_free(entry);
+ swapcache_free(entry, NULL);
continue;
default:
/* -ENOMEM radix-tree allocation failure */
- swap_free(entry);
+ swapcache_free(entry, NULL);
return 0;
}
}
@@ -188,8 +188,7 @@ void delete_from_swap_cache(struct page
__delete_from_swap_cache(page);
spin_unlock_irq(&swapper_space.tree_lock);
- mem_cgroup_uncharge_swapcache(page, entry);
- swap_free(entry);
+ swapcache_free(entry, page);
page_cache_release(page);
}
@@ -293,7 +292,7 @@ struct page *read_swap_cache_async(swp_e
/*
* Swap entry may have been freed since our caller observed it.
*/
- if (!swap_duplicate(entry))
+ if (!swapcache_prepare(entry))
break;
/*
@@ -317,7 +316,7 @@ struct page *read_swap_cache_async(swp_e
}
ClearPageSwapBacked(new_page);
__clear_page_locked(new_page);
- swap_free(entry);
+ swapcache_free(entry, NULL);
} while (err != -ENOMEM);
if (new_page)
Index: new-trial-swapcount2/mm/swapfile.c
===================================================================
--- new-trial-swapcount2.orig/mm/swapfile.c
+++ new-trial-swapcount2/mm/swapfile.c
@@ -510,6 +510,16 @@ void swap_free(swp_entry_t entry)
}
/*
+ * Called after dropping swapcache to decrease refcnt to swap entries.
+ */
+void swapcache_free(swp_entry_t entry, struct page *page)
+{
+ if (page)
+ mem_cgroup_uncharge_swapcache(page, entry);
+ return swap_free(entry);
+}
+
+/*
* How many references to page are currently swapped out?
*/
static inline int page_swapcount(struct page *page)
@@ -1979,6 +1989,15 @@ bad_file:
goto out;
}
+/*
+ * Called when allocating swap cache for an existing swap entry.
+ */
+int swapcache_prepare(swp_entry_t entry)
+{
+ return swap_duplicate(entry);
+}
+
+
struct swap_info_struct *
get_swap_info_struct(unsigned type)
{
Index: new-trial-swapcount2/mm/vmscan.c
===================================================================
--- new-trial-swapcount2.orig/mm/vmscan.c
+++ new-trial-swapcount2/mm/vmscan.c
@@ -477,8 +477,7 @@ static int __remove_mapping(struct addre
swp_entry_t swap = { .val = page_private(page) };
__delete_from_swap_cache(page);
spin_unlock_irq(&mapping->tree_lock);
- mem_cgroup_uncharge_swapcache(page, swap);
- swap_free(swap);
+ swapcache_free(swap, page);
} else {
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
* [PATCH 2/4] modify swap_map and add SWAP_HAS_CACHE flag.
From: KAMEZAWA Hiroyuki @ 2009-05-28 5:19 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, nishimura, balbir, hugh.dickins, hannes, akpm
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
This is part of a patch set for fixing memcg's swap accounting leak. But, IMHO,
it is not a bad patch even without memcg.
Now, references to swap are counted in swap_map[], an array of unsigned short.
There are 2 kinds of references to swap:
- references from swap entries
- a reference from the swap cache
Then,
- If there is a swap cache && swap's refcnt is 1, there is only the swap cache.
(*) swapcount(entry) == 1 && find_get_page(swapper_space, entry) != NULL
This counting logic has worked well for a long time. But considering that
we cannot tell from swap_map[] whether a _real_ reference exists or not, the
current use of the counter is not very good.
This patch adds a SWAP_HAS_CACHE flag and records whether a swap entry has
a cache or not. This removes the -1 magic used in swapfile.c and helps
avoid unnecessary find_get_page() calls.
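To make the encoding concrete, a user-space sketch of how a 16-bit
swap_map entry is interpreted after this patch (the constants mirror the
patch's defines; SWAP_COUNT_MASK is written out as 0x7fff here):
==
#include <stdio.h>

#define SWAP_MAP_MAX    0x7ffe  /* highest usable reference count */
#define SWAP_MAP_BAD    0x7fff  /* marker for a bad slot */
#define SWAP_HAS_CACHE  0x8000  /* top bit: a swap cache page exists */
#define SWAP_COUNT_MASK 0x7fff  /* low 15 bits hold the count */

static int swap_count(unsigned short ent)     { return ent & SWAP_COUNT_MASK; }
static int swap_has_cache(unsigned short ent) { return ent & SWAP_HAS_CACHE; }

int main(void)
{
        /* A cache-only entry: flag set, count zero.  "There is no
         * reference other than the cache" is now visible in swap_map
         * itself, without calling find_get_page(). */
        unsigned short ent = SWAP_HAS_CACHE;

        printf("count=%d has_cache=%d\n",
               swap_count(ent), !!swap_has_cache(ent));
        return 0;
}
==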
Changelog: v1->v2
- fixed swapcache_prepare()'s return code.
- changed swap_duplicate() to be a void function.
- fixed racy case in swapoff().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/swap.h | 14 ++-
mm/swap_state.c | 5 +
mm/swapfile.c | 203 ++++++++++++++++++++++++++++++++++++---------------
3 files changed, 160 insertions(+), 62 deletions(-)
Index: new-trial-swapcount2/include/linux/swap.h
===================================================================
--- new-trial-swapcount2.orig/include/linux/swap.h
+++ new-trial-swapcount2/include/linux/swap.h
@@ -129,9 +129,10 @@ enum {
#define SWAP_CLUSTER_MAX 32
-#define SWAP_MAP_MAX 0x7fff
-#define SWAP_MAP_BAD 0x8000
-
+#define SWAP_MAP_MAX 0x7ffe
+#define SWAP_MAP_BAD 0x7fff
+#define SWAP_HAS_CACHE 0x8000 /* There is a swap cache of entry. */
+#define SWAP_COUNT_MASK (~SWAP_HAS_CACHE)
/*
* The in-memory structure used to track swap areas.
*/
@@ -300,7 +301,7 @@ extern long total_swap_pages;
extern void si_swapinfo(struct sysinfo *);
extern swp_entry_t get_swap_page(void);
extern swp_entry_t get_swap_page_of_type(int);
-extern int swap_duplicate(swp_entry_t);
+extern void swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern int valid_swaphandles(swp_entry_t, unsigned long *);
extern void swap_free(swp_entry_t);
@@ -372,9 +373,12 @@ static inline void show_swap_cache_info(
}
#define free_swap_and_cache(swp) is_migration_entry(swp)
-#define swap_duplicate(swp) is_migration_entry(swp)
#define swapcache_prepare(swp) is_migration_entry(swp)
+static inline void swap_duplicate(swp_entry_t swp)
+{
+}
+
static inline void swap_free(swp_entry_t swp)
{
}
Index: new-trial-swapcount2/mm/swapfile.c
===================================================================
--- new-trial-swapcount2.orig/mm/swapfile.c
+++ new-trial-swapcount2/mm/swapfile.c
@@ -53,6 +53,26 @@ static struct swap_info_struct swap_info
static DEFINE_MUTEX(swapon_mutex);
+/* For reference count accounting in swap_map */
+static inline int swap_count(unsigned short ent)
+{
+ return ent & SWAP_COUNT_MASK;
+}
+
+static inline int swap_has_cache(unsigned short ent)
+{
+ return ent & SWAP_HAS_CACHE;
+}
+
+static inline unsigned short make_swap_count(int count, int has_cache)
+{
+ unsigned short ret = count;
+
+ if (has_cache)
+ return SWAP_HAS_CACHE | ret;
+ return ret;
+}
+
/*
* We need this because the bdev->unplug_fn can sleep and we cannot
* hold swap_lock while calling the unplug_fn. And swap_lock
@@ -167,7 +187,8 @@ static int wait_for_discard(void *word)
#define SWAPFILE_CLUSTER 256
#define LATENCY_LIMIT 256
-static inline unsigned long scan_swap_map(struct swap_info_struct *si)
+static inline unsigned long scan_swap_map(struct swap_info_struct *si,
+ int cache)
{
unsigned long offset;
unsigned long scan_base;
@@ -285,7 +306,10 @@ checks:
si->lowest_bit = si->max;
si->highest_bit = 0;
}
- si->swap_map[offset] = 1;
+ if (cache) /* at usual swap-out via vmscan.c */
+ si->swap_map[offset] = make_swap_count(0, 1);
+ else /* at suspend */
+ si->swap_map[offset] = make_swap_count(1, 0);
si->cluster_next = offset + 1;
si->flags -= SWP_SCANNING;
@@ -401,7 +425,8 @@ swp_entry_t get_swap_page(void)
continue;
swap_list.next = next;
- offset = scan_swap_map(si);
+ /* This is called for allocating swap entry for cache */
+ offset = scan_swap_map(si, 1);
if (offset) {
spin_unlock(&swap_lock);
return swp_entry(type, offset);
@@ -415,6 +440,7 @@ noswap:
return (swp_entry_t) {0};
}
+/* The only caller of this function is now the suspend routine */
swp_entry_t get_swap_page_of_type(int type)
{
struct swap_info_struct *si;
@@ -424,7 +450,8 @@ swp_entry_t get_swap_page_of_type(int ty
si = swap_info + type;
if (si->flags & SWP_WRITEOK) {
nr_swap_pages--;
- offset = scan_swap_map(si);
+ /* This is called for allocating swap entry, not cache */
+ offset = scan_swap_map(si, 0);
if (offset) {
spin_unlock(&swap_lock);
return swp_entry(type, offset);
@@ -471,25 +498,36 @@ out:
return NULL;
}
-static int swap_entry_free(struct swap_info_struct *p, swp_entry_t ent)
+static int swap_entry_free(struct swap_info_struct *p,
+ swp_entry_t ent, int cache)
{
unsigned long offset = swp_offset(ent);
- int count = p->swap_map[offset];
+ int count = swap_count(p->swap_map[offset]);
+ int has_cache = swap_has_cache(p->swap_map[offset]);
- if (count < SWAP_MAP_MAX) {
- count--;
- p->swap_map[offset] = count;
- if (!count) {
- if (offset < p->lowest_bit)
- p->lowest_bit = offset;
- if (offset > p->highest_bit)
- p->highest_bit = offset;
- if (p->prio > swap_info[swap_list.next].prio)
- swap_list.next = p - swap_info;
- nr_swap_pages++;
- p->inuse_pages--;
- mem_cgroup_uncharge_swap(ent);
- }
+ if (!cache) { /* dropping usage count of swap */
+ if (count < SWAP_MAP_MAX) {
+ count--;
+ p->swap_map[offset] = make_swap_count(count, has_cache);
+ }
+ } else { /* dropping swap cache flag */
+ VM_BUG_ON(!has_cache);
+ p->swap_map[offset] = make_swap_count(count, 0);
+
+ }
+ /* return code. */
+ count = p->swap_map[offset];
+ /* free if no reference */
+ if (!count) {
+ if (offset < p->lowest_bit)
+ p->lowest_bit = offset;
+ if (offset > p->highest_bit)
+ p->highest_bit = offset;
+ if (p->prio > swap_info[swap_list.next].prio)
+ swap_list.next = p - swap_info;
+ nr_swap_pages++;
+ p->inuse_pages--;
+ mem_cgroup_uncharge_swap(ent);
}
return count;
}
@@ -504,7 +542,7 @@ void swap_free(swp_entry_t entry)
p = swap_info_get(entry);
if (p) {
- swap_entry_free(p, entry);
+ swap_entry_free(p, entry, 0);
spin_unlock(&swap_lock);
}
}
@@ -514,9 +552,16 @@ void swap_free(swp_entry_t entry)
*/
void swapcache_free(swp_entry_t entry, struct page *page)
{
+ struct swap_info_struct *p;
+
if (page)
mem_cgroup_uncharge_swapcache(page, entry);
- return swap_free(entry);
+ p = swap_info_get(entry);
+ if (p) {
+ swap_entry_free(p, entry, 1);
+ spin_unlock(&swap_lock);
+ }
+ return;
}
/*
@@ -531,8 +576,7 @@ static inline int page_swapcount(struct
entry.val = page_private(page);
p = swap_info_get(entry);
if (p) {
- /* Subtract the 1 for the swap cache itself */
- count = p->swap_map[swp_offset(entry)] - 1;
+ count = swap_count(p->swap_map[swp_offset(entry)]);
spin_unlock(&swap_lock);
}
return count;
@@ -594,7 +638,7 @@ int free_swap_and_cache(swp_entry_t entr
p = swap_info_get(entry);
if (p) {
- if (swap_entry_free(p, entry) == 1) {
+ if (swap_entry_free(p, entry, 0) == SWAP_HAS_CACHE) {
page = find_get_page(&swapper_space, entry.val);
if (page && !trylock_page(page)) {
page_cache_release(page);
@@ -901,7 +945,7 @@ static unsigned int find_next_to_unuse(s
i = 1;
}
count = si->swap_map[i];
- if (count && count != SWAP_MAP_BAD)
+ if (count && swap_count(count) != SWAP_MAP_BAD)
break;
}
return i;
@@ -1005,13 +1049,13 @@ static int try_to_unuse(unsigned int typ
*/
shmem = 0;
swcount = *swap_map;
- if (swcount > 1) {
+ if (swap_count(swcount)) {
if (start_mm == &init_mm)
shmem = shmem_unuse(entry, page);
else
retval = unuse_mm(start_mm, entry, page);
}
- if (*swap_map > 1) {
+ if (swap_count(*swap_map)) {
int set_start_mm = (*swap_map >= swcount);
struct list_head *p = &start_mm->mmlist;
struct mm_struct *new_start_mm = start_mm;
@@ -1021,7 +1065,7 @@ static int try_to_unuse(unsigned int typ
atomic_inc(&new_start_mm->mm_users);
atomic_inc(&prev_mm->mm_users);
spin_lock(&mmlist_lock);
- while (*swap_map > 1 && !retval && !shmem &&
+ while (swap_count(*swap_map) && !retval && !shmem &&
(p = p->next) != &start_mm->mmlist) {
mm = list_entry(p, struct mm_struct, mmlist);
if (!atomic_inc_not_zero(&mm->mm_users))
@@ -1033,14 +1077,16 @@ static int try_to_unuse(unsigned int typ
cond_resched();
swcount = *swap_map;
- if (swcount <= 1)
+ if (!swap_count(swcount)) /* any usage ? */
;
else if (mm == &init_mm) {
set_start_mm = 1;
shmem = shmem_unuse(entry, page);
} else
retval = unuse_mm(mm, entry, page);
- if (set_start_mm && *swap_map < swcount) {
+
+ if (set_start_mm &&
+ swap_count(*swap_map) < swcount) {
mmput(new_start_mm);
atomic_inc(&mm->mm_users);
new_start_mm = mm;
@@ -1067,21 +1113,25 @@ static int try_to_unuse(unsigned int typ
}
/*
- * How could swap count reach 0x7fff when the maximum
- * pid is 0x7fff, and there's no way to repeat a swap
- * page within an mm (except in shmem, where it's the
- * shared object which takes the reference count)?
- * We believe SWAP_MAP_MAX cannot occur in Linux 2.4.
- *
+ * How could the swap count reach 0x7ffe?
+ * There's no way to repeat a swap page within an mm
+ * (except in shmem, where it's the shared object which takes
+ * the reference count).
+ * We believe SWAP_MAP_MAX cannot occur. (If it did, unsigned
+ * short would be too small....)
* If that's wrong, then we should worry more about
* exit_mmap() and do_munmap() cases described above:
* we might be resetting SWAP_MAP_MAX too early here.
* We know "Undead"s can happen, they're okay, so don't
* report them; but do report if we reset SWAP_MAP_MAX.
*/
- if (*swap_map == SWAP_MAP_MAX) {
+ /* We might release the lock_page() in unuse_mm(). */
+ if (!PageSwapCache(page) || page_private(page) != entry.val)
+ goto retry;
+
+ if (swap_count(*swap_map) == SWAP_MAP_MAX) {
spin_lock(&swap_lock);
- *swap_map = 1;
+ *swap_map = make_swap_count(0, 1);
spin_unlock(&swap_lock);
reset_overflow = 1;
}
@@ -1099,7 +1149,8 @@ static int try_to_unuse(unsigned int typ
* pages would be incorrect if swap supported "shared
* private" pages, but they are handled by tmpfs files.
*/
- if ((*swap_map > 1) && PageDirty(page) && PageSwapCache(page)) {
+ if (swap_count(*swap_map) &&
+ PageDirty(page) && PageSwapCache(page)) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
};
@@ -1126,6 +1177,7 @@ static int try_to_unuse(unsigned int typ
* mark page dirty so shrink_page_list will preserve it.
*/
SetPageDirty(page);
+retry:
unlock_page(page);
page_cache_release(page);
@@ -1952,15 +2004,22 @@ void si_swapinfo(struct sysinfo *val)
*
* Note: if swap_map[] reaches SWAP_MAP_MAX the entries are treated as
* "permanent", but will be reclaimed by the next swapoff.
+ * Returns an error code in the following cases:
+ * - success -> 0
+ * - swp_entry is invalid -> EINVAL
+ * - swp_entry is migration entry -> EINVAL
+ * - swap-cache reference is requested but there is already one. -> EEXIST
+ * - swap-cache reference is requested but the entry is not used. -> ENOENT
*/
-int swap_duplicate(swp_entry_t entry)
+static int __swap_duplicate(swp_entry_t entry, int cache)
{
struct swap_info_struct * p;
unsigned long offset, type;
- int result = 0;
+ int result = -EINVAL;
+ int count, has_cache;
if (is_migration_entry(entry))
- return 1;
+ return -EINVAL;
type = swp_type(entry);
if (type >= nr_swapfiles)
@@ -1969,17 +2028,37 @@ int swap_duplicate(swp_entry_t entry)
offset = swp_offset(entry);
spin_lock(&swap_lock);
- if (offset < p->max && p->swap_map[offset]) {
- if (p->swap_map[offset] < SWAP_MAP_MAX - 1) {
- p->swap_map[offset]++;
- result = 1;
- } else if (p->swap_map[offset] <= SWAP_MAP_MAX) {
+
+ if (unlikely(offset >= p->max))
+ goto unlock_out;
+
+ count = swap_count(p->swap_map[offset]);
+ has_cache = swap_has_cache(p->swap_map[offset]);
+ if (cache) { /* called for swapcache/swapin-readahead */
+ /* set SWAP_HAS_CACHE if there is no cache and entry is used */
+ if (!has_cache && count) {
+ p->swap_map[offset] = make_swap_count(count, 1);
+ result = 0;
+ } else if (has_cache)
+ result = -EEXIST;
+ else if (!count)
+ result = -ENOENT;
+ } else if (count || has_cache) {
+ if (count < SWAP_MAP_MAX - 1) {
+ p->swap_map[offset] = make_swap_count(count + 1,
+ has_cache);
+ result = 0;
+ } else if (count <= SWAP_MAP_MAX) {
if (swap_overflow++ < 5)
- printk(KERN_WARNING "swap_dup: swap entry overflow\n");
- p->swap_map[offset] = SWAP_MAP_MAX;
- result = 1;
- }
- }
+ printk(KERN_WARNING
+ "swap_dup: swap entry overflow\n");
+ p->swap_map[offset] = make_swap_count(SWAP_MAP_MAX,
+ has_cache);
+ result = 0;
+ }
+ } else
+ result = -ENOENT; /* unused swap entry */
+unlock_out:
spin_unlock(&swap_lock);
out:
return result;
@@ -1988,13 +2067,25 @@ bad_file:
printk(KERN_ERR "swap_dup: %s%08lx\n", Bad_file, entry.val);
goto out;
}
+/*
+ * increase reference count of swap entry by 1.
+ */
+void swap_duplicate(swp_entry_t entry)
+{
+ __swap_duplicate(entry, 0);
+}
/*
+ * @entry: swap entry for which we allocate swap cache.
+ *
* Called when allocating swap cache for an existing swap entry.
+ * This can return error codes. Returns 0 on success.
+ * -EEXIST means there is already a swap cache.
+ * Note: return code is different from swap_duplicate().
*/
int swapcache_prepare(swp_entry_t entry)
{
- return swap_duplicate(entry);
+ return __swap_duplicate(entry, 1);
}
@@ -2035,7 +2126,7 @@ int valid_swaphandles(swp_entry_t entry,
/* Don't read in free or bad pages */
if (!si->swap_map[toff])
break;
- if (si->swap_map[toff] == SWAP_MAP_BAD)
+ if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD)
break;
}
/* Count contiguous allocated slots below our target */
@@ -2043,7 +2134,7 @@ int valid_swaphandles(swp_entry_t entry,
/* Don't read in free or bad pages */
if (!si->swap_map[toff])
break;
- if (si->swap_map[toff] == SWAP_MAP_BAD)
+ if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD)
break;
}
spin_unlock(&swap_lock);
Index: new-trial-swapcount2/mm/swap_state.c
===================================================================
--- new-trial-swapcount2.orig/mm/swap_state.c
+++ new-trial-swapcount2/mm/swap_state.c
@@ -292,7 +292,10 @@ struct page *read_swap_cache_async(swp_e
/*
* Swap entry may have been freed since our caller observed it.
*/
- if (!swapcache_prepare(entry))
+ err = swapcache_prepare(entry);
+ if (err == -EEXIST) /* seems racy */
+ continue;
+ if (err) /* swp entry is obsolete ? */
break;
/*
* [PATCH 3/4] reuse unused swap entry if necessary
From: KAMEZAWA Hiroyuki @ 2009-05-28 5:20 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, nishimura, balbir, hugh.dickins, hannes, akpm
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Now, via swap_map we can tell that a swap entry is used only as SwapCache,
without looking up the swap cache.
Then, we have a chance to reuse swap-cache-only swap entries in
get_swap_page().
This patch tries to free swap-cache-only swap entries when swap space
is running short.
Note: we hit the following path when the swap cluster code cannot find
a free cluster. So vm_swap_full() is not the only condition allowing
the kernel to reclaim unused swap.
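A user-space model of the new "cache-only slot" test (a sketch, not the
kernel code: a swap_map value of exactly SWAP_HAS_CACHE means a swap
cache page exists but nothing else references the slot):
==
#include <assert.h>

#define SWAP_HAS_CACHE  0x8000

/* A slot may be reclaimed and reused only when the swap cache holds
 * the sole reference: the flag is set and the count bits are zero. */
static int cache_only(unsigned short ent)
{
        return ent == SWAP_HAS_CACHE;
}

int main(void)
{
        assert(cache_only(0x8000));     /* cache only: reclaimable */
        assert(!cache_only(0x8001));    /* cache + one pte: busy */
        assert(!cache_only(0x0001));    /* pte only, no cache: nothing to drop */
        return 0;
}
==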
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/swapfile.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
Index: new-trial-swapcount2/mm/swapfile.c
===================================================================
--- new-trial-swapcount2.orig/mm/swapfile.c
+++ new-trial-swapcount2/mm/swapfile.c
@@ -73,6 +73,25 @@ static inline unsigned short make_swap_c
return ret;
}
+static int
+try_to_reuse_swap(struct swap_info_struct *si, unsigned long offset)
+{
+ int type = si - swap_info;
+ swp_entry_t entry = swp_entry(type, offset);
+ struct page *page;
+ int ret = 0;
+
+ page = find_get_page(&swapper_space, entry.val);
+ if (!page)
+ return 0;
+ if (trylock_page(page)) {
+ ret = try_to_free_swap(page);
+ unlock_page(page);
+ }
+ page_cache_release(page);
+ return ret;
+}
+
/*
* We need this because the bdev->unplug_fn can sleep and we cannot
* hold swap_lock while calling the unplug_fn. And swap_lock
@@ -294,6 +313,18 @@ checks:
goto no_page;
if (offset > si->highest_bit)
scan_base = offset = si->lowest_bit;
+
+ /* reuse swap entry of cache-only swap if not busy. */
+ if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+ int ret;
+ spin_unlock(&swap_lock);
+ ret = try_to_reuse_swap(si, offset);
+ spin_lock(&swap_lock);
+ if (ret)
+ goto checks; /* we released swap_lock. retry. */
+ goto scan; /* In some racy case */
+ }
+
if (si->swap_map[offset])
goto scan;
@@ -375,6 +406,10 @@ scan:
spin_lock(&swap_lock);
goto checks;
}
+ if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+ spin_lock(&swap_lock);
+ goto checks;
+ }
if (unlikely(--latency_ration < 0)) {
cond_resched();
latency_ration = LATENCY_LIMIT;
@@ -386,6 +421,10 @@ scan:
spin_lock(&swap_lock);
goto checks;
}
+ if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+ spin_lock(&swap_lock);
+ goto checks;
+ }
if (unlikely(--latency_ration < 0)) {
cond_resched();
latency_ration = LATENCY_LIMIT;
* [PATCH 4/4] memcg: fix swap accounting
From: KAMEZAWA Hiroyuki @ 2009-05-28 5:21 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, nishimura, balbir, hugh.dickins, hannes, akpm
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
This patch fixes mis-accounting of swap usage in memcg.
In the current implementation, memcg's swap account is uncharged only when
swap is completely freed. But there are several cases where swap
cannot be freed cleanly. To handle them, this patch changes memcg to
uncharge the swap account when the swap entry has no references other
than the cache.
With this, memcg's swap entry accounting can be fully synchronous with
the application's behavior.
This patch also changes memcg's hooks for swap-out.
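The key decision, as a user-space sketch (a model of the changed
swapcache_free() logic, not the kernel code: the value returned by
swap_entry_free() tells whether references remain after the cache's
reference is dropped):
==
#include <assert.h>

enum uncharge_type {
        UNCHARGE_DROP,          /* cache was the last user: just drop */
        UNCHARGE_SWAPOUT,       /* references remain: record memcg info */
};

static enum uncharge_type classify(int refs_left_after_cache_drop)
{
        return refs_left_after_cache_drop ? UNCHARGE_SWAPOUT : UNCHARGE_DROP;
}

int main(void)
{
        assert(classify(1) == UNCHARGE_SWAPOUT);        /* a real swap-out */
        assert(classify(0) == UNCHARGE_DROP);           /* unused swap cache */
        return 0;
}
==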
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/swap.h | 5 +++--
mm/memcontrol.c | 17 ++++++++++++-----
mm/swapfile.c | 14 ++++++++++----
3 files changed, 25 insertions(+), 11 deletions(-)
Index: new-trial-swapcount2/include/linux/swap.h
===================================================================
--- new-trial-swapcount2.orig/include/linux/swap.h
+++ new-trial-swapcount2/include/linux/swap.h
@@ -338,10 +338,11 @@ static inline void disable_swap_token(vo
}
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
+extern void
+mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, int swapout);
#else
static inline void
-mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
+mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, int swapout)
{
}
#endif
Index: new-trial-swapcount2/mm/memcontrol.c
===================================================================
--- new-trial-swapcount2.orig/mm/memcontrol.c
+++ new-trial-swapcount2/mm/memcontrol.c
@@ -189,6 +189,7 @@ enum charge_type {
MEM_CGROUP_CHARGE_TYPE_SHMEM, /* used by page migration of shmem */
MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
MEM_CGROUP_CHARGE_TYPE_SWAPOUT, /* for accounting swapcache */
+ MEM_CGROUP_CHARGE_TYPE_DROP, /* a page was unused swap cache */
NR_CHARGE_TYPE,
};
@@ -1501,6 +1502,7 @@ __mem_cgroup_uncharge_common(struct page
switch (ctype) {
case MEM_CGROUP_CHARGE_TYPE_MAPPED:
+ case MEM_CGROUP_CHARGE_TYPE_DROP:
if (page_mapped(page))
goto unlock_out;
break;
@@ -1564,18 +1566,23 @@ void mem_cgroup_uncharge_cache_page(stru
* called after __delete_from_swap_cache() and drop "page" account.
* memcg information is recorded to swap_cgroup of "ent"
*/
-void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
+void
+mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, int swapout)
{
struct mem_cgroup *memcg;
+ int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
+
+ if (!swapout) /* this was a swap cache but the swap is unused ! */
+ ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
+
+ memcg = __mem_cgroup_uncharge_common(page, ctype);
- memcg = __mem_cgroup_uncharge_common(page,
- MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
/* record memcg information */
- if (do_swap_account && memcg) {
+ if (do_swap_account && swapout && memcg) {
swap_cgroup_record(ent, css_id(&memcg->css));
mem_cgroup_get(memcg);
}
- if (memcg)
+ if (swapout && memcg)
css_put(&memcg->css);
}
#endif
Index: new-trial-swapcount2/mm/swapfile.c
===================================================================
--- new-trial-swapcount2.orig/mm/swapfile.c
+++ new-trial-swapcount2/mm/swapfile.c
@@ -566,8 +566,9 @@ static int swap_entry_free(struct swap_i
swap_list.next = p - swap_info;
nr_swap_pages++;
p->inuse_pages--;
- mem_cgroup_uncharge_swap(ent);
}
+ if (!swap_count(count))
+ mem_cgroup_uncharge_swap(ent);
return count;
}
@@ -592,12 +593,17 @@ void swap_free(swp_entry_t entry)
void swapcache_free(swp_entry_t entry, struct page *page)
{
struct swap_info_struct *p;
+ int ret;
- if (page)
- mem_cgroup_uncharge_swapcache(page, entry);
p = swap_info_get(entry);
if (p) {
- swap_entry_free(p, entry, 1);
+ ret = swap_entry_free(p, entry, 1);
+ if (page) {
+ if (ret)
+ mem_cgroup_uncharge_swapcache(page, entry, 1);
+ else
+ mem_cgroup_uncharge_swapcache(page, entry, 0);
+ }
spin_unlock(&swap_lock);
}
return;
* Re: [PATCH 1/4] add swap cache interface for swap reference
From: Daisuke Nishimura @ 2009-05-29 4:21 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, balbir, hugh.dickins, hannes, akpm,
Daisuke Nishimura
On Thu, 28 May 2009 14:10:49 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> In a following patch, usage of the swap cache will be recorded in swap_map.
> This patch makes the interface changes necessary for that.
>
> 2 interfaces:
> - swapcache_prepare()
> - swapcache_free()
> are added for taking/releasing a refcnt from the swap cache on existing
> swap entries. The implementation itself is not changed by this patch.
> While adding swapcache_free(), memcg's hook code is moved under
> swapcache_free(). This is better than using scattered hooks.
>
IIUC, swap_free() at the end of shmem_writepage() should also be changed to swapcache_free().
Thanks,
Daisuke Nishimura.
* Re: [PATCH 1/4] add swap cache interface for swap reference
From: KAMEZAWA Hiroyuki @ 2009-05-29 5:08 UTC
To: Daisuke Nishimura
Cc: linux-mm, linux-kernel, balbir, hugh.dickins, hannes, akpm
On Fri, 29 May 2009 13:21:53 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Thu, 28 May 2009 14:10:49 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > In a following patch, usage of the swap cache will be recorded in swap_map.
> > This patch makes the interface changes necessary for that.
> >
> > 2 interfaces:
> > - swapcache_prepare()
> > - swapcache_free()
> > are added for taking/releasing a refcnt from the swap cache on existing
> > swap entries. The implementation itself is not changed by this patch.
> > While adding swapcache_free(), memcg's hook code is moved under
> > swapcache_free(). This is better than using scattered hooks.
> >
> IIUC, swap_free() at the end of shmem_writepage() should also be changed to swapcache_free().
>
Hmm! Oh, yes. shmem_writepage()'s error path. Thank you. It will be fixed.
Thanks,
-Kame
* [PATCH 1/4] add swap cache interface for swap reference v2 (updated)
From: KAMEZAWA Hiroyuki @ 2009-05-29 5:37 UTC
To: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura, linux-mm, linux-kernel, balbir, hugh.dickins,
hannes, akpm
On Fri, 29 May 2009 14:08:32 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > IIUC, swap_free() at the end of shmem_writepage() should also be changed to swapcache_free().
> >
> Hmm!. Oh, yes. shmem_writepage()'s error path. Thank you. It will be fixed.
>
here.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
In a following patch, usage of the swap cache will be recorded in swap_map.
This patch makes the interface changes necessary for that.
2 interfaces:
- swapcache_prepare()
- swapcache_free()
are added for taking/releasing a refcnt from the swap cache on existing
swap entries. The implementation itself is not changed by this patch.
While adding swapcache_free(), memcg's hook code is moved under
swapcache_free(). This is better than using scattered hooks.
Changelog: v1->v2
- fixed shmem_writepage() error path.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/swap.h | 7 +++++++
mm/shmem.c | 2 +-
mm/swap_state.c | 11 +++++------
mm/swapfile.c | 19 +++++++++++++++++++
mm/vmscan.c | 3 +--
5 files changed, 33 insertions(+), 9 deletions(-)
Index: mmotm-2.6.30-May28/include/linux/swap.h
===================================================================
--- mmotm-2.6.30-May28.orig/include/linux/swap.h
+++ mmotm-2.6.30-May28/include/linux/swap.h
@@ -282,8 +282,10 @@ extern void si_swapinfo(struct sysinfo *
extern swp_entry_t get_swap_page(void);
extern swp_entry_t get_swap_page_of_type(int);
extern int swap_duplicate(swp_entry_t);
+extern int swapcache_prepare(swp_entry_t);
extern int valid_swaphandles(swp_entry_t, unsigned long *);
extern void swap_free(swp_entry_t);
+extern void swapcache_free(swp_entry_t, struct page *page);
extern int free_swap_and_cache(swp_entry_t);
extern int swap_type_of(dev_t, sector_t, struct block_device **);
extern unsigned int count_swap_pages(int, int);
@@ -352,11 +354,16 @@ static inline void show_swap_cache_info(
#define free_swap_and_cache(swp) is_migration_entry(swp)
#define swap_duplicate(swp) is_migration_entry(swp)
+#define swapcache_prepare(swp) is_migration_entry(swp)
static inline void swap_free(swp_entry_t swp)
{
}
+static inline void swapcache_free(swp_entry_t swp, struct page *page)
+{
+}
+
static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr)
{
Index: mmotm-2.6.30-May28/mm/swap_state.c
===================================================================
--- mmotm-2.6.30-May28.orig/mm/swap_state.c
+++ mmotm-2.6.30-May28/mm/swap_state.c
@@ -162,11 +162,11 @@ int add_to_swap(struct page *page)
return 1;
case -EEXIST:
/* Raced with "speculative" read_swap_cache_async */
- swap_free(entry);
+ swapcache_free(entry, NULL);
continue;
default:
/* -ENOMEM radix-tree allocation failure */
- swap_free(entry);
+ swapcache_free(entry, NULL);
return 0;
}
}
@@ -188,8 +188,7 @@ void delete_from_swap_cache(struct page
__delete_from_swap_cache(page);
spin_unlock_irq(&swapper_space.tree_lock);
- mem_cgroup_uncharge_swapcache(page, entry);
- swap_free(entry);
+ swapcache_free(entry, page);
page_cache_release(page);
}
@@ -293,7 +292,7 @@ struct page *read_swap_cache_async(swp_e
/*
* Swap entry may have been freed since our caller observed it.
*/
- if (!swap_duplicate(entry))
+ if (!swapcache_prepare(entry))
break;
/*
@@ -317,7 +316,7 @@ struct page *read_swap_cache_async(swp_e
}
ClearPageSwapBacked(new_page);
__clear_page_locked(new_page);
- swap_free(entry);
+ swapcache_free(entry, NULL);
} while (err != -ENOMEM);
if (new_page)
Index: mmotm-2.6.30-May28/mm/swapfile.c
===================================================================
--- mmotm-2.6.30-May28.orig/mm/swapfile.c
+++ mmotm-2.6.30-May28/mm/swapfile.c
@@ -510,6 +510,16 @@ void swap_free(swp_entry_t entry)
}
/*
+ * Called after dropping swapcache to decrease refcnt to swap entries.
+ */
+void swapcache_free(swp_entry_t entry, struct page *page)
+{
+ if (page)
+ mem_cgroup_uncharge_swapcache(page, entry);
+ return swap_free(entry);
+}
+
+/*
* How many references to page are currently swapped out?
*/
static inline int page_swapcount(struct page *page)
@@ -1979,6 +1989,15 @@ bad_file:
goto out;
}
+/*
+ * Called when allocating swap cache for an existing swap entry.
+ */
+int swapcache_prepare(swp_entry_t entry)
+{
+ return swap_duplicate(entry);
+}
+
+
struct swap_info_struct *
get_swap_info_struct(unsigned type)
{
Index: mmotm-2.6.30-May28/mm/vmscan.c
===================================================================
--- mmotm-2.6.30-May28.orig/mm/vmscan.c
+++ mmotm-2.6.30-May28/mm/vmscan.c
@@ -477,8 +477,7 @@ static int __remove_mapping(struct addre
swp_entry_t swap = { .val = page_private(page) };
__delete_from_swap_cache(page);
spin_unlock_irq(&mapping->tree_lock);
- mem_cgroup_uncharge_swapcache(page, swap);
- swap_free(swap);
+ swapcache_free(swap, page);
} else {
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
Index: mmotm-2.6.30-May28/mm/shmem.c
===================================================================
--- mmotm-2.6.30-May28.orig/mm/shmem.c
+++ mmotm-2.6.30-May28/mm/shmem.c
@@ -1097,7 +1097,7 @@ static int shmem_writepage(struct page *
shmem_swp_unmap(entry);
unlock:
spin_unlock(&info->lock);
- swap_free(swap);
+ swapcache_free(swap, NULL);
redirty:
set_page_dirty(page);
if (wbc->for_reclaim)
* Re: [PATCH 1/4] add swap cache interface for swap reference v2 (updated)
From: Daisuke Nishimura @ 2009-05-29 6:05 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, balbir, hugh.dickins, hannes, akpm,
Daisuke Nishimura
On Fri, 29 May 2009 14:37:58 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 29 May 2009 14:08:32 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > IIUC, swap_free() at the end of shmem_writepage() should also be changed to swapcache_free().
> > >
> > Hmm! Oh, yes. shmem_writepage()'s error path. Thank you. It will be fixed.
> >
> here.
>
Looks good to me.
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
BTW, I'm now testing (with swap-in/out and swap-on/off) [2/4] of this patch set.
I think this patch set will work well, but it's a big change to swap,
so we should test it very carefully.
Thanks,
Daisuke Nishimura.
* Re: [PATCH 1/4] add swap cache interface for swap reference v2 (updated)
From: KAMEZAWA Hiroyuki @ 2009-05-29 6:53 UTC
To: Daisuke Nishimura
Cc: linux-mm, linux-kernel, balbir, hugh.dickins, hannes, akpm
On Fri, 29 May 2009 15:05:25 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Fri, 29 May 2009 14:37:58 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Fri, 29 May 2009 14:08:32 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > IIUC, swap_free() at the end of shmem_writepage() should also be changed to swapcache_free().
> > > >
> > > Hmm! Oh, yes. shmem_writepage()'s error path. Thank you. It will be fixed.
> > >
> > here.
> >
> Looks good to me.
>
> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
>
>
> BTW, I'm now testing (with swap-in/out and swap-on/off) [2/4] of this patch set.
> I think this patch set will work well, but it's a big change to swap,
> so we should test it very carefully.
>
Indeed.
BTW, even if we ignore memcg, [2/4] and [3/4] are necessary changes for us.
IIUC, I remember an engineer at NEC had a story like the one below.
1. Create 2 cpusets, A and B.
2. At first, tons of swap entries are created by "A".
3. After the applications in A shrink, the pages swapped out by "A" are now in memory.
4. When a program runs in B, B can't get enough swap because "A" holds tons of
   cache-only swap entries.
The swap entries are not reclaimed at step 4 because cpusets divide the LRU.
I think patch [3/4] can be a silver bullet for this problem.
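To make the scenario concrete, a hypothetical debug helper (not in this
series; it assumes the SWAP_HAS_CACHE encoding from [2/4] and that the
caller holds swap_lock) could count how many slots "A" pins as cache-only:

	/* count swap slots held only by the swap cache */
	static unsigned long count_cache_only_slots(struct swap_info_struct *si)
	{
		unsigned long offset, nr = 0;

		for (offset = 1; offset < si->max; offset++)
			if (si->swap_map[offset] == SWAP_HAS_CACHE)
				nr++;
		return nr;
	}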
Thanks,
-Kame
* Re: [PATCH 3/4] reuse unused swap entry if necessary
2009-05-28 5:20 ` [PATCH 3/4] reuse unused swap entry if necessary KAMEZAWA Hiroyuki
@ 2009-05-29 21:55 ` Andrew Morton
2009-05-30 11:11 ` KAMEZAWA Hiroyuki
2009-05-30 6:40 ` Balbir Singh
1 sibling, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2009-05-29 21:55 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, nishimura, balbir, hugh.dickins, hannes
On Thu, 28 May 2009 14:20:47 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Now, we can know via swap_map that a swap entry is used only as SwapCache,
> without looking up the swap cache.
>
> Then, we have a chance to reuse swap-cache-only swap entries in
> get_swap_page().
>
> This patch tries to free swap-cache-only swap entries if swap space is
> not enough.
> Note: we hit the following path when the swap-cluster code cannot find
> a free cluster. Then, vm_swap_full() is not the only condition that allows
> the kernel to reclaim unused swap.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/swapfile.c | 39 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 39 insertions(+)
>
> Index: new-trial-swapcount2/mm/swapfile.c
> ===================================================================
> --- new-trial-swapcount2.orig/mm/swapfile.c
> +++ new-trial-swapcount2/mm/swapfile.c
> @@ -73,6 +73,25 @@ static inline unsigned short make_swap_c
> return ret;
> }
>
> +static int
> +try_to_reuse_swap(struct swap_info_struct *si, unsigned long offset)
> +{
> + int type = si - swap_info;
> + swp_entry_t entry = swp_entry(type, offset);
> + struct page *page;
> + int ret = 0;
> +
> + page = find_get_page(&swapper_space, entry.val);
> + if (!page)
> + return 0;
> + if (trylock_page(page)) {
> + ret = try_to_free_swap(page);
> + unlock_page(page);
> + }
> + page_cache_release(page);
> + return ret;
> +}
This function could do with some comments explaining what it does, and
why. Also describing the semantics of its return value.
afaict it's misnamed. It doesn't 'reuse' anything. It in fact tries
to release a swap entry so that (presumably) its _caller_ can reuse the
swap slot.
The missing comment should also explain why this function is forced to
use the nasty trylock_page().
Why _is_ this function forced to use the nasty trylock_page()?
> /*
> * We need this because the bdev->unplug_fn can sleep and we cannot
> * hold swap_lock while calling the unplug_fn. And swap_lock
> @@ -294,6 +313,18 @@ checks:
> goto no_page;
> if (offset > si->highest_bit)
> scan_base = offset = si->lowest_bit;
> +
> + /* reuse swap entry of cache-only swap if not busy. */
> + if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> + int ret;
> + spin_unlock(&swap_lock);
> + ret = try_to_reuse_swap(si, offset);
> + spin_lock(&swap_lock);
> + if (ret)
> + goto checks; /* we released swap_lock. retry. */
> + goto scan; /* In some racy case */
> + }
So.. what prevents an infinite (or long) busy loop here? It appears
that if try_to_reuse_swap() returned non-zero, it will have cleared
si->swap_map[offset], so we don't rerun try_to_reuse_swap(). Yes?
`ret' is a poor choice of identifier. It is usually used to hold the
value which this function will be returning. Ditto `retval'. But that
is not this variable's role in this case. Perhaps a better name would
be slot_was_freed or something.
> if (si->swap_map[offset])
> goto scan;
>
> @@ -375,6 +406,10 @@ scan:
> spin_lock(&swap_lock);
> goto checks;
> }
> + if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> + spin_lock(&swap_lock);
> + goto checks;
> + }
> if (unlikely(--latency_ration < 0)) {
> cond_resched();
> latency_ration = LATENCY_LIMIT;
> @@ -386,6 +421,10 @@ scan:
> spin_lock(&swap_lock);
> goto checks;
> }
> + if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> + spin_lock(&swap_lock);
> + goto checks;
> + }
> if (unlikely(--latency_ration < 0)) {
> cond_resched();
> latency_ration = LATENCY_LIMIT;
* Re: [PATCH 1/4] add swap cache interface for swap reference v2 (updated)
2009-05-29 5:37 ` [PATCH 1/4] add swap cache interface for swap reference v2 (updated) KAMEZAWA Hiroyuki
2009-05-29 6:05 ` Daisuke Nishimura
@ 2009-05-30 5:21 ` Balbir Singh
1 sibling, 0 replies; 19+ messages in thread
From: Balbir Singh @ 2009-05-30 5:21 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura, linux-mm, linux-kernel, hugh.dickins, hannes, akpm
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-05-29 14:37:58]:
> On Fri, 29 May 2009 14:08:32 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > IIUC, swap_free() at the end of shmem_writepage() should also be changed to swapcache_free().
> > >
> > Hmm! Oh, yes. shmem_writepage()'s error path. Thank you. It will be fixed.
> >
> here.
>
> ==
>
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> In the following patch, usage of swap cache will be recorded into swap_map.
> This patch makes the interface changes necessary to do that.
>
> 2 interfaces:
> - swapcache_prepare()
> - swapcache_free()
> are added for allocating/freeing a refcnt from swap cache to existing
> swap entries. But the implementation itself is not changed by this patch.
> When adding swapcache_free(), memcg's hook code is moved under swapcache_free().
> This is better than using scattered hooks.
>
> Changelog: v1->v2
> - fixed shmem_writepage() error path.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Looks good to me so far
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
--
Balbir
* Re: [PATCH 2/4] modify swap_map and add SWAP_HAS_CACHE flag.
2009-05-28 5:19 ` [PATCH 2/4] modify swap_map and add SWAP_HAS_CACHE flag KAMEZAWA Hiroyuki
@ 2009-05-30 6:10 ` Balbir Singh
2009-05-30 11:16 ` KAMEZAWA Hiroyuki
2009-06-01 7:04 ` Daisuke Nishimura
1 sibling, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2009-05-30 6:10 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, nishimura, hugh.dickins, hannes, akpm
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-05-28 14:19:00]:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> This is part of a patch series for fixing memcg's swap account leak. But, IMHO,
> it is not a bad patch even without memcg.
>
> Now, references to swap are counted by swap_map[], an array of unsigned short.
> There are 2 kinds of references to swap.
> - reference from a swap entry
> - reference from swap cache
> Then,
> - If there is a swap cache && swap's refcnt is 1, there is only the swap cache.
> (*) swapcount(entry) == 1 && find_get_page(swapper_space, entry) != NULL
>
> This counting logic has worked well for a long time. But considering that
> we cannot tell from swap_map[] whether there is a _real_ reference or not,
> the current usage of the counter is not very good.
>
> This patch adds a flag SWAP_HAS_CACHE and records whether a swap entry
> has a swap cache or not. This removes the -1 magic used in swapfile.c and
> helps avoid unnecessary find_get_page() calls.
>
> Changelog: v1->v2
> - fixed swapcache_prepare()'s return code.
> - changed swap_duplicate() to be a void function.
> - fixed a racy case in swapoff().
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> include/linux/swap.h | 14 ++-
> mm/swap_state.c | 5 +
> mm/swapfile.c | 203 ++++++++++++++++++++++++++++++++++++---------------
> 3 files changed, 160 insertions(+), 62 deletions(-)
>
> Index: new-trial-swapcount2/include/linux/swap.h
> ===================================================================
> --- new-trial-swapcount2.orig/include/linux/swap.h
> +++ new-trial-swapcount2/include/linux/swap.h
> @@ -129,9 +129,10 @@ enum {
>
> #define SWAP_CLUSTER_MAX 32
>
> -#define SWAP_MAP_MAX 0x7fff
> -#define SWAP_MAP_BAD 0x8000
> -
> +#define SWAP_MAP_MAX 0x7ffe
> +#define SWAP_MAP_BAD 0x7fff
> +#define SWAP_HAS_CACHE 0x8000 /* There is a swap cache for this entry. */
Why count, can't we use swp->flags?
> +#define SWAP_COUNT_MASK (~SWAP_HAS_CACHE)
> /*
> * The in-memory structure used to track swap areas.
> */
> @@ -300,7 +301,7 @@ extern long total_swap_pages;
> extern void si_swapinfo(struct sysinfo *);
> extern swp_entry_t get_swap_page(void);
> extern swp_entry_t get_swap_page_of_type(int);
> -extern int swap_duplicate(swp_entry_t);
> +extern void swap_duplicate(swp_entry_t);
> extern int swapcache_prepare(swp_entry_t);
> extern int valid_swaphandles(swp_entry_t, unsigned long *);
> extern void swap_free(swp_entry_t);
> @@ -372,9 +373,12 @@ static inline void show_swap_cache_info(
> }
>
> #define free_swap_and_cache(swp) is_migration_entry(swp)
> -#define swap_duplicate(swp) is_migration_entry(swp)
> #define swapcache_prepare(swp) is_migration_entry(swp)
>
> +static inline void swap_duplicate(swp_entry_t swp)
> +{
> +}
> +
> static inline void swap_free(swp_entry_t swp)
> {
> }
> Index: new-trial-swapcount2/mm/swapfile.c
> ===================================================================
> --- new-trial-swapcount2.orig/mm/swapfile.c
> +++ new-trial-swapcount2/mm/swapfile.c
> @@ -53,6 +53,26 @@ static struct swap_info_struct swap_info
>
> static DEFINE_MUTEX(swapon_mutex);
>
> +/* For reference count accounting in swap_map */
> +static inline int swap_count(unsigned short ent)
> +{
> + return ent & SWAP_COUNT_MASK;
> +}
> +
> +static inline int swap_has_cache(unsigned short ent)
> +{
> + return ent & SWAP_HAS_CACHE;
> +}
> +
> +static inline unsigned short make_swap_count(int count, int has_cache)
> +{
> + unsigned short ret = count;
> +
> + if (has_cache)
> + return SWAP_HAS_CACHE | ret;
> + return ret;
> +}
make_swap_count() is not a very descriptive name for this function.
Should it be called generate_swap_count() or assign_swap_count_info()?
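Whatever the name, a quick illustration of the encoding these helpers
implement (the values follow from the defines quoted above):

	unsigned short ent;

	ent = make_swap_count(3, 1);	/* count 3, page is in swap cache */
	/* ent == 0x8003 */
	swap_count(ent);		/* == 3 */
	swap_has_cache(ent);		/* == 0x8000, i.e. true */
	make_swap_count(0, 1);		/* == SWAP_HAS_CACHE: cache-only slot */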
> +
> /*
> * We need this because the bdev->unplug_fn can sleep and we cannot
> * hold swap_lock while calling the unplug_fn. And swap_lock
> @@ -167,7 +187,8 @@ static int wait_for_discard(void *word)
> #define SWAPFILE_CLUSTER 256
> #define LATENCY_LIMIT 256
>
> -static inline unsigned long scan_swap_map(struct swap_info_struct *si)
> +static inline unsigned long scan_swap_map(struct swap_info_struct *si,
> + int cache)
Can we please use bool for readability or even better an enum?
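Something like this, perhaps (a hypothetical sketch of the enum variant,
not code from the patch):

	/* name the two allocation modes instead of passing a bare 0/1 */
	enum swap_alloc {
		SWAP_ALLOC_CACHE,	/* usual swap-out via vmscan */
		SWAP_ALLOC_SUSPEND,	/* suspend/hibernation image */
	};

	static inline unsigned long scan_swap_map(struct swap_info_struct *si,
						  enum swap_alloc mode);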
Looks good at first glance otherwise. I think distinguishing between
the counts is good, but also complex. Overall the patch is useful.
--
Balbir
* Re: [PATCH 3/4] reuse unused swap entry if necessary
2009-05-28 5:20 ` [PATCH 3/4] reuse unused swap entry if necessary KAMEZAWA Hiroyuki
2009-05-29 21:55 ` Andrew Morton
@ 2009-05-30 6:40 ` Balbir Singh
1 sibling, 0 replies; 19+ messages in thread
From: Balbir Singh @ 2009-05-30 6:40 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, nishimura, hugh.dickins, hannes, akpm
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-05-28 14:20:47]:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Now, we can know via swap_map that a swap entry is used only as SwapCache,
> without looking up the swap cache.
>
> Then, we have a chance to reuse swap-cache-only swap entries in
> get_swap_page().
>
> This patch tries to free swap-cache-only swap entries if swap space is
> not enough.
> Note: we hit the following path when the swap-cluster code cannot find
> a free cluster. Then, vm_swap_full() is not the only condition that allows
> the kernel to reclaim unused swap.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Looks good, except for the now-changed behaviour where swap will get
distributed between clusters. Having the vm_swap_full() check is a good
optimization for swap allocation performance, though. I'd say let's wait
to see the results of testing.
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
--
Balbir
* Re: [PATCH 4/4] memcg: fix swap accounting
2009-05-28 5:21 ` [PATCH 4/4] memcg: fix swap accounting KAMEZAWA Hiroyuki
@ 2009-05-30 7:20 ` Balbir Singh
0 siblings, 0 replies; 19+ messages in thread
From: Balbir Singh @ 2009-05-30 7:20 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, nishimura, hugh.dickins, hannes, akpm
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-05-28 14:21:56]:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> This patch fixes mis-accounting of swap usage in memcg.
>
> In the current implementation, memcg's swap account is uncharged only when
> the swap entry is completely freed. But there are several cases where swap
> cannot be freed cleanly. To handle that, this patch changes memcg to
> uncharge the swap account when the swap entry has no references other than
> the cache.
>
> With this, memcg's swap entry accounting can be fully synchronous with
> the application's behavior.
> This patch also changes memcg's hooks for swap-out.
>
Looks good. So for count == 0 we directly free and uncharge, and for
the others we use try_to_reuse_swap(). Cool!
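That is, as I read the free_swap_and_cache() hunk quoted earlier in the
thread (a rough sketch, not the patch itself):

	p = swap_info_get(entry);
	if (p) {
		/* drop the usage reference; a return value of
		 * SWAP_HAS_CACHE means only the swap cache still
		 * pins this slot */
		if (swap_entry_free(p, entry, 0) == SWAP_HAS_CACHE) {
			page = find_get_page(&swapper_space, entry.val);
			if (page && !trylock_page(page)) {
				page_cache_release(page);
				page = NULL;
			}
		}
		spin_unlock(&swap_lock);
	}
	if (page) {
		/* cache-only and uncontended: free the cache page,
		 * which also releases the memcg swap charge */
	}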
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
--
Balbir
* Re: [PATCH 3/4] reuse unused swap entry if necessary
2009-05-29 21:55 ` Andrew Morton
@ 2009-05-30 11:11 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-30 11:11 UTC (permalink / raw)
To: Andrew Morton
Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, nishimura, balbir,
hugh.dickins, hannes
Andrew Morton wrote:
> On Thu, 28 May 2009 14:20:47 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>> Now, we can know via swap_map that a swap entry is used only as SwapCache,
>> without looking up the swap cache.
>>
>> Then, we have a chance to reuse swap-cache-only swap entries in
>> get_swap_page().
>>
>> This patch tries to free swap-cache-only swap entries if swap space is
>> not enough.
>> Note: we hit the following path when the swap-cluster code cannot find
>> a free cluster. Then, vm_swap_full() is not the only condition that allows
>> the kernel to reclaim unused swap.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>> mm/swapfile.c | 39 +++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 39 insertions(+)
>>
>> Index: new-trial-swapcount2/mm/swapfile.c
>> ===================================================================
>> --- new-trial-swapcount2.orig/mm/swapfile.c
>> +++ new-trial-swapcount2/mm/swapfile.c
>> @@ -73,6 +73,25 @@ static inline unsigned short make_swap_c
>> return ret;
>> }
>>
>> +static int
>> +try_to_reuse_swap(struct swap_info_struct *si, unsigned long offset)
>> +{
>> + int type = si - swap_info;
>> + swp_entry_t entry = swp_entry(type, offset);
>> + struct page *page;
>> + int ret = 0;
>> +
>> + page = find_get_page(&swapper_space, entry.val);
>> + if (!page)
>> + return 0;
>> + if (trylock_page(page)) {
>> + ret = try_to_free_swap(page);
>> + unlock_page(page);
>> + }
>> + page_cache_release(page);
>> + return ret;
>> +}
>
> This function could do with some comments explaining what it does, and
> why. Also describing the semantics of its return value.
>
Ah, there are no comments ...
> afaict it's misnamed. It doesn't 'reuse' anything. It in fact tries
> to release a swap entry so that (presumably) its _caller_ can reuse the
> swap slot.
>
yes.
> The missing comment should also explain why this function is forced to
> use the nasty trylock_page().
>
> Why _is_ this function forced to use the nasty trylock_page()?
>
Because get_swap_page() is called from vmscan.c, and when it is called
the caller already holds lock_page() on a page. IIUC, nesting lock_page()
without a trylock is not good here.
I'll explain this in the next post.
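For the record, the call chain as I understand it (a sketch of the
reasoning, not code from the patch):

	/*
	 * shrink_page_list()          <- vmscan holds lock_page() on page A
	 *   add_to_swap(page A)
	 *     get_swap_page()
	 *       scan_swap_map()
	 *         try_to_reuse_swap() <- must not sleep in lock_page() on
	 *                                swap-cache page B while A is held,
	 *                                so trylock_page(B) is the only
	 *                                safe option.
	 */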
>> /*
>> * We need this because the bdev->unplug_fn can sleep and we cannot
>> * hold swap_lock while calling the unplug_fn. And swap_lock
>> @@ -294,6 +313,18 @@ checks:
>> goto no_page;
>> if (offset > si->highest_bit)
>> scan_base = offset = si->lowest_bit;
>> +
>> + /* reuse swap entry of cache-only swap if not busy. */
>> + if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
>> + int ret;
>> + spin_unlock(&swap_lock);
>> + ret = try_to_reuse_swap(si, offset);
>> + spin_lock(&swap_lock);
>> + if (ret)
>> + goto checks; /* we released swap_lock. retry. */
>> + goto scan; /* In some racy case */
>> + }
>
> So.. what prevents an infinite (or long) busy loop here? It appears
> that if try_to_reuse_swap() returned non-zero, it will have cleared
> si->swap_map[offset], so we don't rerun try_to_reuse_swap(). Yes?
>
yes.
> `ret' is a poor choice of identifier. It is usually used to hold the
> value which this function will be returning. Ditto `retval'. But that
> is not this variable's role in this case. Perhaps a better name would
> be slot_was_freed or something.
>
Sure, I'll modify this patch to be a clearer one.
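For instance, the hunk could read like this (illustrative only):

	/* reuse swap entry of cache-only swap if not busy. */
	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
		int slot_was_freed;

		spin_unlock(&swap_lock);
		slot_was_freed = try_to_reuse_swap(si, offset);
		spin_lock(&swap_lock);
		if (slot_was_freed)
			goto checks;	/* we released swap_lock. retry. */
		goto scan;		/* in some racy case */
	}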
Thank you for review!
-Kame
>> if (si->swap_map[offset])
>> goto scan;
>>
>> @@ -375,6 +406,10 @@ scan:
>> spin_lock(&swap_lock);
>> goto checks;
>> }
>> + if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
>> + spin_lock(&swap_lock);
>> + goto checks;
>> + }
>> if (unlikely(--latency_ration < 0)) {
>> cond_resched();
>> latency_ration = LATENCY_LIMIT;
>> @@ -386,6 +421,10 @@ scan:
>> spin_lock(&swap_lock);
>> goto checks;
>> }
>> + if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
>> + spin_lock(&swap_lock);
>> + goto checks;
>> + }
>> if (unlikely(--latency_ration < 0)) {
>> cond_resched();
>> latency_ration = LATENCY_LIMIT;
>
* Re: [PATCH 2/4] modify swap_map and add SWAP_HAS_CACHE flag.
2009-05-30 6:10 ` Balbir Singh
@ 2009-05-30 11:16 ` KAMEZAWA Hiroyuki
2009-05-30 11:35 ` Balbir Singh
0 siblings, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-30 11:16 UTC (permalink / raw)
To: balbir
Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, nishimura,
hugh.dickins, hannes, akpm
Balbir Singh wrote:
>> #define SWAP_CLUSTER_MAX 32
>>
>> -#define SWAP_MAP_MAX 0x7fff
>> -#define SWAP_MAP_BAD 0x8000
>> -
>> +#define SWAP_MAP_MAX 0x7ffe
>> +#define SWAP_MAP_BAD 0x7fff
> >> +#define SWAP_HAS_CACHE 0x8000 /* There is a swap cache for this entry. */
>
> Why count, can't we use swp->flags?
>
Hmm? swap_map is just an "unsigned short" value per entry... sorry,
I can't catch what you are referring to.
>> +#define SWAP_COUNT_MASK (~SWAP_HAS_CACHE)
>> /*
>> * The in-memory structure used to track swap areas.
>> */
>> @@ -300,7 +301,7 @@ extern long total_swap_pages;
>> extern void si_swapinfo(struct sysinfo *);
>> extern swp_entry_t get_swap_page(void);
>> extern swp_entry_t get_swap_page_of_type(int);
>> -extern int swap_duplicate(swp_entry_t);
>> +extern void swap_duplicate(swp_entry_t);
>> extern int swapcache_prepare(swp_entry_t);
>> extern int valid_swaphandles(swp_entry_t, unsigned long *);
>> extern void swap_free(swp_entry_t);
>> @@ -372,9 +373,12 @@ static inline void show_swap_cache_info(
>> }
>>
>> #define free_swap_and_cache(swp) is_migration_entry(swp)
>> -#define swap_duplicate(swp) is_migration_entry(swp)
>> #define swapcache_prepare(swp) is_migration_entry(swp)
>>
>> +static inline void swap_duplicate(swp_entry_t swp)
>> +{
>> +}
>> +
>> static inline void swap_free(swp_entry_t swp)
>> {
>> }
>> Index: new-trial-swapcount2/mm/swapfile.c
>> ===================================================================
>> --- new-trial-swapcount2.orig/mm/swapfile.c
>> +++ new-trial-swapcount2/mm/swapfile.c
>> @@ -53,6 +53,26 @@ static struct swap_info_struct swap_info
>>
>> static DEFINE_MUTEX(swapon_mutex);
>>
>> +/* For reference count accounting in swap_map */
>> +static inline int swap_count(unsigned short ent)
>> +{
>> + return ent & SWAP_COUNT_MASK;
>> +}
>> +
>> +static inline int swap_has_cache(unsigned short ent)
>> +{
>> + return ent & SWAP_HAS_CACHE;
>> +}
>> +
>> +static inline unsigned short make_swap_count(int count, int has_cache)
>> +{
>> + unsigned short ret = count;
>> +
>> + if (has_cache)
>> + return SWAP_HAS_CACHE | ret;
>> + return ret;
>> +}
>
> make_swap_count() is not a very descriptive name for this function.
> Should it be called generate_swap_count() or assign_swap_count_info()?
Hmm? OK, I'll rename this to generate_swap_count() or
generate_swapmap_info().
>> +
>> /*
>> * We need this because the bdev->unplug_fn can sleep and we cannot
>> * hold swap_lock while calling the unplug_fn. And swap_lock
>> @@ -167,7 +187,8 @@ static int wait_for_discard(void *word)
>> #define SWAPFILE_CLUSTER 256
>> #define LATENCY_LIMIT 256
>>
>> -static inline unsigned long scan_swap_map(struct swap_info_struct *si)
>> +static inline unsigned long scan_swap_map(struct swap_info_struct *si,
>> + int cache)
>
> Can we please use bool for readability or even better an enum?
>
ok, enum.
> Looks good at first glance otherwise. I think distinguishing between
> the counts is good, but also complex. Overall the patch is useful.
>
Thank you for review.
-Kame
> --
> Balbir
>
* Re: [PATCH 2/4] modify swap_map and add SWAP_HAS_CACHE flag.
2009-05-30 11:16 ` KAMEZAWA Hiroyuki
@ 2009-05-30 11:35 ` Balbir Singh
0 siblings, 0 replies; 19+ messages in thread
From: Balbir Singh @ 2009-05-30 11:35 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, nishimura, hugh.dickins, hannes, akpm
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-05-30 20:16:51]:
> Balbir Singh wrote:
> >> #define SWAP_CLUSTER_MAX 32
> >>
> >> -#define SWAP_MAP_MAX 0x7fff
> >> -#define SWAP_MAP_BAD 0x8000
> >> -
> >> +#define SWAP_MAP_MAX 0x7ffe
> >> +#define SWAP_MAP_BAD 0x7fff
> > >> +#define SWAP_HAS_CACHE 0x8000 /* There is a swap cache for this entry. */
> >
> > Why count, can't we use swp->flags?
> >
>
> Hmm? swap_map is just an "unsigned short" value per entry... sorry,
> I can't catch what you are referring to.
Sorry for the noise, the count is directly contained in swap_map[].
--
Balbir
* Re: [PATCH 2/4] modify swap_map and add SWAP_HAS_CACHE flag.
2009-05-28 5:19 ` [PATCH 2/4] modify swap_map and add SWAP_HAS_CACHE flag KAMEZAWA Hiroyuki
2009-05-30 6:10 ` Balbir Singh
@ 2009-06-01 7:04 ` Daisuke Nishimura
1 sibling, 0 replies; 19+ messages in thread
From: Daisuke Nishimura @ 2009-06-01 7:04 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, balbir, hugh.dickins, hannes, akpm,
Daisuke Nishimura
> BTW, I'm now testing (with swap-in/out and swap-on/off) [2/4] of this patch set.
I haven't hit any critical BUG in my tests over this weekend.
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Thanks,
Daisuke Nishimura.
On Thu, 28 May 2009 14:19:00 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> This is part of a patch series for fixing memcg's swap account leak. But, IMHO,
> it is not a bad patch even without memcg.
>
> Now, references to swap are counted by swap_map[], an array of unsigned short.
> There are 2 kinds of references to swap.
> - reference from a swap entry
> - reference from swap cache
> Then,
> - If there is a swap cache && swap's refcnt is 1, there is only the swap cache.
> (*) swapcount(entry) == 1 && find_get_page(swapper_space, entry) != NULL
>
> This counting logic has worked well for a long time. But considering that
> we cannot tell from swap_map[] whether there is a _real_ reference or not,
> the current usage of the counter is not very good.
>
> This patch adds a flag SWAP_HAS_CACHE and records whether a swap entry
> has a swap cache or not. This removes the -1 magic used in swapfile.c and
> helps avoid unnecessary find_get_page() calls.
>
> Changelog: v1->v2
> - fixed swapcache_prepare()'s return code.
> - changed swap_duplicate() to be a void function.
> - fixed a racy case in swapoff().
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> include/linux/swap.h | 14 ++-
> mm/swap_state.c | 5 +
> mm/swapfile.c | 203 ++++++++++++++++++++++++++++++++++++---------------
> 3 files changed, 160 insertions(+), 62 deletions(-)
>
> Index: new-trial-swapcount2/include/linux/swap.h
> ===================================================================
> --- new-trial-swapcount2.orig/include/linux/swap.h
> +++ new-trial-swapcount2/include/linux/swap.h
> @@ -129,9 +129,10 @@ enum {
>
> #define SWAP_CLUSTER_MAX 32
>
> -#define SWAP_MAP_MAX 0x7fff
> -#define SWAP_MAP_BAD 0x8000
> -
> +#define SWAP_MAP_MAX 0x7ffe
> +#define SWAP_MAP_BAD 0x7fff
> +#define SWAP_HAS_CACHE 0x8000 /* There is a swap cache for this entry. */
> +#define SWAP_COUNT_MASK (~SWAP_HAS_CACHE)
> /*
> * The in-memory structure used to track swap areas.
> */
> @@ -300,7 +301,7 @@ extern long total_swap_pages;
> extern void si_swapinfo(struct sysinfo *);
> extern swp_entry_t get_swap_page(void);
> extern swp_entry_t get_swap_page_of_type(int);
> -extern int swap_duplicate(swp_entry_t);
> +extern void swap_duplicate(swp_entry_t);
> extern int swapcache_prepare(swp_entry_t);
> extern int valid_swaphandles(swp_entry_t, unsigned long *);
> extern void swap_free(swp_entry_t);
> @@ -372,9 +373,12 @@ static inline void show_swap_cache_info(
> }
>
> #define free_swap_and_cache(swp) is_migration_entry(swp)
> -#define swap_duplicate(swp) is_migration_entry(swp)
> #define swapcache_prepare(swp) is_migration_entry(swp)
>
> +static inline void swap_duplicate(swp_entry_t swp)
> +{
> +}
> +
> static inline void swap_free(swp_entry_t swp)
> {
> }
> Index: new-trial-swapcount2/mm/swapfile.c
> ===================================================================
> --- new-trial-swapcount2.orig/mm/swapfile.c
> +++ new-trial-swapcount2/mm/swapfile.c
> @@ -53,6 +53,26 @@ static struct swap_info_struct swap_info
>
> static DEFINE_MUTEX(swapon_mutex);
>
> +/* For reference count accounting in swap_map */
> +static inline int swap_count(unsigned short ent)
> +{
> + return ent & SWAP_COUNT_MASK;
> +}
> +
> +static inline int swap_has_cache(unsigned short ent)
> +{
> + return ent & SWAP_HAS_CACHE;
> +}
> +
> +static inline unsigned short make_swap_count(int count, int has_cache)
> +{
> + unsigned short ret = count;
> +
> + if (has_cache)
> + return SWAP_HAS_CACHE | ret;
> + return ret;
> +}
> +
> /*
> * We need this because the bdev->unplug_fn can sleep and we cannot
> * hold swap_lock while calling the unplug_fn. And swap_lock
> @@ -167,7 +187,8 @@ static int wait_for_discard(void *word)
> #define SWAPFILE_CLUSTER 256
> #define LATENCY_LIMIT 256
>
> -static inline unsigned long scan_swap_map(struct swap_info_struct *si)
> +static inline unsigned long scan_swap_map(struct swap_info_struct *si,
> + int cache)
> {
> unsigned long offset;
> unsigned long scan_base;
> @@ -285,7 +306,10 @@ checks:
> si->lowest_bit = si->max;
> si->highest_bit = 0;
> }
> - si->swap_map[offset] = 1;
> + if (cache) /* at usual swap-out via vmscan.c */
> + si->swap_map[offset] = make_swap_count(0, 1);
> + else /* at suspend */
> + si->swap_map[offset] = make_swap_count(1, 0);
> si->cluster_next = offset + 1;
> si->flags -= SWP_SCANNING;
>
> @@ -401,7 +425,8 @@ swp_entry_t get_swap_page(void)
> continue;
>
> swap_list.next = next;
> - offset = scan_swap_map(si);
> + /* This is called for allocating swap entry for cache */
> + offset = scan_swap_map(si, 1);
> if (offset) {
> spin_unlock(&swap_lock);
> return swp_entry(type, offset);
> @@ -415,6 +440,7 @@ noswap:
> return (swp_entry_t) {0};
> }
>
> +/* The only caller of this function is now the suspend routine */
> swp_entry_t get_swap_page_of_type(int type)
> {
> struct swap_info_struct *si;
> @@ -424,7 +450,8 @@ swp_entry_t get_swap_page_of_type(int ty
> si = swap_info + type;
> if (si->flags & SWP_WRITEOK) {
> nr_swap_pages--;
> - offset = scan_swap_map(si);
> + /* This is called for allocating swap entry, not cache */
> + offset = scan_swap_map(si, 0);
> if (offset) {
> spin_unlock(&swap_lock);
> return swp_entry(type, offset);
> @@ -471,25 +498,36 @@ out:
> return NULL;
> }
>
> -static int swap_entry_free(struct swap_info_struct *p, swp_entry_t ent)
> +static int swap_entry_free(struct swap_info_struct *p,
> + swp_entry_t ent, int cache)
> {
> unsigned long offset = swp_offset(ent);
> - int count = p->swap_map[offset];
> + int count = swap_count(p->swap_map[offset]);
> + int has_cache = swap_has_cache(p->swap_map[offset]);
>
> - if (count < SWAP_MAP_MAX) {
> - count--;
> - p->swap_map[offset] = count;
> - if (!count) {
> - if (offset < p->lowest_bit)
> - p->lowest_bit = offset;
> - if (offset > p->highest_bit)
> - p->highest_bit = offset;
> - if (p->prio > swap_info[swap_list.next].prio)
> - swap_list.next = p - swap_info;
> - nr_swap_pages++;
> - p->inuse_pages--;
> - mem_cgroup_uncharge_swap(ent);
> - }
> + if (!cache) { /* dropping usage count of swap */
> + if (count < SWAP_MAP_MAX) {
> + count--;
> + p->swap_map[offset] = make_swap_count(count, has_cache);
> + }
> + } else { /* dropping swap cache flag */
> + VM_BUG_ON(!has_cache);
> + p->swap_map[offset] = make_swap_count(count, 0);
> +
> + }
> + /* return code. */
> + count = p->swap_map[offset];
> + /* free if no reference */
> + if (!count) {
> + if (offset < p->lowest_bit)
> + p->lowest_bit = offset;
> + if (offset > p->highest_bit)
> + p->highest_bit = offset;
> + if (p->prio > swap_info[swap_list.next].prio)
> + swap_list.next = p - swap_info;
> + nr_swap_pages++;
> + p->inuse_pages--;
> + mem_cgroup_uncharge_swap(ent);
> }
> return count;
> }
> @@ -504,7 +542,7 @@ void swap_free(swp_entry_t entry)
>
> p = swap_info_get(entry);
> if (p) {
> - swap_entry_free(p, entry);
> + swap_entry_free(p, entry, 0);
> spin_unlock(&swap_lock);
> }
> }
> @@ -514,9 +552,16 @@ void swap_free(swp_entry_t entry)
> */
> void swapcache_free(swp_entry_t entry, struct page *page)
> {
> + struct swap_info_struct *p;
> +
> if (page)
> mem_cgroup_uncharge_swapcache(page, entry);
> - return swap_free(entry);
> + p = swap_info_get(entry);
> + if (p) {
> + swap_entry_free(p, entry, 1);
> + spin_unlock(&swap_lock);
> + }
> + return;
> }
>
> /*
> @@ -531,8 +576,7 @@ static inline int page_swapcount(struct
> entry.val = page_private(page);
> p = swap_info_get(entry);
> if (p) {
> - /* Subtract the 1 for the swap cache itself */
> - count = p->swap_map[swp_offset(entry)] - 1;
> + count = swap_count(p->swap_map[swp_offset(entry)]);
> spin_unlock(&swap_lock);
> }
> return count;
> @@ -594,7 +638,7 @@ int free_swap_and_cache(swp_entry_t entr
>
> p = swap_info_get(entry);
> if (p) {
> - if (swap_entry_free(p, entry) == 1) {
> + if (swap_entry_free(p, entry, 0) == SWAP_HAS_CACHE) {
> page = find_get_page(&swapper_space, entry.val);
> if (page && !trylock_page(page)) {
> page_cache_release(page);
> @@ -901,7 +945,7 @@ static unsigned int find_next_to_unuse(s
> i = 1;
> }
> count = si->swap_map[i];
> - if (count && count != SWAP_MAP_BAD)
> + if (count && swap_count(count) != SWAP_MAP_BAD)
> break;
> }
> return i;
> @@ -1005,13 +1049,13 @@ static int try_to_unuse(unsigned int typ
> */
> shmem = 0;
> swcount = *swap_map;
> - if (swcount > 1) {
> + if (swap_count(swcount)) {
> if (start_mm == &init_mm)
> shmem = shmem_unuse(entry, page);
> else
> retval = unuse_mm(start_mm, entry, page);
> }
> - if (*swap_map > 1) {
> + if (swap_count(*swap_map)) {
> int set_start_mm = (*swap_map >= swcount);
> struct list_head *p = &start_mm->mmlist;
> struct mm_struct *new_start_mm = start_mm;
> @@ -1021,7 +1065,7 @@ static int try_to_unuse(unsigned int typ
> atomic_inc(&new_start_mm->mm_users);
> atomic_inc(&prev_mm->mm_users);
> spin_lock(&mmlist_lock);
> - while (*swap_map > 1 && !retval && !shmem &&
> + while (swap_count(*swap_map) && !retval && !shmem &&
> (p = p->next) != &start_mm->mmlist) {
> mm = list_entry(p, struct mm_struct, mmlist);
> if (!atomic_inc_not_zero(&mm->mm_users))
> @@ -1033,14 +1077,16 @@ static int try_to_unuse(unsigned int typ
> cond_resched();
>
> swcount = *swap_map;
> - if (swcount <= 1)
> + if (!swap_count(swcount)) /* any usage ? */
> ;
> else if (mm == &init_mm) {
> set_start_mm = 1;
> shmem = shmem_unuse(entry, page);
> } else
> retval = unuse_mm(mm, entry, page);
> - if (set_start_mm && *swap_map < swcount) {
> +
> + if (set_start_mm &&
> + swap_count(*swap_map) < swcount) {
> mmput(new_start_mm);
> atomic_inc(&mm->mm_users);
> new_start_mm = mm;
> @@ -1067,21 +1113,25 @@ static int try_to_unuse(unsigned int typ
> }
>
> /*
> - * How could swap count reach 0x7fff when the maximum
> - * pid is 0x7fff, and there's no way to repeat a swap
> - * page within an mm (except in shmem, where it's the
> - * shared object which takes the reference count)?
> - * We believe SWAP_MAP_MAX cannot occur in Linux 2.4.
> - *
> + * How could the swap count reach 0x7ffe?
> + * There's no way to repeat a swap page within an mm
> + * (except in shmem, where it's the shared object which takes
> + * the reference count).
> + * We believe SWAP_MAP_MAX cannot occur. (If it does, unsigned
> + * short is too small....)
> * If that's wrong, then we should worry more about
> * exit_mmap() and do_munmap() cases described above:
> * we might be resetting SWAP_MAP_MAX too early here.
> * We know "Undead"s can happen, they're okay, so don't
> * report them; but do report if we reset SWAP_MAP_MAX.
> */
> - if (*swap_map == SWAP_MAP_MAX) {
> + /* We might release the lock_page() in unuse_mm(). */
> + if (!PageSwapCache(page) || page_private(page) != entry.val)
> + goto retry;
> +
> + if (swap_count(*swap_map) == SWAP_MAP_MAX) {
> spin_lock(&swap_lock);
> - *swap_map = 1;
> + *swap_map = make_swap_count(0, 1);
> spin_unlock(&swap_lock);
> reset_overflow = 1;
> }
> @@ -1099,7 +1149,8 @@ static int try_to_unuse(unsigned int typ
> * pages would be incorrect if swap supported "shared
> * private" pages, but they are handled by tmpfs files.
> */
> - if ((*swap_map > 1) && PageDirty(page) && PageSwapCache(page)) {
> + if (swap_count(*swap_map) &&
> + PageDirty(page) && PageSwapCache(page)) {
> struct writeback_control wbc = {
> .sync_mode = WB_SYNC_NONE,
> };
> @@ -1126,6 +1177,7 @@ static int try_to_unuse(unsigned int typ
> * mark page dirty so shrink_page_list will preserve it.
> */
> SetPageDirty(page);
> +retry:
> unlock_page(page);
> page_cache_release(page);
>
> @@ -1952,15 +2004,22 @@ void si_swapinfo(struct sysinfo *val)
> *
> * Note: if swap_map[] reaches SWAP_MAP_MAX the entries are treated as
> * "permanent", but will be reclaimed by the next swapoff.
> + * Returns error code in following case.
> + * - success -> 0
> + * - swp_entry is invalid -> EINVAL
> + * - swp_entry is migration entry -> EINVAL
> + * - swap-cache reference is requested but there is already one. -> EEXIST
> + * - swap-cache reference is requested but the entry is not used. -> ENOENT
> */
> -int swap_duplicate(swp_entry_t entry)
> +static int __swap_duplicate(swp_entry_t entry, int cache)
> {
> struct swap_info_struct * p;
> unsigned long offset, type;
> - int result = 0;
> + int result = -EINVAL;
> + int count, has_cache;
>
> if (is_migration_entry(entry))
> - return 1;
> + return -EINVAL;
>
> type = swp_type(entry);
> if (type >= nr_swapfiles)
> @@ -1969,17 +2028,37 @@ int swap_duplicate(swp_entry_t entry)
> offset = swp_offset(entry);
>
> spin_lock(&swap_lock);
> - if (offset < p->max && p->swap_map[offset]) {
> - if (p->swap_map[offset] < SWAP_MAP_MAX - 1) {
> - p->swap_map[offset]++;
> - result = 1;
> - } else if (p->swap_map[offset] <= SWAP_MAP_MAX) {
> +
> + if (unlikely(offset >= p->max))
> + goto unlock_out;
> +
> + count = swap_count(p->swap_map[offset]);
> + has_cache = swap_has_cache(p->swap_map[offset]);
> + if (cache) { /* called for swapcache/swapin-readahead */
> + /* set SWAP_HAS_CACHE if there is no cache and entry is used */
> + if (!has_cache && count) {
> + p->swap_map[offset] = make_swap_count(count, 1);
> + result = 0;
> + } else if (has_cache)
> + result = -EEXIST;
> + else if (!count)
> + result = -ENOENT;
> + } else if (count || has_cache) {
> + if (count < SWAP_MAP_MAX - 1) {
> + p->swap_map[offset] = make_swap_count(count + 1,
> + has_cache);
> + result = 0;
> + } else if (count <= SWAP_MAP_MAX) {
> if (swap_overflow++ < 5)
> - printk(KERN_WARNING "swap_dup: swap entry overflow\n");
> - p->swap_map[offset] = SWAP_MAP_MAX;
> - result = 1;
> - }
> - }
> + printk(KERN_WARNING
> + "swap_dup: swap entry overflow\n");
> + p->swap_map[offset] = make_swap_count(SWAP_MAP_MAX,
> + has_cache);
> + result = 0;
> + }
> + } else
> + result = -ENOENT; /* unused swap entry */
> +unlock_out:
> spin_unlock(&swap_lock);
> out:
> return result;
> @@ -1988,13 +2067,25 @@ bad_file:
> printk(KERN_ERR "swap_dup: %s%08lx\n", Bad_file, entry.val);
> goto out;
> }
> +/*
> + * increase reference count of swap entry by 1.
> + */
> +void swap_duplicate(swp_entry_t entry)
> +{
> + __swap_duplicate(entry, 0);
> +}
>
> /*
> + * @entry: swap entry for which we allocate swap cache.
> + *
> + * Called when allocating swap cache for existing swap entry,
> + * This can return error codes. Returns 0 at success.
> + * -EEXIST means there is already a swap cache.
> + * Note: return code is different from swap_duplicate().
> */
> int swapcache_prepare(swp_entry_t entry)
> {
> - return swap_duplicate(entry);
> + return __swap_duplicate(entry, 1);
> }
>
>
> @@ -2035,7 +2126,7 @@ int valid_swaphandles(swp_entry_t entry,
> /* Don't read in free or bad pages */
> if (!si->swap_map[toff])
> break;
> - if (si->swap_map[toff] == SWAP_MAP_BAD)
> + if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD)
> break;
> }
> /* Count contiguous allocated slots below our target */
> @@ -2043,7 +2134,7 @@ int valid_swaphandles(swp_entry_t entry,
> /* Don't read in free or bad pages */
> if (!si->swap_map[toff])
> break;
> - if (si->swap_map[toff] == SWAP_MAP_BAD)
> + if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD)
> break;
> }
> spin_unlock(&swap_lock);
> Index: new-trial-swapcount2/mm/swap_state.c
> ===================================================================
> --- new-trial-swapcount2.orig/mm/swap_state.c
> +++ new-trial-swapcount2/mm/swap_state.c
> @@ -292,7 +292,10 @@ struct page *read_swap_cache_async(swp_e
> /*
> * Swap entry may have been freed since our caller observed it.
> */
> - if (!swapcache_prepare(entry))
> + err = swapcache_prepare(entry);
> + if (err == -EEXIST) /* seems racy */
> + continue;
> + if (err) /* swp entry is obsolete ? */
> break;
>
> /*
>
Thread overview: 19+ messages
2009-05-28 4:54 [PATCH 0/4] memcg fix swap accounting (28/May) KAMEZAWA Hiroyuki
2009-05-28 5:10 ` [PATCH 1/4] add swap cache interface for swap reference KAMEZAWA Hiroyuki
2009-05-29 4:21 ` Daisuke Nishimura
2009-05-29 5:08 ` KAMEZAWA Hiroyuki
2009-05-29 5:37 ` [PATCH 1/4] add swap cache interface for swap reference v2 (updated) KAMEZAWA Hiroyuki
2009-05-29 6:05 ` Daisuke Nishimura
2009-05-29 6:53 ` KAMEZAWA Hiroyuki
2009-05-30 5:21 ` Balbir Singh
2009-05-28 5:19 ` [PATCH 2/4] modify swap_map and add SWAP_HAS_CACHE flag KAMEZAWA Hiroyuki
2009-05-30 6:10 ` Balbir Singh
2009-05-30 11:16 ` KAMEZAWA Hiroyuki
2009-05-30 11:35 ` Balbir Singh
2009-06-01 7:04 ` Daisuke Nishimura
2009-05-28 5:20 ` [PATCH 3/4] reuse unused swap entry if necessary KAMEZAWA Hiroyuki
2009-05-29 21:55 ` Andrew Morton
2009-05-30 11:11 ` KAMEZAWA Hiroyuki
2009-05-30 6:40 ` Balbir Singh
2009-05-28 5:21 ` [PATCH 4/4] memcg: fix swap accounting KAMEZAWA Hiroyuki
2009-05-30 7:20 ` Balbir Singh