* [RFC][PATCH 0/2] mm: some patches about add_to_swap_cache()
@ 2009-08-10 2:23 Daisuke Nishimura
2009-08-10 2:26 ` [BUGFIX][1/2] mm: add_to_swap_cache() must not sleep Daisuke Nishimura
2009-08-10 2:27 ` [cleanup][2/2] mm: add_to_swap_cache() does not return -EEXIST Daisuke Nishimura
0 siblings, 2 replies; 8+ messages in thread
From: Daisuke Nishimura @ 2009-08-10 2:23 UTC (permalink / raw)
To: LKML, linux-mm
Cc: Andrew Morton, Balbir Singh, Hugh Dickins, KAMEZAWA Hiroyuki,
Johannes Weiner, Daisuke Nishimura
These are patches about add_to_swap_cache(), related to
the commit 355cfa73(mm: modify swap_map and add SWAP_HAS_CACHE flag).
[BUGFIX][1/2] mm: add_to_swap_cache() must not sleep
[cleanup][2/2] mm: add_to_swap_cache() does not return -EEXIST
These are based on 2.6.31-rc5, but can be applied onto mmotm.
Any comments or suggestions would be welcome.
Thanks,
Daisuke Nishimura.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 8+ messages in thread
* [BUGFIX][1/2] mm: add_to_swap_cache() must not sleep
2009-08-10 2:23 [RFC][PATCH 0/2] mm: some patches about add_to_swap_cache() Daisuke Nishimura
@ 2009-08-10 2:26 ` Daisuke Nishimura
2009-08-10 3:16 ` KAMEZAWA Hiroyuki
2009-08-10 2:27 ` [cleanup][2/2] mm: add_to_swap_cache() does not return -EEXIST Daisuke Nishimura
1 sibling, 1 reply; 8+ messages in thread
From: Daisuke Nishimura @ 2009-08-10 2:26 UTC (permalink / raw)
To: LKML, linux-mm
Cc: Andrew Morton, Balbir Singh, Hugh Dickins, KAMEZAWA Hiroyuki,
Johannes Weiner, Daisuke Nishimura
After commit 355cfa73(mm: modify swap_map and add SWAP_HAS_CACHE flag),
read_swap_cache_async() will busy-wait while a entry doesn't on swap cache
but it has SWAP_HAS_CACHE flag.
Such entries can exist on add/delete path of swap cache.
On add path, add_to_swap_cache() is called soon after SWAP_HAS_CACHE flag
is set, and on delete path, swapcache_free() will be called (SWAP_HAS_CACHE
flag is cleared) soon after __delete_from_swap_cache() is called.
So, the busy-wait works well in most cases.
But this mechanism can cause soft lockup if add_to_swap_cache() sleeps
and read_swap_cache_async() tries to swap-in the same entry on the same cpu.
add_to_swap() and shmem_writepage() call add_to_swap_cache() w/o __GFP_WAIT,
but read_swap_cache_async() can call it w/ __GFP_WAIT, so it can cause
soft lockup.
This patch changes the gfp_mask of add_to_swap_cache() in read_swap_cache_async().
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
mm/swap_state.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 42cd38e..3e6dd72 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -76,6 +76,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageSwapCache(page));
VM_BUG_ON(!PageSwapBacked(page));
+ VM_BUG_ON(gfp_mask & __GFP_WAIT);
error = radix_tree_preload(gfp_mask);
if (!error) {
@@ -307,7 +308,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
*/
__set_page_locked(new_page);
SetPageSwapBacked(new_page);
- err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
+ err = add_to_swap_cache(new_page, entry, GFP_ATOMIC);
if (likely(!err)) {
/*
* Initiate read into locked page and return.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 8+ messages in thread
* [cleanup][2/2] mm: add_to_swap_cache() does not return -EEXIST
2009-08-10 2:23 [RFC][PATCH 0/2] mm: some patches about add_to_swap_cache() Daisuke Nishimura
2009-08-10 2:26 ` [BUGFIX][1/2] mm: add_to_swap_cache() must not sleep Daisuke Nishimura
@ 2009-08-10 2:27 ` Daisuke Nishimura
2009-08-10 3:19 ` KAMEZAWA Hiroyuki
2009-08-14 6:16 ` Daisuke Nishimura
1 sibling, 2 replies; 8+ messages in thread
From: Daisuke Nishimura @ 2009-08-10 2:27 UTC (permalink / raw)
To: LKML, linux-mm
Cc: Andrew Morton, Balbir Singh, Hugh Dickins, KAMEZAWA Hiroyuki,
Johannes Weiner, Daisuke Nishimura
After commit 355cfa73(mm: modify swap_map and add SWAP_HAS_CACHE flag),
only the context which have set SWAP_HAS_CACHE flag by swapcache_prepare()
or get_swap_page() would call add_to_swap_cache().
So add_to_swap_cache() doesn't return -EEXIST any more.
Even though it doesn't return -EEXIST, it's not a good behavior conceptually
to call swapcache_prepare() in -EEXIST case, because it means clearing
SWAP_HAS_CACHE flag while the entry is on swap cache.
This patch removes redundant codes and comments from callers of it, and
adds VM_BUG_ON() in error path of add_to_swap_cache() and some comments.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
mm/shmem.c | 4 +++
mm/swap_state.c | 75 +++++++++++++++++++++++++++----------------------------
2 files changed, 41 insertions(+), 38 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index d713239..c71ac6c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1097,6 +1097,10 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
shmem_swp_unmap(entry);
unlock:
spin_unlock(&info->lock);
+ /*
+ * add_to_swap_cache() doesn't return -EEXIST, so we can safely
+ * clear SWAP_HAS_CACHE flag.
+ */
swapcache_free(swap, NULL);
redirty:
set_page_dirty(page);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3e6dd72..e891208 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -96,6 +96,12 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
radix_tree_preload_end();
if (unlikely(error)) {
+ /*
+ * Only the context which have set SWAP_HAS_CACHE flag
+ * would call add_to_swap_cache().
+ * So add_to_swap_cache() doesn't returns -EEXIST.
+ */
+ VM_BUG_ON(error == -EEXIST);
set_page_private(page, 0UL);
ClearPageSwapCache(page);
page_cache_release(page);
@@ -137,38 +143,34 @@ int add_to_swap(struct page *page)
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(!PageUptodate(page));
- for (;;) {
- entry = get_swap_page();
- if (!entry.val)
- return 0;
+ entry = get_swap_page();
+ if (!entry.val)
+ return 0;
+ /*
+ * Radix-tree node allocations from PF_MEMALLOC contexts could
+ * completely exhaust the page allocator. __GFP_NOMEMALLOC
+ * stops emergency reserves from being allocated.
+ *
+ * TODO: this could cause a theoretical memory reclaim
+ * deadlock in the swap out path.
+ */
+ /*
+ * Add it to the swap cache and mark it dirty
+ */
+ err = add_to_swap_cache(page, entry,
+ __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
+
+ if (!err) { /* Success */
+ SetPageDirty(page);
+ return 1;
+ } else { /* -ENOMEM radix-tree allocation failure */
/*
- * Radix-tree node allocations from PF_MEMALLOC contexts could
- * completely exhaust the page allocator. __GFP_NOMEMALLOC
- * stops emergency reserves from being allocated.
- *
- * TODO: this could cause a theoretical memory reclaim
- * deadlock in the swap out path.
- */
- /*
- * Add it to the swap cache and mark it dirty
+ * add_to_swap_cache() doesn't return -EEXIST, so we can safely
+ * clear SWAP_HAS_CACHE flag.
*/
- err = add_to_swap_cache(page, entry,
- __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
-
- switch (err) {
- case 0: /* Success */
- SetPageDirty(page);
- return 1;
- case -EEXIST:
- /* Raced with "speculative" read_swap_cache_async */
- swapcache_free(entry, NULL);
- continue;
- default:
- /* -ENOMEM radix-tree allocation failure */
- swapcache_free(entry, NULL);
- return 0;
- }
+ swapcache_free(entry, NULL);
+ return 0;
}
}
@@ -298,14 +300,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
if (err) /* swp entry is obsolete ? */
break;
- /*
- * Associate the page with swap entry in the swap cache.
- * May fail (-EEXIST) if there is already a page associated
- * with this entry in the swap cache: added by a racing
- * read_swap_cache_async, or add_to_swap or shmem_writepage
- * re-using the just freed swap entry for an existing page.
- * May fail (-ENOMEM) if radix-tree node allocation failed.
- */
+ /* May fail (-ENOMEM) if radix-tree node allocation failed. */
__set_page_locked(new_page);
SetPageSwapBacked(new_page);
err = add_to_swap_cache(new_page, entry, GFP_ATOMIC);
@@ -319,6 +314,10 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
}
ClearPageSwapBacked(new_page);
__clear_page_locked(new_page);
+ /*
+ * add_to_swap_cache() doesn't return -EEXIST, so we can safely
+ * clear SWAP_HAS_CACHE flag.
+ */
swapcache_free(entry, NULL);
} while (err != -ENOMEM);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [BUGFIX][1/2] mm: add_to_swap_cache() must not sleep
2009-08-10 2:26 ` [BUGFIX][1/2] mm: add_to_swap_cache() must not sleep Daisuke Nishimura
@ 2009-08-10 3:16 ` KAMEZAWA Hiroyuki
2009-08-10 5:49 ` Daisuke Nishimura
0 siblings, 1 reply; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-10 3:16 UTC (permalink / raw)
To: Daisuke Nishimura
Cc: LKML, linux-mm, Andrew Morton, Balbir Singh, Hugh Dickins,
Johannes Weiner
On Mon, 10 Aug 2009 11:26:41 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> After commit 355cfa73(mm: modify swap_map and add SWAP_HAS_CACHE flag),
> read_swap_cache_async() will busy-wait while a entry doesn't on swap cache
> but it has SWAP_HAS_CACHE flag.
>
> Such entries can exist on add/delete path of swap cache.
> On add path, add_to_swap_cache() is called soon after SWAP_HAS_CACHE flag
> is set, and on delete path, swapcache_free() will be called (SWAP_HAS_CACHE
> flag is cleared) soon after __delete_from_swap_cache() is called.
> So, the busy-wait works well in most cases.
>
yes.
> But this mechanism can cause soft lockup if add_to_swap_cache() sleeps
> and read_swap_cache_async() tries to swap-in the same entry on the same cpu.
>
Hmm..
> add_to_swap() and shmem_writepage() call add_to_swap_cache() w/o __GFP_WAIT,
> but read_swap_cache_async() can call it w/ __GFP_WAIT, so it can cause
> soft lockup.
>
> This patch changes the gfp_mask of add_to_swap_cache() in read_swap_cache_async().
>
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Thank you for catching.
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
But Hm...I wonder whether this is the best fix.
If I was you, I may do following.
1. remove radix_tree_preload() and gfp_mask from add_to_swapcache().
Then, rename it fo __add_to_swapcache().
Or, move swap_duplicate() into add_to_swapcache() with a new flag.
2. do things in following order.
radix_tree_peload();
swap_duplicate(); # this never sleeps.
add_to_swapcache()
radix_tree_peload_end();
Good point of this approach is
- we can use __GFP_WAIT in gfp_mask.
- -ENOMEM means OOM, then, we should be aggressive to get a page.
How do you think ?
Thanks,
-Kame
> ---
> mm/swap_state.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 42cd38e..3e6dd72 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -76,6 +76,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
> VM_BUG_ON(!PageLocked(page));
> VM_BUG_ON(PageSwapCache(page));
> VM_BUG_ON(!PageSwapBacked(page));
> + VM_BUG_ON(gfp_mask & __GFP_WAIT);
>
> error = radix_tree_preload(gfp_mask);
> if (!error) {
> @@ -307,7 +308,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> */
> __set_page_locked(new_page);
> SetPageSwapBacked(new_page);
> - err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> + err = add_to_swap_cache(new_page, entry, GFP_ATOMIC);
> if (likely(!err)) {
> /*
> * Initiate read into locked page and return.
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [cleanup][2/2] mm: add_to_swap_cache() does not return -EEXIST
2009-08-10 2:27 ` [cleanup][2/2] mm: add_to_swap_cache() does not return -EEXIST Daisuke Nishimura
@ 2009-08-10 3:19 ` KAMEZAWA Hiroyuki
2009-08-14 6:16 ` Daisuke Nishimura
1 sibling, 0 replies; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-10 3:19 UTC (permalink / raw)
To: Daisuke Nishimura
Cc: LKML, linux-mm, Andrew Morton, Balbir Singh, Hugh Dickins,
Johannes Weiner
On Mon, 10 Aug 2009 11:27:16 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> After commit 355cfa73(mm: modify swap_map and add SWAP_HAS_CACHE flag),
> only the context which have set SWAP_HAS_CACHE flag by swapcache_prepare()
> or get_swap_page() would call add_to_swap_cache().
> So add_to_swap_cache() doesn't return -EEXIST any more.
>
> Even though it doesn't return -EEXIST, it's not a good behavior conceptually
> to call swapcache_prepare() in -EEXIST case, because it means clearing
> SWAP_HAS_CACHE flag while the entry is on swap cache.
>
> This patch removes redundant codes and comments from callers of it, and
> adds VM_BUG_ON() in error path of add_to_swap_cache() and some comments.
>
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Nice! I've postponed this ;(
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/shmem.c | 4 +++
> mm/swap_state.c | 75 +++++++++++++++++++++++++++----------------------------
> 2 files changed, 41 insertions(+), 38 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index d713239..c71ac6c 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1097,6 +1097,10 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
> shmem_swp_unmap(entry);
> unlock:
> spin_unlock(&info->lock);
> + /*
> + * add_to_swap_cache() doesn't return -EEXIST, so we can safely
> + * clear SWAP_HAS_CACHE flag.
> + */
> swapcache_free(swap, NULL);
> redirty:
> set_page_dirty(page);
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3e6dd72..e891208 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -96,6 +96,12 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
> radix_tree_preload_end();
>
> if (unlikely(error)) {
> + /*
> + * Only the context which have set SWAP_HAS_CACHE flag
> + * would call add_to_swap_cache().
> + * So add_to_swap_cache() doesn't returns -EEXIST.
> + */
> + VM_BUG_ON(error == -EEXIST);
> set_page_private(page, 0UL);
> ClearPageSwapCache(page);
> page_cache_release(page);
> @@ -137,38 +143,34 @@ int add_to_swap(struct page *page)
> VM_BUG_ON(!PageLocked(page));
> VM_BUG_ON(!PageUptodate(page));
>
> - for (;;) {
> - entry = get_swap_page();
> - if (!entry.val)
> - return 0;
> + entry = get_swap_page();
> + if (!entry.val)
> + return 0;
>
> + /*
> + * Radix-tree node allocations from PF_MEMALLOC contexts could
> + * completely exhaust the page allocator. __GFP_NOMEMALLOC
> + * stops emergency reserves from being allocated.
> + *
> + * TODO: this could cause a theoretical memory reclaim
> + * deadlock in the swap out path.
> + */
> + /*
> + * Add it to the swap cache and mark it dirty
> + */
> + err = add_to_swap_cache(page, entry,
> + __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
> +
> + if (!err) { /* Success */
> + SetPageDirty(page);
> + return 1;
> + } else { /* -ENOMEM radix-tree allocation failure */
> /*
> - * Radix-tree node allocations from PF_MEMALLOC contexts could
> - * completely exhaust the page allocator. __GFP_NOMEMALLOC
> - * stops emergency reserves from being allocated.
> - *
> - * TODO: this could cause a theoretical memory reclaim
> - * deadlock in the swap out path.
> - */
> - /*
> - * Add it to the swap cache and mark it dirty
> + * add_to_swap_cache() doesn't return -EEXIST, so we can safely
> + * clear SWAP_HAS_CACHE flag.
> */
> - err = add_to_swap_cache(page, entry,
> - __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
> -
> - switch (err) {
> - case 0: /* Success */
> - SetPageDirty(page);
> - return 1;
> - case -EEXIST:
> - /* Raced with "speculative" read_swap_cache_async */
> - swapcache_free(entry, NULL);
> - continue;
> - default:
> - /* -ENOMEM radix-tree allocation failure */
> - swapcache_free(entry, NULL);
> - return 0;
> - }
> + swapcache_free(entry, NULL);
> + return 0;
> }
> }
>
> @@ -298,14 +300,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> if (err) /* swp entry is obsolete ? */
> break;
>
> - /*
> - * Associate the page with swap entry in the swap cache.
> - * May fail (-EEXIST) if there is already a page associated
> - * with this entry in the swap cache: added by a racing
> - * read_swap_cache_async, or add_to_swap or shmem_writepage
> - * re-using the just freed swap entry for an existing page.
> - * May fail (-ENOMEM) if radix-tree node allocation failed.
> - */
> + /* May fail (-ENOMEM) if radix-tree node allocation failed. */
> __set_page_locked(new_page);
> SetPageSwapBacked(new_page);
> err = add_to_swap_cache(new_page, entry, GFP_ATOMIC);
> @@ -319,6 +314,10 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> }
> ClearPageSwapBacked(new_page);
> __clear_page_locked(new_page);
> + /*
> + * add_to_swap_cache() doesn't return -EEXIST, so we can safely
> + * clear SWAP_HAS_CACHE flag.
> + */
> swapcache_free(entry, NULL);
> } while (err != -ENOMEM);
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [BUGFIX][1/2] mm: add_to_swap_cache() must not sleep
2009-08-10 3:16 ` KAMEZAWA Hiroyuki
@ 2009-08-10 5:49 ` Daisuke Nishimura
2009-08-10 5:58 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 8+ messages in thread
From: Daisuke Nishimura @ 2009-08-10 5:49 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: LKML, linux-mm, Andrew Morton, Balbir Singh, Hugh Dickins,
Johannes Weiner, Daisuke Nishimura
On Mon, 10 Aug 2009 12:16:44 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 10 Aug 2009 11:26:41 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>
> > After commit 355cfa73(mm: modify swap_map and add SWAP_HAS_CACHE flag),
> > read_swap_cache_async() will busy-wait while a entry doesn't on swap cache
> > but it has SWAP_HAS_CACHE flag.
> >
> > Such entries can exist on add/delete path of swap cache.
> > On add path, add_to_swap_cache() is called soon after SWAP_HAS_CACHE flag
> > is set, and on delete path, swapcache_free() will be called (SWAP_HAS_CACHE
> > flag is cleared) soon after __delete_from_swap_cache() is called.
> > So, the busy-wait works well in most cases.
> >
> yes.
>
> > But this mechanism can cause soft lockup if add_to_swap_cache() sleeps
> > and read_swap_cache_async() tries to swap-in the same entry on the same cpu.
> >
> Hmm..
>
> > add_to_swap() and shmem_writepage() call add_to_swap_cache() w/o __GFP_WAIT,
> > but read_swap_cache_async() can call it w/ __GFP_WAIT, so it can cause
> > soft lockup.
> >
> > This patch changes the gfp_mask of add_to_swap_cache() in read_swap_cache_async().
> >
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
>
> Thank you for catching.
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> But Hm...I wonder whether this is the best fix.
>
> If I was you, I may do following.
>
> 1. remove radix_tree_preload() and gfp_mask from add_to_swapcache().
> Then, rename it fo __add_to_swapcache().
> Or, move swap_duplicate() into add_to_swapcache() with a new flag.
>
> 2. do things in following order.
>
> radix_tree_peload();
> swap_duplicate(); # this never sleeps.
> add_to_swapcache()
> radix_tree_peload_end();
>
> Good point of this approach is
> - we can use __GFP_WAIT in gfp_mask.
> - -ENOMEM means OOM, then, we should be aggressive to get a page.
>
> How do you think ?
>
Thank you for your suggestion. It's a good idea.
How about this one (Passed build test only) ?
If it's good for you, I'll resend it removing RFC after a long term test,
unless someone else tell me not to.
===
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
After commit 355cfa73(mm: modify swap_map and add SWAP_HAS_CACHE flag),
read_swap_cache_async() will busy-wait while a entry doesn't on swap cache
but it has SWAP_HAS_CACHE flag.
Such entries can exist on add/delete path of swap cache.
On add path, add_to_swap_cache() is called soon after SWAP_HAS_CACHE flag
is set, and on delete path, swapcache_free() will be called (SWAP_HAS_CACHE
flag is cleared) soon after __delete_from_swap_cache() is called.
So, the busy-wait works well in most cases.
But this mechanism can cause soft lockup if add_to_swap_cache() sleeps
and read_swap_cache_async() tries to swap-in the same entry on the same cpu.
This patch calls radix_tree_preload() before swapcache_prepare() and divide
add_to_swap_cache() into two part: radix_tree_preload() part and
radix_tree_insert() part(define it as __add_to_swap_cache()).
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
mm/swap_state.c | 70 ++++++++++++++++++++++++++++++++++++-------------------
1 files changed, 46 insertions(+), 24 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 42cd38e..0313a13 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -66,10 +66,10 @@ void show_swap_cache_info(void)
}
/*
- * add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
+ * __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
* but sets SwapCache flag and private instead of mapping and index.
*/
-int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
+static int __add_to_swap_cache(struct page *page, swp_entry_t entry)
{
int error;
@@ -77,28 +77,37 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
VM_BUG_ON(PageSwapCache(page));
VM_BUG_ON(!PageSwapBacked(page));
+ page_cache_get(page);
+ SetPageSwapCache(page);
+ set_page_private(page, entry.val);
+
+ spin_lock_irq(&swapper_space.tree_lock);
+ error = radix_tree_insert(&swapper_space.page_tree, entry.val, page);
+ if (likely(!error)) {
+ total_swapcache_pages++;
+ __inc_zone_page_state(page, NR_FILE_PAGES);
+ INC_CACHE_INFO(add_total);
+ }
+ spin_unlock_irq(&swapper_space.tree_lock);
+
+ if (unlikely(error)) {
+ set_page_private(page, 0UL);
+ ClearPageSwapCache(page);
+ page_cache_release(page);
+ }
+
+ return error;
+}
+
+
+int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
+{
+ int error;
+
error = radix_tree_preload(gfp_mask);
if (!error) {
- page_cache_get(page);
- SetPageSwapCache(page);
- set_page_private(page, entry.val);
-
- spin_lock_irq(&swapper_space.tree_lock);
- error = radix_tree_insert(&swapper_space.page_tree,
- entry.val, page);
- if (likely(!error)) {
- total_swapcache_pages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
- INC_CACHE_INFO(add_total);
- }
- spin_unlock_irq(&swapper_space.tree_lock);
+ error = __add_to_swap_cache(page, entry);
radix_tree_preload_end();
-
- if (unlikely(error)) {
- set_page_private(page, 0UL);
- ClearPageSwapCache(page);
- page_cache_release(page);
- }
}
return error;
}
@@ -289,13 +298,24 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
}
/*
+ * call radix_tree_preload() while we can wait.
+ */
+ err = radix_tree_preload(gfp_mask & GFP_KERNEL);
+ if (err)
+ break;
+
+ /*
* Swap entry may have been freed since our caller observed it.
*/
err = swapcache_prepare(entry);
- if (err == -EEXIST) /* seems racy */
+ if (err == -EEXIST) { /* seems racy */
+ radix_tree_preload_end();
continue;
- if (err) /* swp entry is obsolete ? */
+ }
+ if (err) { /* swp entry is obsolete ? */
+ radix_tree_preload_end();
break;
+ }
/*
* Associate the page with swap entry in the swap cache.
@@ -307,8 +327,9 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
*/
__set_page_locked(new_page);
SetPageSwapBacked(new_page);
- err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
+ err = __add_to_swap_cache(new_page, entry);
if (likely(!err)) {
+ radix_tree_preload_end();
/*
* Initiate read into locked page and return.
*/
@@ -316,6 +337,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
swap_readpage(new_page);
return new_page;
}
+ radix_tree_preload_end();
ClearPageSwapBacked(new_page);
__clear_page_locked(new_page);
swapcache_free(entry, NULL);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [BUGFIX][1/2] mm: add_to_swap_cache() must not sleep
2009-08-10 5:49 ` Daisuke Nishimura
@ 2009-08-10 5:58 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-10 5:58 UTC (permalink / raw)
To: Daisuke Nishimura
Cc: LKML, linux-mm, Andrew Morton, Balbir Singh, Hugh Dickins,
Johannes Weiner
On Mon, 10 Aug 2009 14:49:41 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> Thank you for your suggestion. It's a good idea.
>
> How about this one (Passed build test only) ?
>
> If it's good for you, I'll resend it removing RFC after a long term test,
> unless someone else tell me not to.
>
Seems very nice to me. I hope answers from swap-people is good..
Thanks,
-Kame
> ===
> From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
>
> After commit 355cfa73(mm: modify swap_map and add SWAP_HAS_CACHE flag),
> read_swap_cache_async() will busy-wait while a entry doesn't on swap cache
> but it has SWAP_HAS_CACHE flag.
>
> Such entries can exist on add/delete path of swap cache.
> On add path, add_to_swap_cache() is called soon after SWAP_HAS_CACHE flag
> is set, and on delete path, swapcache_free() will be called (SWAP_HAS_CACHE
> flag is cleared) soon after __delete_from_swap_cache() is called.
> So, the busy-wait works well in most cases.
>
> But this mechanism can cause soft lockup if add_to_swap_cache() sleeps
> and read_swap_cache_async() tries to swap-in the same entry on the same cpu.
>
> This patch calls radix_tree_preload() before swapcache_prepare() and divide
> add_to_swap_cache() into two part: radix_tree_preload() part and
> radix_tree_insert() part(define it as __add_to_swap_cache()).
>
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
> mm/swap_state.c | 70 ++++++++++++++++++++++++++++++++++++-------------------
> 1 files changed, 46 insertions(+), 24 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 42cd38e..0313a13 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -66,10 +66,10 @@ void show_swap_cache_info(void)
> }
>
> /*
> - * add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
> + * __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
> * but sets SwapCache flag and private instead of mapping and index.
> */
> -int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
> +static int __add_to_swap_cache(struct page *page, swp_entry_t entry)
> {
> int error;
>
> @@ -77,28 +77,37 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
> VM_BUG_ON(PageSwapCache(page));
> VM_BUG_ON(!PageSwapBacked(page));
>
> + page_cache_get(page);
> + SetPageSwapCache(page);
> + set_page_private(page, entry.val);
> +
> + spin_lock_irq(&swapper_space.tree_lock);
> + error = radix_tree_insert(&swapper_space.page_tree, entry.val, page);
> + if (likely(!error)) {
> + total_swapcache_pages++;
> + __inc_zone_page_state(page, NR_FILE_PAGES);
> + INC_CACHE_INFO(add_total);
> + }
> + spin_unlock_irq(&swapper_space.tree_lock);
> +
> + if (unlikely(error)) {
> + set_page_private(page, 0UL);
> + ClearPageSwapCache(page);
> + page_cache_release(page);
> + }
> +
> + return error;
> +}
> +
> +
> +int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
> +{
> + int error;
> +
> error = radix_tree_preload(gfp_mask);
> if (!error) {
> - page_cache_get(page);
> - SetPageSwapCache(page);
> - set_page_private(page, entry.val);
> -
> - spin_lock_irq(&swapper_space.tree_lock);
> - error = radix_tree_insert(&swapper_space.page_tree,
> - entry.val, page);
> - if (likely(!error)) {
> - total_swapcache_pages++;
> - __inc_zone_page_state(page, NR_FILE_PAGES);
> - INC_CACHE_INFO(add_total);
> - }
> - spin_unlock_irq(&swapper_space.tree_lock);
> + error = __add_to_swap_cache(page, entry);
> radix_tree_preload_end();
> -
> - if (unlikely(error)) {
> - set_page_private(page, 0UL);
> - ClearPageSwapCache(page);
> - page_cache_release(page);
> - }
> }
> return error;
> }
> @@ -289,13 +298,24 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> }
>
> /*
> + * call radix_tree_preload() while we can wait.
> + */
> + err = radix_tree_preload(gfp_mask & GFP_KERNEL);
> + if (err)
> + break;
> +
> + /*
> * Swap entry may have been freed since our caller observed it.
> */
> err = swapcache_prepare(entry);
> - if (err == -EEXIST) /* seems racy */
> + if (err == -EEXIST) { /* seems racy */
> + radix_tree_preload_end();
> continue;
> - if (err) /* swp entry is obsolete ? */
> + }
> + if (err) { /* swp entry is obsolete ? */
> + radix_tree_preload_end();
> break;
> + }
>
> /*
> * Associate the page with swap entry in the swap cache.
> @@ -307,8 +327,9 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> */
> __set_page_locked(new_page);
> SetPageSwapBacked(new_page);
> - err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> + err = __add_to_swap_cache(new_page, entry);
> if (likely(!err)) {
> + radix_tree_preload_end();
> /*
> * Initiate read into locked page and return.
> */
> @@ -316,6 +337,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> swap_readpage(new_page);
> return new_page;
> }
> + radix_tree_preload_end();
> ClearPageSwapBacked(new_page);
> __clear_page_locked(new_page);
> swapcache_free(entry, NULL);
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [cleanup][2/2] mm: add_to_swap_cache() does not return -EEXIST
2009-08-10 2:27 ` [cleanup][2/2] mm: add_to_swap_cache() does not return -EEXIST Daisuke Nishimura
2009-08-10 3:19 ` KAMEZAWA Hiroyuki
@ 2009-08-14 6:16 ` Daisuke Nishimura
1 sibling, 0 replies; 8+ messages in thread
From: Daisuke Nishimura @ 2009-08-14 6:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Balbir Singh, Hugh Dickins, KAMEZAWA Hiroyuki, Johannes Weiner,
Daisuke Nishimura, LKML, linux-mm
Andrew, this is a minor fix for mm-add_to_swap_cache-does-not-return-eexist.patch.
It didn't catch up with the change in [1/2](mm-add_to_swap_cache-must-not-sleep.patch).
===
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
fix indent size.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
mm/swap_state.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ff5bd8c..6d1daeb 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -92,12 +92,12 @@ static int __add_to_swap_cache(struct page *page, swp_entry_t entry)
spin_unlock_irq(&swapper_space.tree_lock);
if (unlikely(error)) {
- /*
- * Only the context which have set SWAP_HAS_CACHE flag
- * would call add_to_swap_cache().
- * So add_to_swap_cache() doesn't returns -EEXIST.
- */
- VM_BUG_ON(error == -EEXIST);
+ /*
+ * Only the context which have set SWAP_HAS_CACHE flag
+ * would call add_to_swap_cache().
+ * So add_to_swap_cache() doesn't returns -EEXIST.
+ */
+ VM_BUG_ON(error == -EEXIST);
set_page_private(page, 0UL);
ClearPageSwapCache(page);
page_cache_release(page);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2009-08-14 6:20 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-10 2:23 [RFC][PATCH 0/2] mm: some patches about add_to_swap_cache() Daisuke Nishimura
2009-08-10 2:26 ` [BUGFIX][1/2] mm: add_to_swap_cache() must not sleep Daisuke Nishimura
2009-08-10 3:16 ` KAMEZAWA Hiroyuki
2009-08-10 5:49 ` Daisuke Nishimura
2009-08-10 5:58 ` KAMEZAWA Hiroyuki
2009-08-10 2:27 ` [cleanup][2/2] mm: add_to_swap_cache() does not return -EEXIST Daisuke Nishimura
2009-08-10 3:19 ` KAMEZAWA Hiroyuki
2009-08-14 6:16 ` Daisuke Nishimura
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox