[PATCH 2/2] memcg: fix livelock in try charge during readahead

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 2/2] memcg: fix livelock in try charge during readahead
@ 2011-12-13  2:16 Ying Han
  2011-12-13  4:45 ` KAMEZAWA Hiroyuki
  2011-12-13  6:10 ` Minchan Kim
  0 siblings, 2 replies; 7+ messages in thread
From: Ying Han @ 2011-12-13  2:16 UTC (permalink / raw)
  To: Michal Hocko, Balbir Singh, Rik van Riel, Hugh Dickins,
	Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Pavel Emelyanov,
	Fengguang Wu, Greg Thelen
  Cc: linux-mm

Couple of kernel dumps are triggered by watchdog timeout. It turns out that two
processes within a memcg livelock on a same page lock. We believe this is not
memcg specific issue and the same livelock exists in non-memcg world as well.

The sequence of triggering the livelock:
1. Task_A enters pagefault (filemap_fault) and then starts readahead
filemap_fault
 -> do_sync_mmap_readahead
    -> ra_submit
       ->__do_page_cache_readahead // here we allocate the readahead pages
         ->read_pages
         ...
           ->add_to_page_cache_locked
             //for each page, we do the try charge and then add the page into
             //radix tree. If one of the try charge failed, it enters per-memcg
             //oom while holding the page lock of previous readahead pages.

            // in the memcg oom killer, it picks a task within the same memcg
            // and mark it TIF_MEMDIE. then it goes back into retry loop and
            // hopes the task exits to free some memory.

2. Task_B enters pagefault (filemap_fault) and finds the page in radix tree (
one of the readahead pages from ProcessA)

filemap_fault
 ->__lock_page // here it is marked as TIF_MEMDIE. but it can not proceed since
               // the page lock is hold by ProcessA looping at OOM.

Since the TIF_MEMDIE task_B is live locked, it ends up blocking other tasks
making forward progress since they are also checking the flag in
select_bad_process. The same issue exists in the non-memcg world. Instead of
entering oom through mem_cgroup_cache_charge(), we might enter it through
radix_tree_preload().

The proposed fix here is to pass __GFP_NORETRY gfp_mask into try charge under
readahead. Then we skip entering memcg OOM kill which eliminates the case where
it OOMs on one page and holds other page locks. It seems to be safe to do that
since both filemap_fault() and do_generic_file_read() handles the fallback case
of "no_cached_page".

Note:
After this patch, we might experience some charge fails for readahead pages
(since we don't enter oom). But this sounds sane compared to letting the system
trying extremely hard to charge a readahead page by doing reclaim and then oom,
the later one also triggers livelock as listed above.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Ying Han <yinghan@google.com>
---
 fs/mpage.c     |    3 ++-
 mm/readahead.c |    3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/mpage.c b/fs/mpage.c
index 643e9f5..90d608e 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -380,7 +380,8 @@ mpage_readpages(struct address_space *mapping, struct list_head *pages,
 		prefetchw(&page->flags);
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
-					page->index, GFP_KERNEL)) {
+					page->index,
+					GFP_KERNEL | __GFP_NORETRY)) {
 			bio = do_mpage_readpage(bio, page,
 					nr_pages - page_idx,
 					&last_block_in_bio, &map_bh,
diff --git a/mm/readahead.c b/mm/readahead.c
index cbcbb02..bc9431c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -126,7 +126,8 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 		struct page *page = list_to_page(pages);
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
-					page->index, GFP_KERNEL)) {
+					page->index,
+					GFP_KERNEL | __GFP_NORETRY)) {
 			mapping->a_ops->readpage(filp, page);
 		}
 		page_cache_release(page);
-- 
1.7.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] memcg: fix livelock in try charge during readahead
  2011-12-13  2:16 [PATCH 2/2] memcg: fix livelock in try charge during readahead Ying Han
@ 2011-12-13  4:45 ` KAMEZAWA Hiroyuki
  2011-12-13 17:59   ` Ying Han
  2011-12-13  6:10 ` Minchan Kim
  1 sibling, 1 reply; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-13  4:45 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Balbir Singh, Rik van Riel, Hugh Dickins,
	Johannes Weiner, Mel Gorman, Pavel Emelyanov, Fengguang Wu,
	Greg Thelen, linux-mm

On Mon, 12 Dec 2011 18:16:48 -0800
Ying Han <yinghan@google.com> wrote:

> Couple of kernel dumps are triggered by watchdog timeout. It turns out that two
> processes within a memcg livelock on a same page lock. We believe this is not
> memcg specific issue and the same livelock exists in non-memcg world as well.
> 
> The sequence of triggering the livelock:
> 1. Task_A enters pagefault (filemap_fault) and then starts readahead
> filemap_fault
>  -> do_sync_mmap_readahead
>     -> ra_submit
>        ->__do_page_cache_readahead // here we allocate the readahead pages
>          ->read_pages
>          ...
>            ->add_to_page_cache_locked
>              //for each page, we do the try charge and then add the page into
>              //radix tree. If one of the try charge failed, it enters per-memcg
>              //oom while holding the page lock of previous readahead pages.
> 
>             // in the memcg oom killer, it picks a task within the same memcg
>             // and mark it TIF_MEMDIE. then it goes back into retry loop and
>             // hopes the task exits to free some memory.
> 
> 2. Task_B enters pagefault (filemap_fault) and finds the page in radix tree (
> one of the readahead pages from ProcessA)
> 
> filemap_fault
>  ->__lock_page // here it is marked as TIF_MEMDIE. but it can not proceed since
>                // the page lock is hold by ProcessA looping at OOM.
> 

Should this __lock_page() be lock_page_killable() ?
Hmm, at seeing linux-next, it's now lock_page_or_retry() and FAULT_FLAG_KILLABLE
is set. why not killed immediately ?

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] memcg: fix livelock in try charge during readahead
  2011-12-13  2:16 [PATCH 2/2] memcg: fix livelock in try charge during readahead Ying Han
  2011-12-13  4:45 ` KAMEZAWA Hiroyuki
@ 2011-12-13  6:10 ` Minchan Kim
  2011-12-13 18:29   ` Ying Han
  1 sibling, 1 reply; 7+ messages in thread
From: Minchan Kim @ 2011-12-13  6:10 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Balbir Singh, Rik van Riel, Hugh Dickins,
	Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Pavel Emelyanov,
	Fengguang Wu, Greg Thelen, linux-mm

On Mon, Dec 12, 2011 at 06:16:48PM -0800, Ying Han wrote:
> Couple of kernel dumps are triggered by watchdog timeout. It turns out that two
> processes within a memcg livelock on a same page lock. We believe this is not
> memcg specific issue and the same livelock exists in non-memcg world as well.
> 
> The sequence of triggering the livelock:
> 1. Task_A enters pagefault (filemap_fault) and then starts readahead
> filemap_fault
>  -> do_sync_mmap_readahead
>     -> ra_submit
>        ->__do_page_cache_readahead // here we allocate the readahead pages
>          ->read_pages
>          ...
>            ->add_to_page_cache_locked
>              //for each page, we do the try charge and then add the page into
>              //radix tree. If one of the try charge failed, it enters per-memcg
>              //oom while holding the page lock of previous readahead pages.
> 
>             // in the memcg oom killer, it picks a task within the same memcg
>             // and mark it TIF_MEMDIE. then it goes back into retry loop and
>             // hopes the task exits to free some memory.
> 
> 2. Task_B enters pagefault (filemap_fault) and finds the page in radix tree (
> one of the readahead pages from ProcessA)
> 
> filemap_fault
>  ->__lock_page // here it is marked as TIF_MEMDIE. but it can not proceed since
>                // the page lock is hold by ProcessA looping at OOM.
> 
> Since the TIF_MEMDIE task_B is live locked, it ends up blocking other tasks
> making forward progress since they are also checking the flag in
> select_bad_process. The same issue exists in the non-memcg world. Instead of
> entering oom through mem_cgroup_cache_charge(), we might enter it through
> radix_tree_preload().
> 
> The proposed fix here is to pass __GFP_NORETRY gfp_mask into try charge under
> readahead. Then we skip entering memcg OOM kill which eliminates the case where
> it OOMs on one page and holds other page locks. It seems to be safe to do that
> since both filemap_fault() and do_generic_file_read() handles the fallback case
> of "no_cached_page".
> 
> Note:
> After this patch, we might experience some charge fails for readahead pages
> (since we don't enter oom). But this sounds sane compared to letting the system
> trying extremely hard to charge a readahead page by doing reclaim and then oom,
> the later one also triggers livelock as listed above.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Ying Han <yinghan@google.com>

Nice catch.

The concern is GFP_KERNEL != avoid OOM.
Although it works now, it can be changed.

With alternative idea, We can use explicit oom_killer_disable with __GFP_NOWARN
but it wouldn't work since oom_killer_disabled isn't reference count variable.
Of course, we can change it with reference-counted atomic variable.
The benefit is it's more explicit and doesn't depends on __GFP_NORETRY implementation.
So I don't have a good idea except above.
If you want __GFP_NORTRY patch, thing we can do best is add comment in detail, at least.
both side, here add_to_page_cache_lru and there __GFP_NORETRY in include/linux/gfp.h.

> ---
>  fs/mpage.c     |    3 ++-
>  mm/readahead.c |    3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 643e9f5..90d608e 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -380,7 +380,8 @@ mpage_readpages(struct address_space *mapping, struct list_head *pages,
>  		prefetchw(&page->flags);
>  		list_del(&page->lru);
>  		if (!add_to_page_cache_lru(page, mapping,
> -					page->index, GFP_KERNEL)) {
> +					page->index,
> +					GFP_KERNEL | __GFP_NORETRY)) {
>  			bio = do_mpage_readpage(bio, page,
>  					nr_pages - page_idx,
>  					&last_block_in_bio, &map_bh,
> diff --git a/mm/readahead.c b/mm/readahead.c
> index cbcbb02..bc9431c 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -126,7 +126,8 @@ static int read_pages(struct address_space *mapping, struct file *filp,
>  		struct page *page = list_to_page(pages);
>  		list_del(&page->lru);
>  		if (!add_to_page_cache_lru(page, mapping,
> -					page->index, GFP_KERNEL)) {
> +					page->index,
> +					GFP_KERNEL | __GFP_NORETRY)) {
>  			mapping->a_ops->readpage(filp, page);
>  		}
>  		page_cache_release(page);
> -- 
> 1.7.3.1
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] memcg: fix livelock in try charge during readahead
  2011-12-13  4:45 ` KAMEZAWA Hiroyuki
@ 2011-12-13 17:59   ` Ying Han
  0 siblings, 0 replies; 7+ messages in thread
From: Ying Han @ 2011-12-13 17:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, Balbir Singh, Rik van Riel, Hugh Dickins,
	Johannes Weiner, Mel Gorman, Pavel Emelyanov, Fengguang Wu,
	Greg Thelen, linux-mm

On Mon, Dec 12, 2011 at 8:45 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 12 Dec 2011 18:16:48 -0800
> Ying Han <yinghan@google.com> wrote:
>
>> Couple of kernel dumps are triggered by watchdog timeout. It turns out that two
>> processes within a memcg livelock on a same page lock. We believe this is not
>> memcg specific issue and the same livelock exists in non-memcg world as well.
>>
>> The sequence of triggering the livelock:
>> 1. Task_A enters pagefault (filemap_fault) and then starts readahead
>> filemap_fault
>>  -> do_sync_mmap_readahead
>>     -> ra_submit
>>        ->__do_page_cache_readahead // here we allocate the readahead pages
>>          ->read_pages
>>          ...
>>            ->add_to_page_cache_locked
>>              //for each page, we do the try charge and then add the page into
>>              //radix tree. If one of the try charge failed, it enters per-memcg
>>              //oom while holding the page lock of previous readahead pages.
>>
>>             // in the memcg oom killer, it picks a task within the same memcg
>>             // and mark it TIF_MEMDIE. then it goes back into retry loop and
>>             // hopes the task exits to free some memory.
>>
>> 2. Task_B enters pagefault (filemap_fault) and finds the page in radix tree (
>> one of the readahead pages from ProcessA)
>>
>> filemap_fault
>>  ->__lock_page // here it is marked as TIF_MEMDIE. but it can not proceed since
>>                // the page lock is hold by ProcessA looping at OOM.
>>
>
> Should this __lock_page() be lock_page_killable() ?
> Hmm, at seeing linux-next, it's now lock_page_or_retry() and FAULT_FLAG_KILLABLE
> is set. why not killed immediately ?

Hmm, thank you for pointing it out. It seems that we are missing the
following patch in the tree triggering the problem:

commit 37b23e0525d393d48a7d59f870b3bc061a30ccdb
Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date:   Tue May 24 17:11:30 2011 -0700

    x86,mm: make pagefault killable

    When an oom killing occurs, almost all processes are getting stuck at the
    following two points.

        1) __alloc_pages_nodemask
        2) __lock_page_or_retry

By eye-balling the linux-next including the patch above, we should be
able to avoid the live-lock by checking the fatal_signal_pending in
the page fault path.

--Ying

>
> Thanks,
> -Kame
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] memcg: fix livelock in try charge during readahead
  2011-12-13  6:10 ` Minchan Kim
@ 2011-12-13 18:29   ` Ying Han
  2011-12-14 10:53     ` Michal Hocko
  2011-12-14 11:31     ` Minchan Kim
  0 siblings, 2 replies; 7+ messages in thread
From: Ying Han @ 2011-12-13 18:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Balbir Singh, Rik van Riel, Hugh Dickins,
	Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Pavel Emelyanov,
	Fengguang Wu, Greg Thelen, linux-mm

On Mon, Dec 12, 2011 at 10:10 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Mon, Dec 12, 2011 at 06:16:48PM -0800, Ying Han wrote:
>> Couple of kernel dumps are triggered by watchdog timeout. It turns out that two
>> processes within a memcg livelock on a same page lock. We believe this is not
>> memcg specific issue and the same livelock exists in non-memcg world as well.
>>
>> The sequence of triggering the livelock:
>> 1. Task_A enters pagefault (filemap_fault) and then starts readahead
>> filemap_fault
>>  -> do_sync_mmap_readahead
>>     -> ra_submit
>>        ->__do_page_cache_readahead // here we allocate the readahead pages
>>          ->read_pages
>>          ...
>>            ->add_to_page_cache_locked
>>              //for each page, we do the try charge and then add the page into
>>              //radix tree. If one of the try charge failed, it enters per-memcg
>>              //oom while holding the page lock of previous readahead pages.
>>
>>             // in the memcg oom killer, it picks a task within the same memcg
>>             // and mark it TIF_MEMDIE. then it goes back into retry loop and
>>             // hopes the task exits to free some memory.
>>
>> 2. Task_B enters pagefault (filemap_fault) and finds the page in radix tree (
>> one of the readahead pages from ProcessA)
>>
>> filemap_fault
>>  ->__lock_page // here it is marked as TIF_MEMDIE. but it can not proceed since
>>                // the page lock is hold by ProcessA looping at OOM.
>>
>> Since the TIF_MEMDIE task_B is live locked, it ends up blocking other tasks
>> making forward progress since they are also checking the flag in
>> select_bad_process. The same issue exists in the non-memcg world. Instead of
>> entering oom through mem_cgroup_cache_charge(), we might enter it through
>> radix_tree_preload().
>>
>> The proposed fix here is to pass __GFP_NORETRY gfp_mask into try charge under
>> readahead. Then we skip entering memcg OOM kill which eliminates the case where
>> it OOMs on one page and holds other page locks. It seems to be safe to do that
>> since both filemap_fault() and do_generic_file_read() handles the fallback case
>> of "no_cached_page".
>>
>> Note:
>> After this patch, we might experience some charge fails for readahead pages
>> (since we don't enter oom). But this sounds sane compared to letting the system
>> trying extremely hard to charge a readahead page by doing reclaim and then oom,
>> the later one also triggers livelock as listed above.
>>
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> Signed-off-by: Ying Han <yinghan@google.com>
>
> Nice catch.
>
> The concern is GFP_KERNEL != avoid OOM.
> Although it works now, it can be changed.
>
> With alternative idea, We can use explicit oom_killer_disable with __GFP_NOWARN
> but it wouldn't work since oom_killer_disabled isn't reference count variable.
> Of course, we can change it with reference-counted atomic variable.
> The benefit is it's more explicit and doesn't depends on __GFP_NORETRY implementation.
> So I don't have a good idea except above.

> If you want __GFP_NORTRY patch, thing we can do best is add comment in detail, at least.
> both side, here add_to_page_cache_lru and there __GFP_NORETRY in include/linux/gfp.h.

Correct me in case i missed something, looks like I want to backport
the " x86,mm: make pagefault killable" patch, and we might be able to
solve the livelock w/o changing the readahead code.

Thanks

--Ying

>
>> ---
>>  fs/mpage.c     |    3 ++-
>>  mm/readahead.c |    3 ++-
>>  2 files changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/mpage.c b/fs/mpage.c
>> index 643e9f5..90d608e 100644
>> --- a/fs/mpage.c
>> +++ b/fs/mpage.c
>> @@ -380,7 +380,8 @@ mpage_readpages(struct address_space *mapping, struct list_head *pages,
>>               prefetchw(&page->flags);
>>               list_del(&page->lru);
>>               if (!add_to_page_cache_lru(page, mapping,
>> -                                     page->index, GFP_KERNEL)) {
>> +                                     page->index,
>> +                                     GFP_KERNEL | __GFP_NORETRY)) {
>>                       bio = do_mpage_readpage(bio, page,
>>                                       nr_pages - page_idx,
>>                                       &last_block_in_bio, &map_bh,
>> diff --git a/mm/readahead.c b/mm/readahead.c
>> index cbcbb02..bc9431c 100644
>> --- a/mm/readahead.c
>> +++ b/mm/readahead.c
>> @@ -126,7 +126,8 @@ static int read_pages(struct address_space *mapping, struct file *filp,
>>               struct page *page = list_to_page(pages);
>>               list_del(&page->lru);
>>               if (!add_to_page_cache_lru(page, mapping,
>> -                                     page->index, GFP_KERNEL)) {
>> +                                     page->index,
>> +                                     GFP_KERNEL | __GFP_NORETRY)) {
>>                       mapping->a_ops->readpage(filp, page);
>>               }
>>               page_cache_release(page);
>> --
>> 1.7.3.1
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
> --
> Kind regards,
> Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] memcg: fix livelock in try charge during readahead
  2011-12-13 18:29   ` Ying Han
@ 2011-12-14 10:53     ` Michal Hocko
  2011-12-14 11:31     ` Minchan Kim
  1 sibling, 0 replies; 7+ messages in thread
From: Michal Hocko @ 2011-12-14 10:53 UTC (permalink / raw)
  To: Ying Han
  Cc: Minchan Kim, Balbir Singh, Rik van Riel, Hugh Dickins,
	Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Pavel Emelyanov,
	Fengguang Wu, Greg Thelen, linux-mm

On Tue 13-12-11 10:29:34, Ying Han wrote:
> On Mon, Dec 12, 2011 at 10:10 PM, Minchan Kim <minchan@kernel.org> wrote:
> > On Mon, Dec 12, 2011 at 06:16:48PM -0800, Ying Han wrote:
> >> Couple of kernel dumps are triggered by watchdog timeout. It turns out that two
> >> processes within a memcg livelock on a same page lock. We believe this is not
> >> memcg specific issue and the same livelock exists in non-memcg world as well.
> >>
> >> The sequence of triggering the livelock:
> >> 1. Task_A enters pagefault (filemap_fault) and then starts readahead
> >> filemap_fault
> >>  -> do_sync_mmap_readahead
> >>     -> ra_submit
> >>        ->__do_page_cache_readahead // here we allocate the readahead pages
> >>          ->read_pages
> >>          ...
> >>            ->add_to_page_cache_locked
> >>              //for each page, we do the try charge and then add the page into
> >>              //radix tree. If one of the try charge failed, it enters per-memcg
> >>              //oom while holding the page lock of previous readahead pages.
> >>
> >>             // in the memcg oom killer, it picks a task within the same memcg
> >>             // and mark it TIF_MEMDIE. then it goes back into retry loop and
> >>             // hopes the task exits to free some memory.
> >>
> >> 2. Task_B enters pagefault (filemap_fault) and finds the page in radix tree (
> >> one of the readahead pages from ProcessA)
> >>
> >> filemap_fault
> >>  ->__lock_page // here it is marked as TIF_MEMDIE. but it can not proceed since
> >>                // the page lock is hold by ProcessA looping at OOM.
> >>
> >> Since the TIF_MEMDIE task_B is live locked, it ends up blocking other tasks
> >> making forward progress since they are also checking the flag in
> >> select_bad_process. The same issue exists in the non-memcg world. Instead of
> >> entering oom through mem_cgroup_cache_charge(), we might enter it through
> >> radix_tree_preload().
> >>
> >> The proposed fix here is to pass __GFP_NORETRY gfp_mask into try charge under
> >> readahead. Then we skip entering memcg OOM kill which eliminates the case where
> >> it OOMs on one page and holds other page locks. It seems to be safe to do that
> >> since both filemap_fault() and do_generic_file_read() handles the fallback case
> >> of "no_cached_page".
> >>
> >> Note:
> >> After this patch, we might experience some charge fails for readahead pages
> >> (since we don't enter oom). But this sounds sane compared to letting the system
> >> trying extremely hard to charge a readahead page by doing reclaim and then oom,
> >> the later one also triggers livelock as listed above.
> >>
> >> Signed-off-by: Greg Thelen <gthelen@google.com>
> >> Signed-off-by: Ying Han <yinghan@google.com>
> >
> > Nice catch.
> >
> > The concern is GFP_KERNEL != avoid OOM.
> > Although it works now, it can be changed.
> >
> > With alternative idea, We can use explicit oom_killer_disable with __GFP_NOWARN
> > but it wouldn't work since oom_killer_disabled isn't reference count variable.
> > Of course, we can change it with reference-counted atomic variable.
> > The benefit is it's more explicit and doesn't depends on __GFP_NORETRY implementation.
> > So I don't have a good idea except above.
> 
> > If you want __GFP_NORTRY patch, thing we can do best is add comment in detail, at least.
> > both side, here add_to_page_cache_lru and there __GFP_NORETRY in include/linux/gfp.h.
> 
> Correct me in case i missed something, looks like I want to backport
> the " x86,mm: make pagefault killable" patch, and we might be able to
> solve the livelock w/o changing the readahead code.

Yes, that seems correct to me. 

Thanks
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] memcg: fix livelock in try charge during readahead
  2011-12-13 18:29   ` Ying Han
  2011-12-14 10:53     ` Michal Hocko
@ 2011-12-14 11:31     ` Minchan Kim
  1 sibling, 0 replies; 7+ messages in thread
From: Minchan Kim @ 2011-12-14 11:31 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Balbir Singh, Rik van Riel, Hugh Dickins,
	Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Pavel Emelyanov,
	Fengguang Wu, Greg Thelen, linux-mm

On Wed, Dec 14, 2011 at 3:29 AM, Ying Han <yinghan@google.com> wrote:
> On Mon, Dec 12, 2011 at 10:10 PM, Minchan Kim <minchan@kernel.org> wrote:
>> On Mon, Dec 12, 2011 at 06:16:48PM -0800, Ying Han wrote:
>>> Couple of kernel dumps are triggered by watchdog timeout. It turns out that two
>>> processes within a memcg livelock on a same page lock. We believe this is not
>>> memcg specific issue and the same livelock exists in non-memcg world as well.
>>>
>>> The sequence of triggering the livelock:
>>> 1. Task_A enters pagefault (filemap_fault) and then starts readahead
>>> filemap_fault
>>>  -> do_sync_mmap_readahead
>>>     -> ra_submit
>>>        ->__do_page_cache_readahead // here we allocate the readahead pages
>>>          ->read_pages
>>>          ...
>>>            ->add_to_page_cache_locked
>>>              //for each page, we do the try charge and then add the page into
>>>              //radix tree. If one of the try charge failed, it enters per-memcg
>>>              //oom while holding the page lock of previous readahead pages.
>>>
>>>             // in the memcg oom killer, it picks a task within the same memcg
>>>             // and mark it TIF_MEMDIE. then it goes back into retry loop and
>>>             // hopes the task exits to free some memory.
>>>
>>> 2. Task_B enters pagefault (filemap_fault) and finds the page in radix tree (
>>> one of the readahead pages from ProcessA)
>>>
>>> filemap_fault
>>>  ->__lock_page // here it is marked as TIF_MEMDIE. but it can not proceed since
>>>                // the page lock is hold by ProcessA looping at OOM.
>>>
>>> Since the TIF_MEMDIE task_B is live locked, it ends up blocking other tasks
>>> making forward progress since they are also checking the flag in
>>> select_bad_process. The same issue exists in the non-memcg world. Instead of
>>> entering oom through mem_cgroup_cache_charge(), we might enter it through
>>> radix_tree_preload().
>>>
>>> The proposed fix here is to pass __GFP_NORETRY gfp_mask into try charge under
>>> readahead. Then we skip entering memcg OOM kill which eliminates the case where
>>> it OOMs on one page and holds other page locks. It seems to be safe to do that
>>> since both filemap_fault() and do_generic_file_read() handles the fallback case
>>> of "no_cached_page".
>>>
>>> Note:
>>> After this patch, we might experience some charge fails for readahead pages
>>> (since we don't enter oom). But this sounds sane compared to letting the system
>>> trying extremely hard to charge a readahead page by doing reclaim and then oom,
>>> the later one also triggers livelock as listed above.
>>>
>>> Signed-off-by: Greg Thelen <gthelen@google.com>
>>> Signed-off-by: Ying Han <yinghan@google.com>
>>
>> Nice catch.
>>
>> The concern is GFP_KERNEL != avoid OOM.
>> Although it works now, it can be changed.
>>
>> With alternative idea, We can use explicit oom_killer_disable with __GFP_NOWARN
>> but it wouldn't work since oom_killer_disabled isn't reference count variable.
>> Of course, we can change it with reference-counted atomic variable.
>> The benefit is it's more explicit and doesn't depends on __GFP_NORETRY implementation.
>> So I don't have a good idea except above.
>
>> If you want __GFP_NORTRY patch, thing we can do best is add comment in detail, at least.
>> both side, here add_to_page_cache_lru and there __GFP_NORETRY in include/linux/gfp.h.
>
> Correct me in case i missed something, looks like I want to backport
> the " x86,mm: make pagefault killable" patch, and we might be able to
> solve the livelock w/o changing the readahead code.
>

I missed lock_page_or_retry Kame pointed out.
So, backport should solve the problem.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2011-12-14 11:31 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-13  2:16 [PATCH 2/2] memcg: fix livelock in try charge during readahead Ying Han
2011-12-13  4:45 ` KAMEZAWA Hiroyuki
2011-12-13 17:59   ` Ying Han
2011-12-13  6:10 ` Minchan Kim
2011-12-13 18:29   ` Ying Han
2011-12-14 10:53     ` Michal Hocko
2011-12-14 11:31     ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox