mm: memcg: A infinite loop in __handle_mm

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* mm: memcg: A infinite loop in __handle_mm_fault()
@ 2014-02-10  0:25 Mizuma, Masayoshi
  2014-02-10 11:19 ` Michal Hocko
  0 siblings, 1 reply; 6+ messages in thread
From: Mizuma, Masayoshi @ 2014-02-10  0:25 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Balbir Singh, KAMEZAWA Hiroyuki,
	cgroups, linux-mm

Hi,

This is a bug report for memory cgroup hang up.
I reproduced this using 3.14-rc1 but I couldn't in 3.7.

When I ran a program (see below) under a limit of memcg, the process hanged up.
Using kprobe trace, I detected the hangup in __handle_mm_fault().
do_huge_pmd_wp_page(), which is called by __handle_mm_fault(), always returns
VM_FAULT_OOM, so it repeats goto retry and the task can't be killed.
--------------------------------------------------
static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                             unsigned long address, unsigned int flags)
{Hi all,

This is a bug report for memory cgroup hang up.
I reproduced this using 3.14-rc1 but I couldn't in 3.7.

When I ran a program (see below) under a limit of memcg, the process hangs up.
Using kprobe trace, I detected the hangup in __handle_mm_fault().
do_huge_pmd_wp_page(), which is called by __handle_mm_fault(), always returns
VM_FAULT_OOM but the task can't be killed.
It seems to be in infinite loop and the process is never killed.

--------------------------------------------------
static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                             unsigned long address, unsigned int flags)
{
...
retry:
        pgd = pgd_offset(mm, address);
...
                        if (dirty && !pmd_write(orig_pmd)) {
                                ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
                                                          orig_pmd);
                                /*
                                 * If COW results in an oom, the huge pmd will
                                 * have been split, so retry the fault on the
                                 * pte for a smaller charge.
                                 */
                                if (unlikely(ret & VM_FAULT_OOM))
                                        goto retry;
--------------------------------------------------

[Step to reproduce]

1. Set memory cgroup as follows:

--------------------------------------------------
# mkdir /sys/fs/cgroup/memory/test
# echo "6M" > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
# echo "6M" > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes 
--------------------------------------------------

2. Ran the following process (test.c).

test.c:
--------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#define SIZE 4*1024*1024
#define HUGE 2*1024*1024
#define PAGESIZE 4096
#define NUM SIZE/PAGESIZE

int main(void)
{
	char *a;
	char *c;
	int i;

	/* wait until set cgroup limits */
	sleep(1);

	posix_memalign((void **)&a, HUGE, SIZE);
	posix_memalign((void **)&c, HUGE, SIZE);

	for (i = 0; i<NUM; i++) {
		*(a + i * PAGESIZE) = *(c + i * PAGESIZE);
	}

	for (i = 0; i<NUM; i++) {
		*(c + i * PAGESIZE) = *(a + i * PAGESIZE);
	}

	free(a);
	free(c);
	return 0;
}
--------------------------------------------------

3. Add it to memory cgroup.
--------------------------------------------------
# ./test &
# echo $! > /sys/fs/cgroup/memory/test/tasks
--------------------------------------------------

Then, the process will hangup.
I checked the infinit loop by using kprobetrace.

Setting of kprobetrace:
--------------------------------------------------
# echo 'p:do_huge_pmd_wp_page do_huge_pmd_wp_page address=%dx' > /sys/kernel/debug/tracing/kprobe_events
# echo 'r:do_huge_pmd_wp_page_r do_huge_pmd_wp_page ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
# echo 'r:mem_cgroup_newpage_charge mem_cgroup_newpage_charge ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
# echo 'r:mem_cgroup_charge_common mem_cgroup_charge_common ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
# echo 'r:__mem_cgroup_try_charge __mem_cgroup_try_charge ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/do_huge_pmd_wp_page/enable
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/do_huge_pmd_wp_page_r/enable
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/mem_cgroup_newpage_charge/enable
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/mem_cgroup_charge_common/enable
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/__mem_cgroup_try_charge/enable
--------------------------------------------------

The result:
--------------------------------------------------
test-2721  [001] dN..  2530.635679: do_huge_pmd_wp_page: (do_huge_pmd_wp_page+0x0/0xa90) address=0x7f55a4400000
test-2721  [001] dN..  2530.635723: __mem_cgroup_try_charge: (mem_cgroup_charge_common+0x4a/0xa0 <- __mem_cgroup_try_charge) ret=0xfffffff4
test-2721  [001] dN..  2530.635724: mem_cgroup_charge_common: (mem_cgroup_newpage_charge+0x26/0x30 <- mem_cgroup_charge_common) ret=0xfffffff4
test-2721  [001] dN..  2530.635725: mem_cgroup_newpage_charge: (do_huge_pmd_wp_page+0x125/0xa90 <- mem_cgroup_newpage_charge) ret=0xfffffff4
test-2721  [001] dN..  2530.635733: do_huge_pmd_wp_page_r: (handle_mm_fault+0x19e/0x4b0 <- do_huge_pmd_wp_page) ret=0x1
test-2721  [001] dN..  2530.635735: do_huge_pmd_wp_page: (do_huge_pmd_wp_page+0x0/0xa90) address=0x7f55a4400000
test-2721  [001] dN..  2530.635761: __mem_cgroup_try_charge: (mem_cgroup_charge_common+0x4a/0xa0 <- __mem_cgroup_try_charge) ret=0xfffffff4
test-2721  [001] dN..  2530.635761: mem_cgroup_charge_common: (mem_cgroup_newpage_charge+0x26/0x30 <- mem_cgroup_charge_common) ret=0xfffffff4
test-2721  [001] dN..  2530.635762: mem_cgroup_newpage_charge: (do_huge_pmd_wp_page+0x125/0xa90 <- mem_cgroup_newpage_charge) ret=0xfffffff4
test-2721  [001] dN..  2530.635768: do_huge_pmd_wp_page_r: (handle_mm_fault+0x19e/0x4b0 <- do_huge_pmd_wp_page) ret=0x1
(...repeat...)
--------------------------------------------------

Regards,
Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
...
retry:
        pgd = pgd_offset(mm, address);
...
                        if (dirty && !pmd_write(orig_pmd)) {
                                ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
                                                          orig_pmd);
                                /*
                                 * If COW results in an oom, the huge pmd will
                                 * have been split, so retry the fault on the
                                 * pte for a smaller charge.
                                 */
                                if (unlikely(ret & VM_FAULT_OOM))
                                        goto retry;
--------------------------------------------------

[Step to reproduce]

1. Set memory cgroup as follows:

--------------------------------------------------
# mkdir /sys/fs/cgroup/memory/test
# echo "6M" > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
# echo "6M" > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes 
--------------------------------------------------

2. Ran the following process (test.c).

test.c:
--------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#define SIZE 4*1024*1024
#define HUGE 2*1024*1024
#define PAGESIZE 4096
#define NUM SIZE/PAGESIZE

int main(void)
{
	char *a;
	char *c;
	int i;

	/* wait until set cgroup limits */
	sleep(1);

	posix_memalign((void **)&a, HUGE, SIZE);
	posix_memalign((void **)&c, HUGE, SIZE);

	for (i = 0; i<NUM; i++) {
		*(a + i * PAGESIZE) = *(c + i * PAGESIZE);
	}

	for (i = 0; i<NUM; i++) {
		*(c + i * PAGESIZE) = *(a + i * PAGESIZE);
	}

	free(a);
	free(c);
	return 0;
}
--------------------------------------------------

3. Add it to memory cgroup.
--------------------------------------------------
# ./test &
# echo $! > /sys/fs/cgroup/memory/test/tasks
--------------------------------------------------

Then, the process will hangup.
I checked the infinit loop by using kprobetrace.

Setting of kprobetrace:
--------------------------------------------------
# echo 'p:do_huge_pmd_wp_page do_huge_pmd_wp_page address=%dx' > /sys/kernel/debug/tracing/kprobe_events
# echo 'r:do_huge_pmd_wp_page_r do_huge_pmd_wp_page ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
# echo 'r:mem_cgroup_newpage_charge mem_cgroup_newpage_charge ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
# echo 'r:mem_cgroup_charge_common mem_cgroup_charge_common ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
# echo 'r:__mem_cgroup_try_charge __mem_cgroup_try_charge ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/do_huge_pmd_wp_page/enable
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/do_huge_pmd_wp_page_r/enable
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/mem_cgroup_newpage_charge/enable
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/mem_cgroup_charge_common/enable
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/__mem_cgroup_try_charge/enable
--------------------------------------------------

The result:
--------------------------------------------------
test-2721  [001] dN..  2530.635679: do_huge_pmd_wp_page: (do_huge_pmd_wp_page+0x0/0xa90) address=0x7f55a4400000
test-2721  [001] dN..  2530.635723: __mem_cgroup_try_charge: (mem_cgroup_charge_common+0x4a/0xa0 <- __mem_cgroup_try_charge) ret=0xfffffff4
test-2721  [001] dN..  2530.635724: mem_cgroup_charge_common: (mem_cgroup_newpage_charge+0x26/0x30 <- mem_cgroup_charge_common) ret=0xfffffff4
test-2721  [001] dN..  2530.635725: mem_cgroup_newpage_charge: (do_huge_pmd_wp_page+0x125/0xa90 <- mem_cgroup_newpage_charge) ret=0xfffffff4
test-2721  [001] dN..  2530.635733: do_huge_pmd_wp_page_r: (handle_mm_fault+0x19e/0x4b0 <- do_huge_pmd_wp_page) ret=0x1
test-2721  [001] dN..  2530.635735: do_huge_pmd_wp_page: (do_huge_pmd_wp_page+0x0/0xa90) address=0x7f55a4400000
test-2721  [001] dN..  2530.635761: __mem_cgroup_try_charge: (mem_cgroup_charge_common+0x4a/0xa0 <- __mem_cgroup_try_charge) ret=0xfffffff4
test-2721  [001] dN..  2530.635761: mem_cgroup_charge_common: (mem_cgroup_newpage_charge+0x26/0x30 <- mem_cgroup_charge_common) ret=0xfffffff4
test-2721  [001] dN..  2530.635762: mem_cgroup_newpage_charge: (do_huge_pmd_wp_page+0x125/0xa90 <- mem_cgroup_newpage_charge) ret=0xfffffff4
test-2721  [001] dN..  2530.635768: do_huge_pmd_wp_page_r: (handle_mm_fault+0x19e/0x4b0 <- do_huge_pmd_wp_page) ret=0x1
(...repeat...)
--------------------------------------------------

Regards,
Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: mm: memcg: A infinite loop in __handle_mm_fault()
  2014-02-10  0:25 mm: memcg: A infinite loop in __handle_mm_fault() Mizuma, Masayoshi
@ 2014-02-10 11:19 ` Michal Hocko
  2014-02-10 11:51   ` Mizuma, Masayoshi
  2014-02-10 12:56   ` Kirill A. Shutemov
  0 siblings, 2 replies; 6+ messages in thread
From: Michal Hocko @ 2014-02-10 11:19 UTC (permalink / raw)
  To: Mizuma, Masayoshi
  Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
	linux-mm, Kirill A. Shutemov

[CCing Kirill]

On Mon 10-02-14 09:25:01, Mizuma, Masayoshi wrote:
> Hi,

Hi,

> This is a bug report for memory cgroup hang up.
> I reproduced this using 3.14-rc1 but I couldn't in 3.7.
> 
> When I ran a program (see below) under a limit of memcg, the process hanged up.
> Using kprobe trace, I detected the hangup in __handle_mm_fault().
> do_huge_pmd_wp_page(), which is called by __handle_mm_fault(), always returns
> VM_FAULT_OOM, so it repeats goto retry and the task can't be killed.

Thanks a lot for this very good report. I would bet the issue is related
to the THP zero page.

__handle_mm_fault retry loop for VM_FAULT_OOM from do_huge_pmd_wp_page
expects that the pmd is marked for splitting so that it can break out
and retry the fault. This is not the case for THP zero page though.
do_huge_pmd_wp_page checks is_huge_zero_pmd and goes to allocate a new
huge page which will succeed in your case because you are hitting memcg
limit not the global memory pressure. But then a new page is charged by
mem_cgroup_newpage_charge which fails. An existing page is then split
and we are returning VM_FAULT_OOM. But we do not have page initialized
in that path because page = pmd_page(orig_pmd) is called after
is_huge_zero_pmd check.

I am not familiar with THP zero page code much but I guess splitting
such a zero page is not a way to go. Instead we should simply drop the
zero page and retry the fault. I would assume that one of
do_huge_pmd_wp_zero_page_fallback or do_huge_pmd_wp_page_fallback should
do the trick but both of them try to charge new page(s) before the
current zero page is uncharged. That makes it prone to the same issue
AFAICS.

But may be Kirill has a better idea.

> --------------------------------------------------
> static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>                              unsigned long address, unsigned int flags)
> {Hi all,
> 
> This is a bug report for memory cgroup hang up.
> I reproduced this using 3.14-rc1 but I couldn't in 3.7.
> 
> When I ran a program (see below) under a limit of memcg, the process hangs up.
> Using kprobe trace, I detected the hangup in __handle_mm_fault().
> do_huge_pmd_wp_page(), which is called by __handle_mm_fault(), always returns
> VM_FAULT_OOM but the task can't be killed.
> It seems to be in infinite loop and the process is never killed.
> 
> --------------------------------------------------
> static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>                              unsigned long address, unsigned int flags)
> {
> ...
> retry:
>         pgd = pgd_offset(mm, address);
> ...
>                         if (dirty && !pmd_write(orig_pmd)) {
>                                 ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
>                                                           orig_pmd);
>                                 /*
>                                  * If COW results in an oom, the huge pmd will
>                                  * have been split, so retry the fault on the
>                                  * pte for a smaller charge.
>                                  */
>                                 if (unlikely(ret & VM_FAULT_OOM))
>                                         goto retry;
> --------------------------------------------------
> 
> [Step to reproduce]
> 
> 1. Set memory cgroup as follows:
> 
> --------------------------------------------------
> # mkdir /sys/fs/cgroup/memory/test
> # echo "6M" > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> # echo "6M" > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes 
> --------------------------------------------------
> 
> 2. Ran the following process (test.c).
> 
> test.c:
> --------------------------------------------------
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #define SIZE 4*1024*1024
> #define HUGE 2*1024*1024
> #define PAGESIZE 4096
> #define NUM SIZE/PAGESIZE
> 
> int main(void)
> {
> 	char *a;
> 	char *c;
> 	int i;
> 
> 	/* wait until set cgroup limits */
> 	sleep(1);
> 
> 	posix_memalign((void **)&a, HUGE, SIZE);
> 	posix_memalign((void **)&c, HUGE, SIZE);
> 
> 	for (i = 0; i<NUM; i++) {
> 		*(a + i * PAGESIZE) = *(c + i * PAGESIZE);
> 	}
> 
> 	for (i = 0; i<NUM; i++) {
> 		*(c + i * PAGESIZE) = *(a + i * PAGESIZE);
> 	}
> 
> 	free(a);
> 	free(c);
> 	return 0;
> }
> --------------------------------------------------
> 
> 3. Add it to memory cgroup.
> --------------------------------------------------
> # ./test &
> # echo $! > /sys/fs/cgroup/memory/test/tasks
> --------------------------------------------------
> 
> Then, the process will hangup.
> I checked the infinit loop by using kprobetrace.
> 
> Setting of kprobetrace:
> --------------------------------------------------
> # echo 'p:do_huge_pmd_wp_page do_huge_pmd_wp_page address=%dx' > /sys/kernel/debug/tracing/kprobe_events
> # echo 'r:do_huge_pmd_wp_page_r do_huge_pmd_wp_page ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
> # echo 'r:mem_cgroup_newpage_charge mem_cgroup_newpage_charge ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
> # echo 'r:mem_cgroup_charge_common mem_cgroup_charge_common ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
> # echo 'r:__mem_cgroup_try_charge __mem_cgroup_try_charge ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/do_huge_pmd_wp_page/enable
> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/do_huge_pmd_wp_page_r/enable
> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/mem_cgroup_newpage_charge/enable
> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/mem_cgroup_charge_common/enable
> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/__mem_cgroup_try_charge/enable
> --------------------------------------------------
> 
> The result:
> --------------------------------------------------
> test-2721  [001] dN..  2530.635679: do_huge_pmd_wp_page: (do_huge_pmd_wp_page+0x0/0xa90) address=0x7f55a4400000
> test-2721  [001] dN..  2530.635723: __mem_cgroup_try_charge: (mem_cgroup_charge_common+0x4a/0xa0 <- __mem_cgroup_try_charge) ret=0xfffffff4
> test-2721  [001] dN..  2530.635724: mem_cgroup_charge_common: (mem_cgroup_newpage_charge+0x26/0x30 <- mem_cgroup_charge_common) ret=0xfffffff4
> test-2721  [001] dN..  2530.635725: mem_cgroup_newpage_charge: (do_huge_pmd_wp_page+0x125/0xa90 <- mem_cgroup_newpage_charge) ret=0xfffffff4
> test-2721  [001] dN..  2530.635733: do_huge_pmd_wp_page_r: (handle_mm_fault+0x19e/0x4b0 <- do_huge_pmd_wp_page) ret=0x1
> test-2721  [001] dN..  2530.635735: do_huge_pmd_wp_page: (do_huge_pmd_wp_page+0x0/0xa90) address=0x7f55a4400000
> test-2721  [001] dN..  2530.635761: __mem_cgroup_try_charge: (mem_cgroup_charge_common+0x4a/0xa0 <- __mem_cgroup_try_charge) ret=0xfffffff4
> test-2721  [001] dN..  2530.635761: mem_cgroup_charge_common: (mem_cgroup_newpage_charge+0x26/0x30 <- mem_cgroup_charge_common) ret=0xfffffff4
> test-2721  [001] dN..  2530.635762: mem_cgroup_newpage_charge: (do_huge_pmd_wp_page+0x125/0xa90 <- mem_cgroup_newpage_charge) ret=0xfffffff4
> test-2721  [001] dN..  2530.635768: do_huge_pmd_wp_page_r: (handle_mm_fault+0x19e/0x4b0 <- do_huge_pmd_wp_page) ret=0x1
> (...repeat...)
> --------------------------------------------------
> 
> Regards,
> Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
> ...
> retry:
>         pgd = pgd_offset(mm, address);
> ...
>                         if (dirty && !pmd_write(orig_pmd)) {
>                                 ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
>                                                           orig_pmd);
>                                 /*
>                                  * If COW results in an oom, the huge pmd will
>                                  * have been split, so retry the fault on the
>                                  * pte for a smaller charge.
>                                  */
>                                 if (unlikely(ret & VM_FAULT_OOM))
>                                         goto retry;
> --------------------------------------------------
> 
> [Step to reproduce]
> 
> 1. Set memory cgroup as follows:
> 
> --------------------------------------------------
> # mkdir /sys/fs/cgroup/memory/test
> # echo "6M" > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> # echo "6M" > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes 
> --------------------------------------------------
> 
> 2. Ran the following process (test.c).
> 
> test.c:
> --------------------------------------------------
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #define SIZE 4*1024*1024
> #define HUGE 2*1024*1024
> #define PAGESIZE 4096
> #define NUM SIZE/PAGESIZE
> 
> int main(void)
> {
> 	char *a;
> 	char *c;
> 	int i;
> 
> 	/* wait until set cgroup limits */
> 	sleep(1);
> 
> 	posix_memalign((void **)&a, HUGE, SIZE);
> 	posix_memalign((void **)&c, HUGE, SIZE);
> 
> 	for (i = 0; i<NUM; i++) {
> 		*(a + i * PAGESIZE) = *(c + i * PAGESIZE);
> 	}
> 
> 	for (i = 0; i<NUM; i++) {
> 		*(c + i * PAGESIZE) = *(a + i * PAGESIZE);
> 	}
> 
> 	free(a);
> 	free(c);
> 	return 0;
> }
> --------------------------------------------------
> 
> 3. Add it to memory cgroup.
> --------------------------------------------------
> # ./test &
> # echo $! > /sys/fs/cgroup/memory/test/tasks
> --------------------------------------------------
> 
> Then, the process will hangup.
> I checked the infinit loop by using kprobetrace.
> 
> Setting of kprobetrace:
> --------------------------------------------------
> # echo 'p:do_huge_pmd_wp_page do_huge_pmd_wp_page address=%dx' > /sys/kernel/debug/tracing/kprobe_events
> # echo 'r:do_huge_pmd_wp_page_r do_huge_pmd_wp_page ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
> # echo 'r:mem_cgroup_newpage_charge mem_cgroup_newpage_charge ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
> # echo 'r:mem_cgroup_charge_common mem_cgroup_charge_common ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
> # echo 'r:__mem_cgroup_try_charge __mem_cgroup_try_charge ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/do_huge_pmd_wp_page/enable
> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/do_huge_pmd_wp_page_r/enable
> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/mem_cgroup_newpage_charge/enable
> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/mem_cgroup_charge_common/enable
> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/__mem_cgroup_try_charge/enable
> --------------------------------------------------
> 
> The result:
> --------------------------------------------------
> test-2721  [001] dN..  2530.635679: do_huge_pmd_wp_page: (do_huge_pmd_wp_page+0x0/0xa90) address=0x7f55a4400000
> test-2721  [001] dN..  2530.635723: __mem_cgroup_try_charge: (mem_cgroup_charge_common+0x4a/0xa0 <- __mem_cgroup_try_charge) ret=0xfffffff4
> test-2721  [001] dN..  2530.635724: mem_cgroup_charge_common: (mem_cgroup_newpage_charge+0x26/0x30 <- mem_cgroup_charge_common) ret=0xfffffff4
> test-2721  [001] dN..  2530.635725: mem_cgroup_newpage_charge: (do_huge_pmd_wp_page+0x125/0xa90 <- mem_cgroup_newpage_charge) ret=0xfffffff4
> test-2721  [001] dN..  2530.635733: do_huge_pmd_wp_page_r: (handle_mm_fault+0x19e/0x4b0 <- do_huge_pmd_wp_page) ret=0x1
> test-2721  [001] dN..  2530.635735: do_huge_pmd_wp_page: (do_huge_pmd_wp_page+0x0/0xa90) address=0x7f55a4400000
> test-2721  [001] dN..  2530.635761: __mem_cgroup_try_charge: (mem_cgroup_charge_common+0x4a/0xa0 <- __mem_cgroup_try_charge) ret=0xfffffff4
> test-2721  [001] dN..  2530.635761: mem_cgroup_charge_common: (mem_cgroup_newpage_charge+0x26/0x30 <- mem_cgroup_charge_common) ret=0xfffffff4
> test-2721  [001] dN..  2530.635762: mem_cgroup_newpage_charge: (do_huge_pmd_wp_page+0x125/0xa90 <- mem_cgroup_newpage_charge) ret=0xfffffff4
> test-2721  [001] dN..  2530.635768: do_huge_pmd_wp_page_r: (handle_mm_fault+0x19e/0x4b0 <- do_huge_pmd_wp_page) ret=0x1
> (...repeat...)
> --------------------------------------------------
> 
> Regards,
> Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: mm: memcg: A infinite loop in __handle_mm_fault()
  2014-02-10 11:19 ` Michal Hocko
@ 2014-02-10 11:51   ` Mizuma, Masayoshi
  2014-02-10 12:56   ` Kirill A. Shutemov
  1 sibling, 0 replies; 6+ messages in thread
From: Mizuma, Masayoshi @ 2014-02-10 11:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups,
	linux-mm, Kirill A. Shutemov



(2014/02/10 20:19), Michal Hocko wrote:
> [CCing Kirill]
>
> On Mon 10-02-14 09:25:01, Mizuma, Masayoshi wrote:
>> Hi,
>
> Hi,

Thank you for response and sorry for my broken mail text (I mistook copy and paste...).

>
>> This is a bug report for memory cgroup hang up.
>> I reproduced this using 3.14-rc1 but I couldn't in 3.7.
>>
>> When I ran a program (see below) under a limit of memcg, the process hanged up.
>> Using kprobe trace, I detected the hangup in __handle_mm_fault().
>> do_huge_pmd_wp_page(), which is called by __handle_mm_fault(), always returns
>> VM_FAULT_OOM, so it repeats goto retry and the task can't be killed.
>
> Thanks a lot for this very good report. I would bet the issue is related
> to the THP zero page.
>
> __handle_mm_fault retry loop for VM_FAULT_OOM from do_huge_pmd_wp_page
> expects that the pmd is marked for splitting so that it can break out
> and retry the fault. This is not the case for THP zero page though.
> do_huge_pmd_wp_page checks is_huge_zero_pmd and goes to allocate a new
> huge page which will succeed in your case because you are hitting memcg
> limit not the global memory pressure. But then a new page is charged by
> mem_cgroup_newpage_charge which fails. An existing page is then split
> and we are returning VM_FAULT_OOM. But we do not have page initialized
> in that path because page = pmd_page(orig_pmd) is called after
> is_huge_zero_pmd check.
>
> I am not familiar with THP zero page code much but I guess splitting
> such a zero page is not a way to go. Instead we should simply drop the
> zero page and retry the fault. I would assume that one of
> do_huge_pmd_wp_zero_page_fallback or do_huge_pmd_wp_page_fallback should
> do the trick but both of them try to charge new page(s) before the
> current zero page is uncharged. That makes it prone to the same issue
> AFAICS.
>
> But may be Kirill has a better idea.

I think this issue is related to THP, too. Because, it is not reproduced when
THP is disabled as following.

# echo never > /sys/kernel/mm/transparent_hugepage/enabled

Regards,
Masayoshi Mizuma

>
> But may be Kirill has a better idea.
>
>> --------------------------------------------------
>> static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>                               unsigned long address, unsigned int flags)
>> {Hi all,
>>
>> This is a bug report for memory cgroup hang up.
>> I reproduced this using 3.14-rc1 but I couldn't in 3.7.
>>
>> When I ran a program (see below) under a limit of memcg, the process hangs up.
>> Using kprobe trace, I detected the hangup in __handle_mm_fault().
>> do_huge_pmd_wp_page(), which is called by __handle_mm_fault(), always returns
>> VM_FAULT_OOM but the task can't be killed.
>> It seems to be in infinite loop and the process is never killed.
>>
>> --------------------------------------------------
>> static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>                               unsigned long address, unsigned int flags)
>> {
>> ...
>> retry:
>>          pgd = pgd_offset(mm, address);
>> ...
>>                          if (dirty && !pmd_write(orig_pmd)) {
>>                                  ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
>>                                                            orig_pmd);
>>                                  /*
>>                                   * If COW results in an oom, the huge pmd will
>>                                   * have been split, so retry the fault on the
>>                                   * pte for a smaller charge.
>>                                   */
>>                                  if (unlikely(ret & VM_FAULT_OOM))
>>                                          goto retry;
>> --------------------------------------------------
>>
>> [Step to reproduce]
>>
>> 1. Set memory cgroup as follows:
>>
>> --------------------------------------------------
>> # mkdir /sys/fs/cgroup/memory/test
>> # echo "6M" > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>> # echo "6M" > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes
>> --------------------------------------------------
>>
>> 2. Ran the following process (test.c).
>>
>> test.c:
>> --------------------------------------------------
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #define SIZE 4*1024*1024
>> #define HUGE 2*1024*1024
>> #define PAGESIZE 4096
>> #define NUM SIZE/PAGESIZE
>>
>> int main(void)
>> {
>> 	char *a;
>> 	char *c;
>> 	int i;
>>
>> 	/* wait until set cgroup limits */
>> 	sleep(1);
>>
>> 	posix_memalign((void **)&a, HUGE, SIZE);
>> 	posix_memalign((void **)&c, HUGE, SIZE);
>>
>> 	for (i = 0; i<NUM; i++) {
>> 		*(a + i * PAGESIZE) = *(c + i * PAGESIZE);
>> 	}
>>
>> 	for (i = 0; i<NUM; i++) {
>> 		*(c + i * PAGESIZE) = *(a + i * PAGESIZE);
>> 	}
>>
>> 	free(a);
>> 	free(c);
>> 	return 0;
>> }
>> --------------------------------------------------
>>
>> 3. Add it to memory cgroup.
>> --------------------------------------------------
>> # ./test &
>> # echo $! > /sys/fs/cgroup/memory/test/tasks
>> --------------------------------------------------
>>
>> Then, the process will hangup.
>> I checked the infinit loop by using kprobetrace.
>>
>> Setting of kprobetrace:
>> --------------------------------------------------
>> # echo 'p:do_huge_pmd_wp_page do_huge_pmd_wp_page address=%dx' > /sys/kernel/debug/tracing/kprobe_events
>> # echo 'r:do_huge_pmd_wp_page_r do_huge_pmd_wp_page ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
>> # echo 'r:mem_cgroup_newpage_charge mem_cgroup_newpage_charge ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
>> # echo 'r:mem_cgroup_charge_common mem_cgroup_charge_common ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
>> # echo 'r:__mem_cgroup_try_charge __mem_cgroup_try_charge ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
>> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/do_huge_pmd_wp_page/enable
>> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/do_huge_pmd_wp_page_r/enable
>> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/mem_cgroup_newpage_charge/enable
>> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/mem_cgroup_charge_common/enable
>> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/__mem_cgroup_try_charge/enable
>> --------------------------------------------------
>>
>> The result:
>> --------------------------------------------------
>> test-2721  [001] dN..  2530.635679: do_huge_pmd_wp_page: (do_huge_pmd_wp_page+0x0/0xa90) address=0x7f55a4400000
>> test-2721  [001] dN..  2530.635723: __mem_cgroup_try_charge: (mem_cgroup_charge_common+0x4a/0xa0 <- __mem_cgroup_try_charge) ret=0xfffffff4
>> test-2721  [001] dN..  2530.635724: mem_cgroup_charge_common: (mem_cgroup_newpage_charge+0x26/0x30 <- mem_cgroup_charge_common) ret=0xfffffff4
>> test-2721  [001] dN..  2530.635725: mem_cgroup_newpage_charge: (do_huge_pmd_wp_page+0x125/0xa90 <- mem_cgroup_newpage_charge) ret=0xfffffff4
>> test-2721  [001] dN..  2530.635733: do_huge_pmd_wp_page_r: (handle_mm_fault+0x19e/0x4b0 <- do_huge_pmd_wp_page) ret=0x1
>> test-2721  [001] dN..  2530.635735: do_huge_pmd_wp_page: (do_huge_pmd_wp_page+0x0/0xa90) address=0x7f55a4400000
>> test-2721  [001] dN..  2530.635761: __mem_cgroup_try_charge: (mem_cgroup_charge_common+0x4a/0xa0 <- __mem_cgroup_try_charge) ret=0xfffffff4
>> test-2721  [001] dN..  2530.635761: mem_cgroup_charge_common: (mem_cgroup_newpage_charge+0x26/0x30 <- mem_cgroup_charge_common) ret=0xfffffff4
>> test-2721  [001] dN..  2530.635762: mem_cgroup_newpage_charge: (do_huge_pmd_wp_page+0x125/0xa90 <- mem_cgroup_newpage_charge) ret=0xfffffff4
>> test-2721  [001] dN..  2530.635768: do_huge_pmd_wp_page_r: (handle_mm_fault+0x19e/0x4b0 <- do_huge_pmd_wp_page) ret=0x1
>> (...repeat...)
>> --------------------------------------------------
>>
>> Regards,
>> Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
>> ...
>> retry:
>>          pgd = pgd_offset(mm, address);
>> ...
>>                          if (dirty && !pmd_write(orig_pmd)) {
>>                                  ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
>>                                                            orig_pmd);
>>                                  /*
>>                                   * If COW results in an oom, the huge pmd will
>>                                   * have been split, so retry the fault on the
>>                                   * pte for a smaller charge.
>>                                   */
>>                                  if (unlikely(ret & VM_FAULT_OOM))
>>                                          goto retry;
>> --------------------------------------------------
>>
>> [Step to reproduce]
>>
>> 1. Set memory cgroup as follows:
>>
>> --------------------------------------------------
>> # mkdir /sys/fs/cgroup/memory/test
>> # echo "6M" > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>> # echo "6M" > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes
>> --------------------------------------------------
>>
>> 2. Ran the following process (test.c).
>>
>> test.c:
>> --------------------------------------------------
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #define SIZE 4*1024*1024
>> #define HUGE 2*1024*1024
>> #define PAGESIZE 4096
>> #define NUM SIZE/PAGESIZE
>>
>> int main(void)
>> {
>> 	char *a;
>> 	char *c;
>> 	int i;
>>
>> 	/* wait until set cgroup limits */
>> 	sleep(1);
>>
>> 	posix_memalign((void **)&a, HUGE, SIZE);
>> 	posix_memalign((void **)&c, HUGE, SIZE);
>>
>> 	for (i = 0; i<NUM; i++) {
>> 		*(a + i * PAGESIZE) = *(c + i * PAGESIZE);
>> 	}
>>
>> 	for (i = 0; i<NUM; i++) {
>> 		*(c + i * PAGESIZE) = *(a + i * PAGESIZE);
>> 	}
>>
>> 	free(a);
>> 	free(c);
>> 	return 0;
>> }
>> --------------------------------------------------
>>
>> 3. Add it to memory cgroup.
>> --------------------------------------------------
>> # ./test &
>> # echo $! > /sys/fs/cgroup/memory/test/tasks
>> --------------------------------------------------
>>
>> Then, the process will hangup.
>> I checked the infinit loop by using kprobetrace.
>>
>> Setting of kprobetrace:
>> --------------------------------------------------
>> # echo 'p:do_huge_pmd_wp_page do_huge_pmd_wp_page address=%dx' > /sys/kernel/debug/tracing/kprobe_events
>> # echo 'r:do_huge_pmd_wp_page_r do_huge_pmd_wp_page ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
>> # echo 'r:mem_cgroup_newpage_charge mem_cgroup_newpage_charge ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
>> # echo 'r:mem_cgroup_charge_common mem_cgroup_charge_common ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
>> # echo 'r:__mem_cgroup_try_charge __mem_cgroup_try_charge ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
>> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/do_huge_pmd_wp_page/enable
>> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/do_huge_pmd_wp_page_r/enable
>> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/mem_cgroup_newpage_charge/enable
>> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/mem_cgroup_charge_common/enable
>> # echo 1 > /sys/kernel/debug/tracing/events/kprobes/__mem_cgroup_try_charge/enable
>> --------------------------------------------------
>>
>> The result:
>> --------------------------------------------------
>> test-2721  [001] dN..  2530.635679: do_huge_pmd_wp_page: (do_huge_pmd_wp_page+0x0/0xa90) address=0x7f55a4400000
>> test-2721  [001] dN..  2530.635723: __mem_cgroup_try_charge: (mem_cgroup_charge_common+0x4a/0xa0 <- __mem_cgroup_try_charge) ret=0xfffffff4
>> test-2721  [001] dN..  2530.635724: mem_cgroup_charge_common: (mem_cgroup_newpage_charge+0x26/0x30 <- mem_cgroup_charge_common) ret=0xfffffff4
>> test-2721  [001] dN..  2530.635725: mem_cgroup_newpage_charge: (do_huge_pmd_wp_page+0x125/0xa90 <- mem_cgroup_newpage_charge) ret=0xfffffff4
>> test-2721  [001] dN..  2530.635733: do_huge_pmd_wp_page_r: (handle_mm_fault+0x19e/0x4b0 <- do_huge_pmd_wp_page) ret=0x1
>> test-2721  [001] dN..  2530.635735: do_huge_pmd_wp_page: (do_huge_pmd_wp_page+0x0/0xa90) address=0x7f55a4400000
>> test-2721  [001] dN..  2530.635761: __mem_cgroup_try_charge: (mem_cgroup_charge_common+0x4a/0xa0 <- __mem_cgroup_try_charge) ret=0xfffffff4
>> test-2721  [001] dN..  2530.635761: mem_cgroup_charge_common: (mem_cgroup_newpage_charge+0x26/0x30 <- mem_cgroup_charge_common) ret=0xfffffff4
>> test-2721  [001] dN..  2530.635762: mem_cgroup_newpage_charge: (do_huge_pmd_wp_page+0x125/0xa90 <- mem_cgroup_newpage_charge) ret=0xfffffff4
>> test-2721  [001] dN..  2530.635768: do_huge_pmd_wp_page_r: (handle_mm_fault+0x19e/0x4b0 <- do_huge_pmd_wp_page) ret=0x1
>> (...repeat...)
>> --------------------------------------------------
>>
>> Regards,
>> Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
>> --
>> To unsubscribe from this list: send the line "unsubscribe cgroups" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: mm: memcg: A infinite loop in __handle_mm_fault()
  2014-02-10 11:19 ` Michal Hocko
  2014-02-10 11:51   ` Mizuma, Masayoshi
@ 2014-02-10 12:56   ` Kirill A. Shutemov
  2014-02-10 13:52     ` Michal Hocko
  2014-02-12  1:04     ` Mizuma, Masayoshi
  1 sibling, 2 replies; 6+ messages in thread
From: Kirill A. Shutemov @ 2014-02-10 12:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mizuma, Masayoshi, Johannes Weiner, Balbir Singh,
	KAMEZAWA Hiroyuki, cgroups, linux-mm, Kirill A. Shutemov

Michal Hocko wrote:
> [CCing Kirill]
> 
> On Mon 10-02-14 09:25:01, Mizuma, Masayoshi wrote:
> > Hi,
> 
> Hi,
> 
> > This is a bug report for memory cgroup hang up.
> > I reproduced this using 3.14-rc1 but I couldn't in 3.7.
> > 
> > When I ran a program (see below) under a limit of memcg, the process hanged up.
> > Using kprobe trace, I detected the hangup in __handle_mm_fault().
> > do_huge_pmd_wp_page(), which is called by __handle_mm_fault(), always returns
> > VM_FAULT_OOM, so it repeats goto retry and the task can't be killed.
> 
> Thanks a lot for this very good report. I would bet the issue is related
> to the THP zero page.
> 
> __handle_mm_fault retry loop for VM_FAULT_OOM from do_huge_pmd_wp_page
> expects that the pmd is marked for splitting so that it can break out
> and retry the fault. This is not the case for THP zero page though.
> do_huge_pmd_wp_page checks is_huge_zero_pmd and goes to allocate a new
> huge page which will succeed in your case because you are hitting memcg
> limit not the global memory pressure. But then a new page is charged by
> mem_cgroup_newpage_charge which fails. An existing page is then split
> and we are returning VM_FAULT_OOM. But we do not have page initialized
> in that path because page = pmd_page(orig_pmd) is called after
> is_huge_zero_pmd check.
> 
> I am not familiar with THP zero page code much but I guess splitting
> such a zero page is not a way to go. Instead we should simply drop the
> zero page and retry the fault. I would assume that one of
> do_huge_pmd_wp_zero_page_fallback or do_huge_pmd_wp_page_fallback should
> do the trick but both of them try to charge new page(s) before the
> current zero page is uncharged. That makes it prone to the same issue
> AFAICS.
> 
> But may be Kirill has a better idea.

Your analysis looks accurate. Although I was not able to reproduce
hang up.

The problem with do_huge_pmd_wp_zero_page_fallback() that it can return
VM_FAULT_OOM if it failed to allocate new *small* page, so it's real OOM.

Untested patch below tries to fix. Masayoshi, could you test.

BTW, Michal, I've triggered sleep-in-atomic bug in
mem_cgroup_print_oom_info():

[    2.386563] Task in /test killed as a result of limit of /test
[    2.387326] memory: usage 10240kB, limit 10240kB, failcnt 51
[    2.388098] memory+swap: usage 10240kB, limit 10240kB, failcnt 0
[    2.388861] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
[    2.389640] Memory cgroup stats for /test:
[    2.390178] BUG: sleeping function called from invalid context at /home/space/kas/git/public/linux/kernel/cpu.c:68
[    2.391516] in_atomic(): 1, irqs_disabled(): 0, pid: 66, name: memcg_test
[    2.392416] 2 locks held by memcg_test/66:
[    2.392945]  #0:  (memcg_oom_lock#2){+.+...}, at: [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
[    2.394233]  #1:  (oom_info_lock){+.+...}, at: [<ffffffff81197b2a>] mem_cgroup_print_oom_info+0x2a/0x390
[    2.395496] CPU: 2 PID: 66 Comm: memcg_test Not tainted 3.14.0-rc1-dirty #745
[    2.396536] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
[    2.397540]  ffffffff81a3cc90 ffff88081d26dba0 ffffffff81776ea3 0000000000000000
[    2.398541]  ffff88081d26dbc8 ffffffff8108418a 0000000000000000 ffff88081d15c000
[    2.399533]  0000000000000000 ffff88081d26dbd8 ffffffff8104f6bc ffff88081d26dc10
[    2.400588] Call Trace:
[    2.400908]  [<ffffffff81776ea3>] dump_stack+0x4d/0x66
[    2.401578]  [<ffffffff8108418a>] __might_sleep+0x16a/0x210
[    2.402295]  [<ffffffff8104f6bc>] get_online_cpus+0x1c/0x60
[    2.403005]  [<ffffffff8118fb67>] mem_cgroup_read_stat+0x27/0xb0
[    2.403769]  [<ffffffff81197d60>] mem_cgroup_print_oom_info+0x260/0x390
[    2.404653]  [<ffffffff8177314e>] dump_header+0x88/0x251
[    2.405342]  [<ffffffff810a3bfd>] ? trace_hardirqs_on+0xd/0x10
[    2.406098]  [<ffffffff81130618>] oom_kill_process+0x258/0x3d0
[    2.406833]  [<ffffffff81198746>] mem_cgroup_oom_synchronize+0x656/0x6c0
[    2.407674]  [<ffffffff811973a0>] ? mem_cgroup_charge_common+0xd0/0xd0
[    2.408553]  [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
[    2.409354]  [<ffffffff817712f7>] mm_fault_error+0x91/0x189
[    2.410069]  [<ffffffff81783eae>] __do_page_fault+0x48e/0x580
[    2.410791]  [<ffffffff8108f656>] ? local_clock+0x16/0x30
[    2.411467]  [<ffffffff810a3bfd>] ? trace_hardirqs_on+0xd/0x10
[    2.412248]  [<ffffffff8177f6fc>] ? _raw_spin_unlock_irq+0x2c/0x40
[    2.413039]  [<ffffffff8108312b>] ? finish_task_switch+0x7b/0x100
[    2.413821]  [<ffffffff813b954a>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[    2.414652]  [<ffffffff81783fae>] do_page_fault+0xe/0x10
[    2.415330]  [<ffffffff81780552>] page_fault+0x22/0x30

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf974e1..974eb9eea2c0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1166,8 +1166,10 @@ alloc:
 		} else {
 			ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
 					pmd, orig_pmd, page, haddr);
-			if (ret & VM_FAULT_OOM)
+			if (ret & VM_FAULT_OOM) {
 				split_huge_page(page);
+				ret |= VM_FAULT_FALLBACK;
+			}
 			put_page(page);
 		}
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1179,9 +1181,12 @@ alloc:
 		if (page) {
 			split_huge_page(page);
 			put_page(page);
+			ret |= VM_FAULT_FALLBACK;
+		} else {
+			ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
+					address, pmd, orig_pmd, haddr);
 		}
 		count_vm_event(THP_FAULT_FALLBACK);
-		ret |= VM_FAULT_OOM;
 		goto out;
 	}
 
diff --git a/mm/memory.c b/mm/memory.c
index be6a0c0d4ae0..3b57b7864667 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3703,7 +3703,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
-retry:
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
 	if (!pud)
@@ -3741,20 +3740,13 @@ retry:
 			if (dirty && !pmd_write(orig_pmd)) {
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);
-				/*
-				 * If COW results in an oom, the huge pmd will
-				 * have been split, so retry the fault on the
-				 * pte for a smaller charge.
-				 */
-				if (unlikely(ret & VM_FAULT_OOM))
-					goto retry;
-				return ret;
+				if (!(ret & VM_FAULT_FALLBACK))
+					return ret;
 			} else {
 				huge_pmd_set_accessed(mm, vma, address, pmd,
 						      orig_pmd, dirty);
+				return 0;
 			}
-
-			return 0;
 		}
 	}
 
-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: mm: memcg: A infinite loop in __handle_mm_fault()
  2014-02-10 12:56   ` Kirill A. Shutemov
@ 2014-02-10 13:52     ` Michal Hocko
  2014-02-12  1:04     ` Mizuma, Masayoshi
  1 sibling, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2014-02-10 13:52 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mizuma, Masayoshi, Johannes Weiner, Balbir Singh,
	KAMEZAWA Hiroyuki, cgroups, linux-mm

On Mon 10-02-14 14:56:55, Kirill A. Shutemov wrote:
[...]
> BTW, Michal, I've triggered sleep-in-atomic bug in
> mem_cgroup_print_oom_info():

Ouch, I am wondering why I haven't triggered that while testing the
patch.

# CONFIG_DEBUG_ATOMIC_SLEEP is not set
explains why might_sleep didn't warn.

Anyway, fix posted in a separate mail. Thanks for reporting.
 
> [    2.386563] Task in /test killed as a result of limit of /test
> [    2.387326] memory: usage 10240kB, limit 10240kB, failcnt 51
> [    2.388098] memory+swap: usage 10240kB, limit 10240kB, failcnt 0
> [    2.388861] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
> [    2.389640] Memory cgroup stats for /test:
> [    2.390178] BUG: sleeping function called from invalid context at /home/space/kas/git/public/linux/kernel/cpu.c:68
> [    2.391516] in_atomic(): 1, irqs_disabled(): 0, pid: 66, name: memcg_test
> [    2.392416] 2 locks held by memcg_test/66:
> [    2.392945]  #0:  (memcg_oom_lock#2){+.+...}, at: [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
> [    2.394233]  #1:  (oom_info_lock){+.+...}, at: [<ffffffff81197b2a>] mem_cgroup_print_oom_info+0x2a/0x390
> [    2.395496] CPU: 2 PID: 66 Comm: memcg_test Not tainted 3.14.0-rc1-dirty #745
> [    2.396536] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
> [    2.397540]  ffffffff81a3cc90 ffff88081d26dba0 ffffffff81776ea3 0000000000000000
> [    2.398541]  ffff88081d26dbc8 ffffffff8108418a 0000000000000000 ffff88081d15c000
> [    2.399533]  0000000000000000 ffff88081d26dbd8 ffffffff8104f6bc ffff88081d26dc10
> [    2.400588] Call Trace:
> [    2.400908]  [<ffffffff81776ea3>] dump_stack+0x4d/0x66
> [    2.401578]  [<ffffffff8108418a>] __might_sleep+0x16a/0x210
> [    2.402295]  [<ffffffff8104f6bc>] get_online_cpus+0x1c/0x60
> [    2.403005]  [<ffffffff8118fb67>] mem_cgroup_read_stat+0x27/0xb0
> [    2.403769]  [<ffffffff81197d60>] mem_cgroup_print_oom_info+0x260/0x390
> [    2.404653]  [<ffffffff8177314e>] dump_header+0x88/0x251
> [    2.405342]  [<ffffffff810a3bfd>] ? trace_hardirqs_on+0xd/0x10
> [    2.406098]  [<ffffffff81130618>] oom_kill_process+0x258/0x3d0
> [    2.406833]  [<ffffffff81198746>] mem_cgroup_oom_synchronize+0x656/0x6c0
> [    2.407674]  [<ffffffff811973a0>] ? mem_cgroup_charge_common+0xd0/0xd0
> [    2.408553]  [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
> [    2.409354]  [<ffffffff817712f7>] mm_fault_error+0x91/0x189
> [    2.410069]  [<ffffffff81783eae>] __do_page_fault+0x48e/0x580
> [    2.410791]  [<ffffffff8108f656>] ? local_clock+0x16/0x30
> [    2.411467]  [<ffffffff810a3bfd>] ? trace_hardirqs_on+0xd/0x10
> [    2.412248]  [<ffffffff8177f6fc>] ? _raw_spin_unlock_irq+0x2c/0x40
> [    2.413039]  [<ffffffff8108312b>] ? finish_task_switch+0x7b/0x100
> [    2.413821]  [<ffffffff813b954a>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [    2.414652]  [<ffffffff81783fae>] do_page_fault+0xe/0x10
> [    2.415330]  [<ffffffff81780552>] page_fault+0x22/0x30
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: mm: memcg: A infinite loop in __handle_mm_fault()
  2014-02-10 12:56   ` Kirill A. Shutemov
  2014-02-10 13:52     ` Michal Hocko
@ 2014-02-12  1:04     ` Mizuma, Masayoshi
  1 sibling, 0 replies; 6+ messages in thread
From: Mizuma, Masayoshi @ 2014-02-12  1:04 UTC (permalink / raw)
  To: Kirill A. Shutemov, Michal Hocko
  Cc: Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups, linux-mm

On Mon, 10 Feb 2014 14:56:55 +0200 Kirill A. Shutemov wrote:
> Michal Hocko wrote:
>> [CCing Kirill]
>>
>> On Mon 10-02-14 09:25:01, Mizuma, Masayoshi wrote:
>>> Hi,
>>
>> Hi,
>>
>>> This is a bug report for memory cgroup hang up.
>>> I reproduced this using 3.14-rc1 but I couldn't in 3.7.
>>>
>>> When I ran a program (see below) under a limit of memcg, the process hanged up.
>>> Using kprobe trace, I detected the hangup in __handle_mm_fault().
>>> do_huge_pmd_wp_page(), which is called by __handle_mm_fault(), always returns
>>> VM_FAULT_OOM, so it repeats goto retry and the task can't be killed.
>>
>> Thanks a lot for this very good report. I would bet the issue is related
>> to the THP zero page.
>>
>> __handle_mm_fault retry loop for VM_FAULT_OOM from do_huge_pmd_wp_page
>> expects that the pmd is marked for splitting so that it can break out
>> and retry the fault. This is not the case for THP zero page though.
>> do_huge_pmd_wp_page checks is_huge_zero_pmd and goes to allocate a new
>> huge page which will succeed in your case because you are hitting memcg
>> limit not the global memory pressure. But then a new page is charged by
>> mem_cgroup_newpage_charge which fails. An existing page is then split
>> and we are returning VM_FAULT_OOM. But we do not have page initialized
>> in that path because page = pmd_page(orig_pmd) is called after
>> is_huge_zero_pmd check.
>>
>> I am not familiar with THP zero page code much but I guess splitting
>> such a zero page is not a way to go. Instead we should simply drop the
>> zero page and retry the fault. I would assume that one of
>> do_huge_pmd_wp_zero_page_fallback or do_huge_pmd_wp_page_fallback should
>> do the trick but both of them try to charge new page(s) before the
>> current zero page is uncharged. That makes it prone to the same issue
>> AFAICS.
>>
>> But may be Kirill has a better idea.
> 
> Your analysis looks accurate. Although I was not able to reproduce
> hang up.
> 
> The problem with do_huge_pmd_wp_zero_page_fallback() that it can return
> VM_FAULT_OOM if it failed to allocate new *small* page, so it's real OOM.
> 
> Untested patch below tries to fix. Masayoshi, could you test.

I applied the patch to 3.14-rc2.
Then, I confirmed this issue does not happen and the process is killed by
oom-killer normally.
Thank you for analyzing the root cause and providing the fix!

Regards,
Masayoshi Mizuma

> 
> BTW, Michal, I've triggered sleep-in-atomic bug in
> mem_cgroup_print_oom_info():
> 
> [    2.386563] Task in /test killed as a result of limit of /test
> [    2.387326] memory: usage 10240kB, limit 10240kB, failcnt 51
> [    2.388098] memory+swap: usage 10240kB, limit 10240kB, failcnt 0
> [    2.388861] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
> [    2.389640] Memory cgroup stats for /test:
> [    2.390178] BUG: sleeping function called from invalid context at /home/space/kas/git/public/linux/kernel/cpu.c:68
> [    2.391516] in_atomic(): 1, irqs_disabled(): 0, pid: 66, name: memcg_test
> [    2.392416] 2 locks held by memcg_test/66:
> [    2.392945]  #0:  (memcg_oom_lock#2){+.+...}, at: [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
> [    2.394233]  #1:  (oom_info_lock){+.+...}, at: [<ffffffff81197b2a>] mem_cgroup_print_oom_info+0x2a/0x390
> [    2.395496] CPU: 2 PID: 66 Comm: memcg_test Not tainted 3.14.0-rc1-dirty #745
> [    2.396536] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
> [    2.397540]  ffffffff81a3cc90 ffff88081d26dba0 ffffffff81776ea3 0000000000000000
> [    2.398541]  ffff88081d26dbc8 ffffffff8108418a 0000000000000000 ffff88081d15c000
> [    2.399533]  0000000000000000 ffff88081d26dbd8 ffffffff8104f6bc ffff88081d26dc10
> [    2.400588] Call Trace:
> [    2.400908]  [<ffffffff81776ea3>] dump_stack+0x4d/0x66
> [    2.401578]  [<ffffffff8108418a>] __might_sleep+0x16a/0x210
> [    2.402295]  [<ffffffff8104f6bc>] get_online_cpus+0x1c/0x60
> [    2.403005]  [<ffffffff8118fb67>] mem_cgroup_read_stat+0x27/0xb0
> [    2.403769]  [<ffffffff81197d60>] mem_cgroup_print_oom_info+0x260/0x390
> [    2.404653]  [<ffffffff8177314e>] dump_header+0x88/0x251
> [    2.405342]  [<ffffffff810a3bfd>] ? trace_hardirqs_on+0xd/0x10
> [    2.406098]  [<ffffffff81130618>] oom_kill_process+0x258/0x3d0
> [    2.406833]  [<ffffffff81198746>] mem_cgroup_oom_synchronize+0x656/0x6c0
> [    2.407674]  [<ffffffff811973a0>] ? mem_cgroup_charge_common+0xd0/0xd0
> [    2.408553]  [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
> [    2.409354]  [<ffffffff817712f7>] mm_fault_error+0x91/0x189
> [    2.410069]  [<ffffffff81783eae>] __do_page_fault+0x48e/0x580
> [    2.410791]  [<ffffffff8108f656>] ? local_clock+0x16/0x30
> [    2.411467]  [<ffffffff810a3bfd>] ? trace_hardirqs_on+0xd/0x10
> [    2.412248]  [<ffffffff8177f6fc>] ? _raw_spin_unlock_irq+0x2c/0x40
> [    2.413039]  [<ffffffff8108312b>] ? finish_task_switch+0x7b/0x100
> [    2.413821]  [<ffffffff813b954a>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [    2.414652]  [<ffffffff81783fae>] do_page_fault+0xe/0x10
> [    2.415330]  [<ffffffff81780552>] page_fault+0x22/0x30
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 82166bf974e1..974eb9eea2c0 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1166,8 +1166,10 @@ alloc:
>   		} else {
>   			ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
>   					pmd, orig_pmd, page, haddr);
> -			if (ret & VM_FAULT_OOM)
> +			if (ret & VM_FAULT_OOM) {
>   				split_huge_page(page);
> +				ret |= VM_FAULT_FALLBACK;
> +			}
>   			put_page(page);
>   		}
>   		count_vm_event(THP_FAULT_FALLBACK);
> @@ -1179,9 +1181,12 @@ alloc:
>   		if (page) {
>   			split_huge_page(page);
>   			put_page(page);
> +			ret |= VM_FAULT_FALLBACK;
> +		} else {
> +			ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
> +					address, pmd, orig_pmd, haddr);
>   		}
>   		count_vm_event(THP_FAULT_FALLBACK);
> -		ret |= VM_FAULT_OOM;
>   		goto out;
>   	}
>   
> diff --git a/mm/memory.c b/mm/memory.c
> index be6a0c0d4ae0..3b57b7864667 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3703,7 +3703,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   	if (unlikely(is_vm_hugetlb_page(vma)))
>   		return hugetlb_fault(mm, vma, address, flags);
>   
> -retry:
>   	pgd = pgd_offset(mm, address);
>   	pud = pud_alloc(mm, pgd, address);
>   	if (!pud)
> @@ -3741,20 +3740,13 @@ retry:
>   			if (dirty && !pmd_write(orig_pmd)) {
>   				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
>   							  orig_pmd);
> -				/*
> -				 * If COW results in an oom, the huge pmd will
> -				 * have been split, so retry the fault on the
> -				 * pte for a smaller charge.
> -				 */
> -				if (unlikely(ret & VM_FAULT_OOM))
> -					goto retry;
> -				return ret;
> +				if (!(ret & VM_FAULT_FALLBACK))
> +					return ret;
>   			} else {
>   				huge_pmd_set_accessed(mm, vma, address, pmd,
>   						      orig_pmd, dirty);
> +				return 0;
>   			}
> -
> -			return 0;
>   		}
>   	}
>   
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2014-02-12  1:07 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-10  0:25 mm: memcg: A infinite loop in __handle_mm_fault() Mizuma, Masayoshi
2014-02-10 11:19 ` Michal Hocko
2014-02-10 11:51   ` Mizuma, Masayoshi
2014-02-10 12:56   ` Kirill A. Shutemov
2014-02-10 13:52     ` Michal Hocko
2014-02-12  1:04     ` Mizuma, Masayoshi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox