From: Stillinux <stillinux@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
liuzhengyuan@kylinos.cn, liuyun01@kylinos.cn,
Johannes Weiner <hannes@cmpxchg.org>,
Hugh Dickins <hughd@google.com>
Subject: Re: [RFC PATCH] mm/swap: fix system stuck due to infinite loop
Date: Sun, 4 Apr 2021 17:26:22 +0800 [thread overview]
Message-ID: <CAKN5gCgh31rhvnoVSXVrOzony+KcRrY0UwzdgT1WZq84pD2EnA@mail.gmail.com> (raw)
In-Reply-To: <20210402174447.2abccc77cdca5cad67756d55@linux-foundation.org>
> I'm not understanding why swapcache_prepare() repeatedly returns
> -EEXIST in this situation?
We hit this case on several of our arm64 PCs: the system gets stuck, though
only probabilistically. We used sysrq to grab a vmcore, initially on a 5.4.18
kernel. Later we reproduced the same hang on 5.11.0, and from the vmcore
analysis the latest kernel code still has the same logic.
Our analysis steps are as follows:
1. We found a large number of tasks in D state, with kernel stacks mostly
stuck in io_schedule() under rq_qos_wait(). Looking at the rq_wb structure,
inflight in rq_wait[0] is 1, i.e. one request has never completed, so
blk-wbt is throttling the I/O of all the other tasks.
crash> ps -S
RU: 5
IN: 468
ID: 65
UN: 117
ZO: 4
==============
rq_wait = {{
wait = {
inflight = {
counter = 1
}
2. We tracked the request down: it had been plugged into the blk_plug of
task 938.
crash> ps -A
PID PPID CPU TASK ST %MEM VSZ RSS COMM
> 0 0 1 ffffffa0f5e0b4c0 RU 0.0 0 0 [swapper/1]
> 0 0 2 ffffffa0f5ec0040 RU 0.0 0 0 [swapper/2]
> 0 0 3 ffffffa0f5ec6940 RU 0.0 0 0 [swapper/3]
> 938 929 0 ffffffa0eb00c640 RU 0.0 2698236 15092 qaxsafed
crash> task 938 | grep TASK
PID: 938 TASK: ffffffa0eb00c640 CPU: 0 COMMAND: "qaxsafed"
crash> struct task_struct ffffffa0eb00c640 | grep plug
plug = 0xffffffa0e918fbe8,
crash> struct blk_plug 0xffffffa0e918fbe8
struct blk_plug {
mq_list = {
next = 0xffffffa0e7f81a88,
prev = 0xffffffa0e7f81a88
},
cb_list = {
next = 0xffffffa0e918fbf8,
prev = 0xffffffa0e918fbf8
},
rq_count = 1,
multiple_queues = false
}
3. We then looked at when the request was allocated: start_time_ns shows it
was allocated at about 1062 seconds and then added to the plug of task 938.
But the dmesg timestamps and the scheduling statistics show that task 938
last ran at about 2183 seconds, and the scheduling information of the other
tasks on cpu0 confirms that scheduling itself is normal. Something is wrong
if task 938 has gone that long without flushing the request out of its plug.
crash> struct request ffffffa0e7f81a40 | grep start_time_ns
start_time_ns = 1062021311093,
io_start_time_ns = 0,
crash> log | tail
[ 2183.803249] do_swap_page+0x1e4/0x968
[ 2183.808028] __handle_mm_fault+0xc18/0x10a8
[ 2183.813328] handle_mm_fault+0x144/0x198
[ 2183.818368] do_page_fault+0x2d0/0x518
[ 2183.823235] do_translation_fault+0x3c/0x50
[ 2183.828535] do_mem_abort+0x3c/0x98
[ 2183.833140] el0_da+0x1c/0x20
[ 2183.837229] SMP: stopping secondary CPUs
[ 2183.842451] Starting crashdump kernel...
[ 2183.876226] Bye!
crash> task 938 | grep "exec_start\|sum_exec_runtime"
exec_start = 2183587475680,
sum_exec_runtime = 1120286568937,
prev_sum_exec_runtime = 1120278576499,
last_sum_exec_runtime = 0,
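(For reference: start_time_ns = 1062021311093 ns is about 1062 s, while
exec_start = 2183587475680 ns is about 2183.6 s, so by the time of the crash
the request had been sitting unflushed on the plug for roughly
2183.6 - 1062 = 1121 seconds.)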
4. Looking at the scheduler, schedule()->sched_submit_work() decides based on
task->state: for a task that is still runnable it returns early and the
requests in its plug are not flushed. So the request data on a runnable
task's plug can only be flushed out by the task itself.
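To illustrate why this matters, here is a minimal, self-contained userspace
model of the sched_submit_work() behaviour described above (an illustrative
sketch, not the kernel code; model_sched_submit_work and struct task are made
up for the example):

#include <stdio.h>

#define TASK_RUNNING 0

struct task {
	long state;		/* 0 == TASK_RUNNING, non-zero == about to sleep */
	int plugged_requests;	/* requests sitting on the task's blk_plug */
};

/* Models sched_submit_work(): the plug is only flushed for a task that is
 * really going to sleep; a runnable task returns early. */
static void model_sched_submit_work(struct task *tsk)
{
	if (tsk->state == TASK_RUNNING)
		return;			/* runnable: plug left untouched */
	tsk->plugged_requests = 0;	/* stands in for blk_schedule_flush_plug() */
}

int main(void)
{
	/* Task 938: runnable, looping on cond_resched(), one plugged request. */
	struct task t938 = { .state = TASK_RUNNING, .plugged_requests = 1 };

	for (int i = 0; i < 1000; i++)	/* many cond_resched()/schedule() calls */
		model_sched_submit_work(&t938);

	printf("requests still plugged: %d\n", t938.plugged_requests); /* prints 1 */
	return 0;
}
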
5. Combining the call stack of task 938 with the X0 register value saved in
its interrupt context, we recovered the swp_entry_t argument passed to
get_swap_device() and converted it to the entry value. The get_swap_device()
prototype, the saved context and the disassembly of get_swap_device() are
listed below:
struct swap_info_struct *get_swap_device(swp_entry_t entry)
====================================
#27 [ffffffa0e918fa90] el1_irq+0xb4 at ffffffc0100833b4
PC: ffffffc01022df04 [get_swap_device+36]
LR: ffffffc01022dfb4 [__swap_duplicate+28]
SP: ffffffa0e918faa0 PSTATE: 20000005
X29: ffffffa0e918faa0 X28: ffffffc01151e000 X27: 0000000000000000
X26: ffffffa0f0d8cfc8 X25: 0000007f9e57a000 X24: 0000000000100cca
X23: 0000000000000040 X22: 00000000001d1ca6 X21: 0000000000007b40
X20: 00000000001d1ca6 X19: ffffffff00502800 X18: 0000000000000000
X17: 0000000000000000 X16: 0000000000000000 X15: 0000000000000000
X14: 0000000000000000 X13: 0000000000000001 X12: 0000000000000000
X11: 0000000000000001 X10: 0000000000000bc0 X9: ffffffa0e918fa50
X8: ffffffa0eb00d260 X7: ffffffa0e9875600 X6: 00000000ffffffff
X5: 00000000000002af X4: ffffffa0e918fa80 X3: 0000000000000002
X2: 0000000000000000 X1: ffffffc0117caf60 X0: 00000000001d1ca6  <<<<---
ffffffa0e918fa90: ffffffa0e918faa0 get_swap_device+36
#28 [ffffffa0e918faa0] get_swap_device+0x20 at ffffffc01022df00
ffffffa0e918faa0: ffffffa0e918fab0 __swap_duplicate+28
#29 [ffffffa0e918fab0] __swap_duplicate+0x18 at ffffffc01022dfb0
============================
crash> dis -rx get_swap_device+36
0xffffffc01022dee0 <get_swap_device>: cbz x0, 0xffffffc01022df80
<get_swap_device+0xa0>
0xffffffc01022dee4 <get_swap_device+0x4>: adrp x1, 0xffffffc0117ca000
<memblock_reserved_init_regions+0x1500>
0xffffffc01022dee8 <get_swap_device+0x8>: add x1, x1, #0xf60
0xffffffc01022deec <get_swap_device+0xc>: stp x29, x30, [sp,#-16]!
0xffffffc01022def0 <get_swap_device+0x10>: lsr x2, x0, #58
0xffffffc01022def4 <get_swap_device+0x14>: mov x29, sp
0xffffffc01022def8 <get_swap_device+0x18>: ldr w3, [x1,#4]
0xffffffc01022defc <get_swap_device+0x1c>: cmp w3, w2
0xffffffc01022df00 <get_swap_device+0x20>: b.ls 0xffffffc01022df40
<get_swap_device+0x60>
0xffffffc01022df04 <get_swap_device+0x24>: dmb ishld
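(For reference: the saved X0 gives entry.val = 0x1d1ca6. The "lsr x2, x0, #58"
above is the swp_type() extraction, so the swap type is 0 and the offset into
swap_map is 0x1d1ca6, which is the offset used in step 6 below.)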
6. Combined with the swap device information, we looked up the swap_map value
corresponding to the entry and found it to be 0x41, i.e. SWAP_HAS_CACHE | 1.
crash> swap
SWAP_INFO_STRUCT TYPE SIZE USED PCT PRI FILENAME
ffffffa0ebae9a00 PARTITION 9693180k 2713508k 27% -2 /dev/sda6
crash> struct swap_info_struct ffffffa0ebae9a00 | grep swap_map
swap_map = 0xffffffc012e01000 "?",
frontswap_map = 0xffffffa0ea580000,
crash> px (0xffffffc012e01000+0x1d1ca6)
$4 = 0xffffffc012fd2ca6
crash> rd -8 0xffffffc012fd2ca6
ffffffc012fd2ca6: 41 A
7. Plugging 0x41 into the logic of __read_swap_cache_async(),
swapcache_prepare() returns -EEXIST. As long as the swap_map value for the
entry stays at 0x41, swapcache_prepare() returns -EEXIST every time around
the loop, so task 938 spins forever.
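As a rough, self-contained userspace model of why this spins forever (an
illustrative sketch, not the kernel source; SWAP_HAS_CACHE is 0x40 in
include/linux/swap.h, model_swapcache_prepare is made up for the example):

#include <errno.h>
#include <stdio.h>

#define SWAP_HAS_CACHE 0x40	/* as in include/linux/swap.h */

/* Models the swapcache_prepare()/__swap_duplicate() outcome for one swap_map
 * byte: the low bits are the reference count, 0x40 means the swapcache slot
 * is already claimed. */
static int model_swapcache_prepare(unsigned char map)
{
	unsigned char has_cache = map & SWAP_HAS_CACHE;
	unsigned char count = map & ~SWAP_HAS_CACHE;

	if (!has_cache && count)
		return 0;		/* we claim the cache slot */
	if (has_cache)
		return -EEXIST;		/* someone else already claimed it */
	return -ENOENT;			/* entry no longer in use */
}

int main(void)
{
	unsigned char swap_map_entry = 0x41;	/* the value read from the vmcore */
	int spins = 0;

	/* __read_swap_cache_async()-style retry: on -EEXIST it expects whoever
	 * set SWAP_HAS_CACHE to finish adding the page to the swapcache, so it
	 * calls cond_resched() and tries again.  Neither the real loop nor
	 * cond_resched() flushes the caller's own blk_plug, so if the other
	 * task is stuck behind that plugged request, the value never changes. */
	while (model_swapcache_prepare(swap_map_entry) == -EEXIST) {
		if (++spins > 5) {
			printf("still -EEXIST after %d tries; would spin forever\n", spins);
			break;
		}
	}
	return 0;
}
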
8. A swap_map value of 0x41 means the swapcache page for this entry has not
been added successfully yet. Once it is added, __read_swap_cache_async()
would find the page in swap_address_space and there would be no infinite
loop.
9. Based on the above, we believe the task that was supposed to add the
swapcache page for this entry is itself blocked. Looking at the stacks of the
other tasks, several of them entered direct memory reclaim (and its I/O
flushing) in
swap_cluster_readahead->read_swap_cache_async->swap_readpage->get_swap_bio,
and are blocked by blk-wbt because of the uncompleted request described above.
So, as far as we can tell, there is a circular dependency, and that is the
root cause of the hang. Task 938 loops in swapcache_prepare() waiting for the
page to be added to the swapcache; the task that would read the data from the
swap area into that page is blocked on I/O while contending for blk-wbt or
hctx->tags resources; and that I/O cannot make progress because the blocking
request is still held in task 938's blk_plug and is never flushed, which is
also what is blocking the I/O of the other tasks.
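To spell out the cycle as we understand it:

  task 938: loops in __read_swap_cache_async() on -EEXIST, waiting for the
            swapcache page to show up; being runnable, it never flushes its
            blk_plug;
  the task that should add that page: blocked in blk-wbt / tag allocation,
            waiting for the outstanding request to complete;
  the outstanding request: sits in task 938's blk_plug, and only task 938
            can flush it, which it never does.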
> And how does the switch to GFP_NOIO fix this? Simply by avoiding
> direct reclaim altogether?
From the kernel stacks we collected, the tasks that could be blocking task
938 are mostly stuck in
swap_cluster_readahead->read_swap_cache_async->swap_readpage->get_swap_bio:
when the allocation in get_swap_bio falls into direct memory reclaim, the
task ends up in D state flushing I/O that is stuck behind the request on
task 938's blk_plug. So yes, from an engineering point of view this is only
an urgent workaround for this case. It lets the allocation react to memory
pressure without issuing reclaim I/O, reduces the readahead reads, and the
entry for the real orig_pte still gets a second chance to be read via
read_swap_cache_async(). That effectively reduces the probability of the
hang, but risks remain: if the failing entry is not the one corresponding to
orig_pte, it is not a big problem, but if it is the orig_pte entry, returning
-ENOMEM may cause serious problems. Fixing this potential circular dependency
fundamentally is beyond my ability, so I ask the community to discuss and
research a proper fix.
On Sat, Apr 3, 2021 at 8:44 AM Andrew Morton <akpm@linux-foundation.org>
wrote:
> On Fri, 2 Apr 2021 15:03:37 +0800 Stillinux <stillinux@gmail.com> wrote:
>
> > In the case of high system memory and load pressure, we ran ltp test
> > and found that the system was stuck, the direct memory reclaim was
> > all stuck in io_schedule, the waiting request was stuck in the blk_plug
> > flow of one process, and this process fell into an infinite loop.
> > not do the action of brushing out the request.
> >
> > The call flow of this process is swap_cluster_readahead.
> > Use blk_start/finish_plug for blk_plug operation,
> > flow swap_cluster_readahead->__read_swap_cache_async->swapcache_prepare.
> > When swapcache_prepare return -EEXIST, it will fall into an infinite
> loop,
> > even if cond_resched is called, but according to the schedule,
> > sched_submit_work will be based on tsk->state, and will not flash out
> > the blk_plug request, so will hang io, causing the overall system hang.
> >
> > For the first time involving the swap part, there is no good way to fix
> > the problem from the fundamental problem. In order to solve the
> > engineering situation, we chose to make swap_cluster_readahead aware of
> > the memory pressure situation as soon as possible, and do io_schedule to
> > flush out the blk_plug request, thereby changing the allocation flag in
> > swap_readpage to GFP_NOIO , No longer do the memory reclaim of flush io.
> > Although system operating normally, but not the most fundamental way.
> >
>
> Thanks.
>
> I'm not understanding why swapcache_prepare() repeatedly returns
> -EEXIST in this situation?
>
> And how does the switch to GFP_NOIO fix this? Simply by avoiding
> direct reclaim altogether?
>
> > ---
> > mm/page_io.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index c493ce9ebcf5..87392ffabb12 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -403,7 +403,7 @@ int swap_readpage(struct page *page, bool
> synchronous)
> > }
> >
> > ret = 0;
> > - bio = bio_alloc(GFP_KERNEL, 1);
> > + bio = bio_alloc(GFP_NOIO, 1);
> > bio_set_dev(bio, sis->bdev);
> > bio->bi_opf = REQ_OP_READ;
> > bio->bi_iter.bi_sector = swap_page_sector(page);
>
>