From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f200.google.com (mail-io0-f200.google.com [209.85.223.200]) by kanga.kvack.org (Postfix) with ESMTP id 40FE86B0033 for ; Sun, 3 Dec 2017 08:23:08 -0500 (EST) Received: by mail-io0-f200.google.com with SMTP id z195so13327244iof.21 for ; Sun, 03 Dec 2017 05:23:08 -0800 (PST) Received: from huawei.com (szxga02-in.huawei.com. [45.249.212.188]) by mx.google.com with ESMTPS id 1si8062403iow.145.2017.12.03.05.23.06 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 03 Dec 2017 05:23:07 -0800 (PST) From: gengdongjiu Subject: [question] handle the page table RAS error Date: Sun, 3 Dec 2017 13:22:25 +0000 Message-ID: <0184EA26B2509940AA629AE1405DD7F2019C8B36@DGGEMA503-MBS.china.huawei.com> Content-Language: zh-CN Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: "tony.luck@intel.com" , Naoya Horiguchi , "ak@linux.intel.com" , "npiggin@gmail.com" , "vbabka@suse.cz" , "akpm@linux-foundation.org" Cc: "linux-mm@kvack.org" , Linux Kernel Mailing List , "wangxiongfeng (C)" , Huangshaoyu , Wuquanming Hi all, Sorry to disturb you. Now the ARM64 has supported the RAS, when enabling= this feature, we encounter a issue. If the user space application happen p= age table RAS error, Memory error handler(memory_failure()) will do nothing except make a poison= ed page flag, and fault handler in arch/arm64/mm/fault.c will deliver a sig= nal to kill this application. when this application exit, it will call unmap_vmas () to rele= ase his vma resource, but here it will touch the error page table again, th= en will trigger RAS error again, so this application cannot be killed and system will be panic, the log is show= n in [2]. As shown the stack in [1], unmap_page_range() will touch the error page tab= le, so system will panic, does this panic behavior is expected? How the x8= 6 handle the page table RAS error? If user space application happen page table RAS error, I think t= he expected behavior should be killing the application instead of panic OS.= In current code, when release=20 application vma resource, I do not see it will check whether table page is = poisoned, could you give me some suggestion about how to handle this case? = Thanks a lot.=20 [1]: get_signal() do_group_exit() mmput() exit_mmap() unmap_vmas() unmap_single_vma() unmap_page_range() =09 [2] [ 676.669053] Synchronous External Abort: level 0 (translation table walk)= (0x82000214) at 0x0000000033ff7008 [ 676.686469] Memory failure: 0xcd4b: already hardware poisoned=20 [ 676.700652] Synchronous External Abort: synchronous external abort (0x96= 000410) at 0x0000000033ff7008 [ 676.723301] Internal error: : 96000410 [#1] PREEMPT SMP=20 [ 676.723616] Modules linked in: inject_memory_error(O) [ 676.724601] CPU: 0 PID: 1506 Comm: mca-recover Tainted: G O = 4.14.0-rc8-00019-g5b5c6f4-dirty #109 [ 676.724844] task: ffff80000cd41d00 task.stack: ffff000009b30000=20 [ 676.726616] PC is at unmap_page_range+0x78/0x6fc=20 [ 676.726960] LR is at unmap_single_vma+0x88/0xdc=20 [ 676.727122] pc : [] lr : [] pstate: = 80400149 [ 676.727227] sp : ffff000009b339b0 [ 676.727348] x29: ffff000009b339b0 x28: ffff80000cd41d00=20 [ 676.727653] x27: 0000000000000000 x26: ffff80000cd42410=20 [ 676.727919] x25: ffff80000cd41d00 x24: ffff80000cd1e180=20 [ 676.728161] x23: ffff80000ce22300 x22: 0000000000000000=20 [ 676.728407] x21: ffff000009b33b28 x20: 0000000000400000=20 [ 676.728642] x19: ffff80000cd1e180 x18: 000000000000016d=20 [ 676.728875] x17: 0000000000000190 x16: 0000000000000064=20 [ 676.729117] x15: 0000000000000339 x14: 0000000000000000=20 [ 676.729344] x13: 00000000000061a8 x12: 0000000000000339=20 [ 676.729582] x11: 0000000000000018 x10: 0000000000000a80=20 [ 676.729829] x9 : ffff000009b33c60 x8 : ffff80000cd427e0=20 [ 676.730065] x7 : ffff000009b33de8 x6 : 00000000004a2000=20 [ 676.730287] x5 : 0000000000400000 x4 : ffff80000cd4b000=20 [ 676.730517] x3 : 00000000004a1fff x2 : 0000008000000000=20 [ 676.730741] x1 : 0000007fffffffff x0 : 0000008000000000=20 [ 676.731101] Process mca-recover (pid: 1506, stack limit =3D 0xffff000009= b30000)=20 [ 676.731281] Call trace: [ 676.734196] [] unmap_page_range+0x78/0x6fc=20 [ 676.734539] [] unmap_single_vma+0x88/0xdc=20 [ 676.734892] [] unmap_vmas+0x68/0xb4=20 [ 676.735456] [] exit_mmap+0x90/0x140=20 [ 676.736468] [] mmput+0x60/0x118=20 [ 676.736791] [] do_exit+0x240/0x9cc=20 [ 676.736997] [] do_group_exit+0x38/0x98=20 [ 676.737384] [] get_signal+0x1ec/0x548=20 [ 676.738313] [] do_signal+0x7c/0x668=20 [ 676.738617] [] do_notify_resume+0xcc/0x114=20 [ 676.740983] [] work_pending+0x8/0x10=20 [ 676.741360] Code: f94043a4 f9404ba2 f94037a3 d1000441 (f9400080)=20 [ 676.741745] ---[ end trace e42d453027313552 ]---=20 [ 676.804174] Fixing recursive fault but reboot is needed! [ 677.462082] Memory failure: 0xcd4b: already hardware poisoned -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org