linux-mm.kvack.org archive mirror
* [Qestion] UCE on pud-sized hugepage lead to kernel panic on lts5.10
@ 2022-12-14  1:33 mawupeng
  2022-12-14  8:52 ` Greg KH
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: mawupeng @ 2022-12-14  1:33 UTC (permalink / raw)
  To: naoya.horiguchi
  Cc: mawupeng1, catalin.marinas, gregkh, akpm, linux-mm, linux-kernel

On the current arm64 stable 5.10 kernel (v5.10.158), if a UCE happens on a
pud-sized hugepage, the kernel panics, because memory failure handling has
not supported this kind of error since commit 31286a8484a8 ("mm: hwpoison:
disable memory error handling on 1GB hugepage").

The latest kernel (v6.0) can handle this UCE since commit 6f4614886baa ("mm,
hwpoison: enable memory error handling on 1GB hugepage"). We are trying to
backport this patchset to stable 5.10; however, too many other patches would
have to be backported as well, since there are huge differences between 5.10
and 6.0. The full patch list is shown at the end of this mail.

We do not think backporting all of these patches is feasible for stable
5.10. Is there a better way to fix this problem?

The kernel panic call trace:

  Kernel panic - not syncing: Fatal hardware error!
  CPU: 0 PID: 15 Comm: kworker/0:1 Not tainted 5.10.158_stable_5_10 #1
  Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V3.26.01 06/14/2019
  Workqueue: kacpi_notify acpi_os_execute_deferred
  Call trace:
   dump_backtrace+0x0/0x1ec
   show_stack+0x20/0x30
   dump_stack+0xd0/0x128
   panic+0x154/0x36c
   __raw_spin_lock_irqsave.constprop.0+0x0/0xb0
   ghes_proc+0x148/0x200
   ghes_notify_hed+0x58/0xd4
   blocking_notifier_call_chain+0x74/0xb0
   acpi_hed_notify+0x28/0x3c
   acpi_device_notify+0x24/0x30
   acpi_ev_notify_dispatch+0x68/0x78
   acpi_os_execute_deferred+0x24/0x3c
   process_one_work+0x1d4/0x4b0
   worker_thread+0x180/0x430
   kthread+0x118/0x120
   ret_from_fork+0x10/0x18
  SMP: stopping secondary CPUs
  Kernel Offset: 0x4ed64eb80000 from 0xffff800010000000
  PHYS_OFFSET: 0xffffd24300000000
  CPU features: 0x00000002,62208a38
  Memory Limit: none
  Rebooting in 30 seconds..

Our backport list (bug fixes not included):

  mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page
  mm,hwpoison: take free pages off the buddy freelists
  mm,hwpoison: drop unneeded pcplist draining
  mm,hwpoison: refactor get_any_page
  mm,hwpoison: disable pcplists before grabbing a refcount
  mm,hwpoison: remove drain_all_pages from shake_page
  hugetlb: use page.private for hugetlb specific page flags
  hugetlb: convert page_huge_active() HPageMigratable flag
  hugetlb: convert PageHugeTemporary() to HPageTemporary flag
  hugetlb: convert PageHugeFreed to HPageFreed flag
  mm,hwpoison: fix race with hugetlb page allocation
  mm: hugetlb: gather discrete indexes of tail page
  hugetlb: create remove_hugetlb_page() to separate functionality
  mm,hwpoison: make get_hwpoison_page() call get_any_page()
  mm/hwpoison: disable pcp for page_handle_poison()
  mm/hwpoison: mf_mutex for soft offline and unpoison
  mm/hwpoison: remove MF_MSG_BUDDY_2ND and MF_MSG_POISONED_HUGE
  mm/hwpoison: fix unpoison_memory()
  mm/memory-failure.c: fix race with changing page compound again
  mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
  mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()
  mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry
  mm, hwpoison, hugetlb: support saving mechanism of raw error pages
  mm/memory-failure.c: simplify num_poisoned_pages_dec
  mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
  mm, hwpoison: set PG_hwpoison for busy hugetlb pages
  mm, hwpoison: make __page_handle_poison returns int
  mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage




* Re: [Qestion] UCE on pud-sized hugepage lead to kernel panic on lts5.10
  2022-12-14  1:33 [Qestion] UCE on pud-sized hugepage lead to kernel panic on lts5.10 mawupeng
@ 2022-12-14  8:52 ` Greg KH
  2022-12-14 17:04 ` Matthew Wilcox
  2022-12-15  1:01 ` HORIGUCHI NAOYA(堀口 直也)
  2 siblings, 0 replies; 5+ messages in thread
From: Greg KH @ 2022-12-14  8:52 UTC (permalink / raw)
  To: mawupeng; +Cc: naoya.horiguchi, catalin.marinas, akpm, linux-mm, linux-kernel

On Wed, Dec 14, 2022 at 09:33:10AM +0800, mawupeng wrote:
> On the current arm64 stable 5.10 kernel (v5.10.158), if a UCE happens on a
> pud-sized hugepage, the kernel panics, because memory failure handling has
> not supported this kind of error since commit 31286a8484a8 ("mm: hwpoison:
> disable memory error handling on 1GB hugepage").
>
> The latest kernel (v6.0) can handle this UCE since commit 6f4614886baa ("mm,
> hwpoison: enable memory error handling on 1GB hugepage"). We are trying to
> backport this patchset to stable 5.10; however, too many other patches would
> have to be backported as well, since there are huge differences between 5.10
> and 6.0. The full patch list is shown at the end of this mail.
>
> We do not think backporting all of these patches is feasible for stable
> 5.10. Is there a better way to fix this problem?

Please just upgrade to 6.0 (or 6.1 as 6.0 will be end-of-life in a week
or so).  That is the simplest and easiest way to resolve the issue,
right?

Or what about 5.15.y?

thanks,

greg k-h



* Re: [Qestion] UCE on pud-sized hugepage lead to kernel panic on lts5.10
  2022-12-14  1:33 [Qestion] UCE on pud-sized hugepage lead to kernel panic on lts5.10 mawupeng
  2022-12-14  8:52 ` Greg KH
@ 2022-12-14 17:04 ` Matthew Wilcox
  2022-12-15  1:34   ` mawupeng
  2022-12-15  1:01 ` HORIGUCHI NAOYA(堀口 直也)
  2 siblings, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2022-12-14 17:04 UTC (permalink / raw)
  To: mawupeng
  Cc: naoya.horiguchi, catalin.marinas, gregkh, akpm, linux-mm, linux-kernel

On Wed, Dec 14, 2022 at 09:33:10AM +0800, mawupeng wrote:
> On the current arm64 stable 5.10 kernel (v5.10.158), if a UCE happens on a

What's a UCE?



* Re: [Qestion] UCE on pud-sized hugepage lead to kernel panic on lts5.10
  2022-12-14  1:33 [Qestion] UCE on pud-sized hugepage lead to kernel panic on lts5.10 mawupeng
  2022-12-14  8:52 ` Greg KH
  2022-12-14 17:04 ` Matthew Wilcox
@ 2022-12-15  1:01 ` HORIGUCHI NAOYA(堀口 直也)
  2 siblings, 0 replies; 5+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-12-15  1:01 UTC (permalink / raw)
  To: mawupeng; +Cc: catalin.marinas, gregkh, akpm, linux-mm, linux-kernel

On Wed, Dec 14, 2022 at 09:33:10AM +0800, mawupeng wrote:
> On the current arm64 stable 5.10 kernel (v5.10.158), if a UCE happens on a
> pud-sized hugepage, the kernel panics, because memory failure handling has
> not supported this kind of error since commit 31286a8484a8 ("mm: hwpoison:
> disable memory error handling on 1GB hugepage").
>
> The latest kernel (v6.0) can handle this UCE since commit 6f4614886baa ("mm,
> hwpoison: enable memory error handling on 1GB hugepage"). We are trying to
> backport this patchset to stable 5.10; however, too many other patches would
> have to be backported as well, since there are huge differences between 5.10
> and 6.0. The full patch list is shown at the end of this mail.
>
> We do not think backporting all of these patches is feasible for stable
> 5.10. Is there a better way to fix this problem?

Sorry, I have no good idea about this. I think that backporting to stable
kernels is done only for small bug fixes, which is not the case for enabling
the handling of uncorrected errors on 1GB hugepages.  So, as Greg commented,
using the latest (stable) kernel seems to me the next-best option.

Thanks,
Naoya Horiguchi


* Re: [Qestion] UCE on pud-sized hugepage lead to kernel panic on lts5.10
  2022-12-14 17:04 ` Matthew Wilcox
@ 2022-12-15  1:34   ` mawupeng
  0 siblings, 0 replies; 5+ messages in thread
From: mawupeng @ 2022-12-15  1:34 UTC (permalink / raw)
  To: willy
  Cc: mawupeng1, naoya.horiguchi, catalin.marinas, gregkh, akpm,
	linux-mm, linux-kernel



On 2022/12/15 1:04, Matthew Wilcox wrote:
> On Wed, Dec 14, 2022 at 09:33:10AM +0800, mawupeng wrote:
>> On the current arm64 stable 5.10 kernel (v5.10.158), if a UCE happens on a
> 
> What's a UCE?
>

uncorrected error.


