linux-mm.kvack.org archive mirror
* [bug] aarch64 host no longer boots after 767507654c22 ("arch_numa: switch over to numa_memblks")
@ 2024-10-29 12:47 Jan Stancek
  2024-10-29 15:07 ` Zi Yan
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Stancek @ 2024-10-29 12:47 UTC (permalink / raw)
  To: rppt, ziy, linux-mm, Linux ARM
  Cc: Jonathan.Cameron, dan.j.williams, David Hildenbrand

Hi,

I'm seeing a regression on an Nvidia IGX system, which no longer boots.

Bisect points at commit 767507654c22 ("arch_numa: switch over to numa_memblks").
It hangs very early, with either 4k or 64k pages, and no kernel messages are printed:

EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...
<hangs here>

Here's a log from a successful boot with the previous commit:
https://people.redhat.com/jstancek/aarch64_numa_boot/console-log-good.txt
and config: https://people.redhat.com/jstancek/aarch64_numa_boot/config

# lscpu
Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Byte Order:             Little Endian
CPU(s):                   12
  On-line CPU(s) list:    0-11
Vendor ID:                ARM
  BIOS Vendor ID:         NVIDIA
  Model name:             Cortex-A78AE
    BIOS Model name:      Not Specified Not Specified CPU @ 0.0GHz
    BIOS CPU family:      257
    Model:                1
    Thread(s) per core:   1
    Core(s) per cluster:  12
    Socket(s):            1
    Cluster(s):           1
    Stepping:             r0p1
    CPU(s) scaling MHz:   100%
    CPU max MHz:          1971.2000
    CPU min MHz:          115.2000
    BogoMIPS:             62.50
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp uscat ilrcpc flagm paca pacg
Caches (sum of all):
  L1d:                    768 KiB (12 instances)
  L1i:                    768 KiB (12 instances)
  L2:                     3 MiB (12 instances)
  L3:                     6 MiB (3 instances)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-11
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Mitigation; CSV2, BHB
  Srbds:                  Not affected
  Tsx async abort:        Not affected

Regards,
Jan




* Re: [bug] aarch64 host no longer boots after 767507654c22 ("arch_numa: switch over to numa_memblks")
  2024-10-29 12:47 [bug] aarch64 host no longer boots after 767507654c22 ("arch_numa: switch over to numa_memblks") Jan Stancek
@ 2024-10-29 15:07 ` Zi Yan
  2024-10-29 15:43   ` Jan Stancek
  0 siblings, 1 reply; 7+ messages in thread
From: Zi Yan @ 2024-10-29 15:07 UTC (permalink / raw)
  To: Jan Stancek
  Cc: rppt, linux-mm, Linux ARM, Jonathan.Cameron, dan.j.williams,
	David Hildenbrand, linux-tegra, Thierry Reding, Jonathan Hunter

+tegra mailing list and maintainers

On 29 Oct 2024, at 8:47, Jan Stancek wrote:

> Hi,
>
> I'm seeing a regression on Nvidia IGX system, which no longer boots.
>
> bisect points at commit 767507654c22 ("arch_numa: switch over to numa_memblks").
> It hangs very early, with 4k or 64k pages, with no kernel messages printed:
>
> EFI stub: Booting Linux Kernel...
> EFI stub: Using DTB from configuration table
> EFI stub: Exiting boot services...
> <hangs here>
>

Is it possible to get earlycon output? It is hard to debug with no
information other than that the kernel fails to boot.

Since the previous commit boots and I assume both kernels are compiled
with the same gcc toolchain, this should not be caused by the binutils
bug in 2.42[1]. Is your binutils version 2.42?

Thanks.


[1] https://sourceware.org/bugzilla/show_bug.cgi?id=31924

Best Regards,
Yan, Zi



* Re: [bug] aarch64 host no longer boots after 767507654c22 ("arch_numa: switch over to numa_memblks")
  2024-10-29 15:07 ` Zi Yan
@ 2024-10-29 15:43   ` Jan Stancek
  2024-10-29 16:20     ` Mike Rapoport
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Stancek @ 2024-10-29 15:43 UTC (permalink / raw)
  To: Zi Yan
  Cc: rppt, linux-mm, Linux ARM, Jonathan.Cameron, dan.j.williams,
	David Hildenbrand, linux-tegra, Thierry Reding, Jonathan Hunter

On Tue, Oct 29, 2024 at 4:07 PM Zi Yan <ziy@nvidia.com> wrote:
>
> +tegra mailing list and maintainers
>
> On 29 Oct 2024, at 8:47, Jan Stancek wrote:
>
> > Hi,
> >
> > I'm seeing a regression on Nvidia IGX system, which no longer boots.
> >
> > bisect points at commit 767507654c22 ("arch_numa: switch over to numa_memblks").
> > It hangs very early, with 4k or 64k pages, with no kernel messages printed:
> >
> > EFI stub: Booting Linux Kernel...
> > EFI stub: Using DTB from configuration table
> > EFI stub: Exiting boot services...
> > <hangs here>
> >
>
> Is it possible to have earlycon output? It is hard to debug without any
> information except kernel fails to boot.

I know it was a long shot; so far I haven't had any luck getting it to work.

>
> Since the previous commit boots and I assume both kernels are compiled
> with the same gcc toolchain, this should not be caused by the binutils
> bug in 2.42[1]. Is your binutils version 2.42?

Yes, both are compiled locally with the same toolchain; the binutils version is 2.41, not 2.42.


* Re: [bug] aarch64 host no longer boots after 767507654c22 ("arch_numa: switch over to numa_memblks")
  2024-10-29 15:43   ` Jan Stancek
@ 2024-10-29 16:20     ` Mike Rapoport
  2024-10-29 21:03       ` Jan Stancek
  0 siblings, 1 reply; 7+ messages in thread
From: Mike Rapoport @ 2024-10-29 16:20 UTC (permalink / raw)
  To: Jan Stancek
  Cc: Zi Yan, linux-mm, Linux ARM, Jonathan.Cameron, dan.j.williams,
	David Hildenbrand, linux-tegra, Thierry Reding, Jonathan Hunter

On Tue, Oct 29, 2024 at 04:43:39PM +0100, Jan Stancek wrote:
> On Tue, Oct 29, 2024 at 4:07 PM Zi Yan <ziy@nvidia.com> wrote:
> >
> > +tegra mailing list and maintainers
> >
> > On 29 Oct 2024, at 8:47, Jan Stancek wrote:
> >
> > > Hi,
> > >
> > > I'm seeing a regression on Nvidia IGX system, which no longer boots.
> > >
> > > bisect points at commit 767507654c22 ("arch_numa: switch over to numa_memblks").
> > > It hangs very early, with 4k or 64k pages, with no kernel messages printed:
> > >
> > > EFI stub: Booting Linux Kernel...
> > > EFI stub: Using DTB from configuration table
> > > EFI stub: Exiting boot services...
> > > <hangs here>
> > >
> >
> > Is it possible to have earlycon output? It is hard to debug without any
> > information except kernel fails to boot.
> 
> I know it was a long shot, so far I haven't had luck getting it to work.

Does it boot with numa=off and numa=fake?

In the log from the successful boot, there seems to be no NUMA information
in the device tree. Can you send the device tree as well, please?

-- 
Sincerely yours,
Mike.



* Re: [bug] aarch64 host no longer boots after 767507654c22 ("arch_numa: switch over to numa_memblks")
  2024-10-29 16:20     ` Mike Rapoport
@ 2024-10-29 21:03       ` Jan Stancek
  2024-10-30 13:08         ` Mike Rapoport
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Stancek @ 2024-10-29 21:03 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Zi Yan, linux-mm, Linux ARM, Jonathan.Cameron, dan.j.williams,
	David Hildenbrand, linux-tegra, Thierry Reding, Jonathan Hunter

On Tue, Oct 29, 2024 at 5:24 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Oct 29, 2024 at 04:43:39PM +0100, Jan Stancek wrote:
> > On Tue, Oct 29, 2024 at 4:07 PM Zi Yan <ziy@nvidia.com> wrote:
> > >
> > > +tegra mailing list and maintainers
> > >
> > > On 29 Oct 2024, at 8:47, Jan Stancek wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm seeing a regression on Nvidia IGX system, which no longer boots.
> > > >
> > > > bisect points at commit 767507654c22 ("arch_numa: switch over to numa_memblks").
> > > > It hangs very early, with 4k or 64k pages, with no kernel messages printed:
> > > >
> > > > EFI stub: Booting Linux Kernel...
> > > > EFI stub: Using DTB from configuration table
> > > > EFI stub: Exiting boot services...
> > > > <hangs here>
> > > >
> > >
> > > Is it possible to have earlycon output? It is hard to debug without any
> > > information except kernel fails to boot.
> >
> > I know it was a long shot, so far I haven't had luck getting it to work.
>
> Does it boot with numa=off and numa=fake?

No, it doesn't.

>
> In the log from successful boot it seems there is no NUMA information in
> the device tree, can you send the device tree as well please?

https://people.redhat.com/jstancek/aarch64_numa_boot/device_tree

Regards,
Jan








* Re: [bug] aarch64 host no longer boots after 767507654c22 ("arch_numa: switch over to numa_memblks")
  2024-10-29 21:03       ` Jan Stancek
@ 2024-10-30 13:08         ` Mike Rapoport
  2024-10-30 21:50           ` Jan Stancek
  0 siblings, 1 reply; 7+ messages in thread
From: Mike Rapoport @ 2024-10-30 13:08 UTC (permalink / raw)
  To: Jan Stancek
  Cc: Zi Yan, linux-mm, Linux ARM, Jonathan.Cameron, dan.j.williams,
	David Hildenbrand, linux-tegra, Thierry Reding, Jonathan Hunter

On Tue, Oct 29, 2024 at 10:03:31PM +0100, Jan Stancek wrote:
> On Tue, Oct 29, 2024 at 5:24 PM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Tue, Oct 29, 2024 at 04:43:39PM +0100, Jan Stancek wrote:
> > > On Tue, Oct 29, 2024 at 4:07 PM Zi Yan <ziy@nvidia.com> wrote:
> > > >
> > > > +tegra mailing list and maintainers
> > > >
> > > > On 29 Oct 2024, at 8:47, Jan Stancek wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm seeing a regression on Nvidia IGX system, which no longer boots.
> > > > >
> > > > > bisect points at commit 767507654c22 ("arch_numa: switch over to numa_memblks").
> > > > > It hangs very early, with 4k or 64k pages, with no kernel messages printed:
> > > > >
> > > > > EFI stub: Booting Linux Kernel...
> > > > > EFI stub: Using DTB from configuration table
> > > > > EFI stub: Exiting boot services...
> > > > > <hangs here>
> > > > >
> > > >
> > > > Is it possible to have earlycon output? It is hard to debug without any
> > > > information except kernel fails to boot.
> > >
> > > I know it was a long shot, so far I haven't had luck getting it to work.
> >
> > Does it boot with numa=off and numa=fake?
> 
> No, it doesn't.
 
No ideas without the logs, sorry.

> > In the log from successful boot it seems there is no NUMA information in
> > the device tree, can you send the device tree as well please?
> 
> https://people.redhat.com/jstancek/aarch64_numa_boot/device_tree
> 
> Regards,
> Jan

-- 
Sincerely yours,
Mike.



* Re: [bug] aarch64 host no longer boots after 767507654c22 ("arch_numa: switch over to numa_memblks")
  2024-10-30 13:08         ` Mike Rapoport
@ 2024-10-30 21:50           ` Jan Stancek
  0 siblings, 0 replies; 7+ messages in thread
From: Jan Stancek @ 2024-10-30 21:50 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Zi Yan, linux-mm, Linux ARM, Jonathan.Cameron, dan.j.williams,
	David Hildenbrand, linux-tegra, Thierry Reding, Jonathan Hunter

On Wed, Oct 30, 2024 at 2:12 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Oct 29, 2024 at 10:03:31PM +0100, Jan Stancek wrote:
> > On Tue, Oct 29, 2024 at 5:24 PM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > On Tue, Oct 29, 2024 at 04:43:39PM +0100, Jan Stancek wrote:
> > > > On Tue, Oct 29, 2024 at 4:07 PM Zi Yan <ziy@nvidia.com> wrote:
> > > > >
> > > > > +tegra mailing list and maintainers
> > > > >
> > > > > On 29 Oct 2024, at 8:47, Jan Stancek wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'm seeing a regression on Nvidia IGX system, which no longer boots.
> > > > > >
> > > > > > bisect points at commit 767507654c22 ("arch_numa: switch over to numa_memblks").
> > > > > > It hangs very early, with 4k or 64k pages, with no kernel messages printed:
> > > > > >
> > > > > > EFI stub: Booting Linux Kernel...
> > > > > > EFI stub: Using DTB from configuration table
> > > > > > EFI stub: Exiting boot services...
> > > > > > <hangs here>
> > > > > >
> > > > >
> > > > > Is it possible to have earlycon output? It is hard to debug without any
> > > > > information except kernel fails to boot.
> > > >
> > > > I know it was a long shot, so far I haven't had luck getting it to work.
> > >
> > > Does it boot with numa=off and numa=fake?
> >
> > No, it doesn't.
>
> No ideas without the logs, sorry.

With some trial & error I narrowed it down, but as it turns out
the fix has already landed upstream today:
d95fb348f016 ("mm: numa_clear_kernel_node_hotplug: Add NUMA_NO_NODE
check for node id")

Regards,
Jan


