Re: Linux 6.18 amdgpu build error


* Re: Linux 6.18 amdgpu build error
       [not found]         ` <1f31b86d-283c-4878-92d0-ab90aed0c58d@roeck-us.net>
@ 2025-12-04  2:34           ` Shuah Khan
  2025-12-04  6:05             ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 8+ messages in thread
From: Shuah Khan @ 2025-12-04  2:34 UTC
  To: Linus Torvalds, akpm, david
  Cc: Alexander Deucher, Linux Kernel Mailing List, amd-gfx, dri-devel,
	Guenter Roeck, Linux Memory Management List, Shuah Khan

On 12/3/25 18:06, Guenter Roeck wrote:
> On 12/3/25 14:16, Shuah Khan wrote:

>>
>> CONFIG_RANDSTRUCT is disabled and so are the GCC_PLUGINS in my config.
> 
> I guess that would have been too easy...
> 
>> I am also seeing issues with cloning kernel.org repos on my system after
>> a recent update:
>>
>> remote: Enumerating objects: 11177736, done.
>> remote: Counting objects: 100% (1231/1231), done.
>> remote: Compressing objects: 100% (624/624), done.
>> remote: Total 11177736 (delta 855), reused 781 (delta 606), pack-reused 11176505 (from 1)
>> Receiving objects: 100% (11177736/11177736), 3.01 GiB | 7.10 MiB/s, done.
>> Resolving deltas: 100% (9198323/9198323), done.
>> fatal: did not receive expected object 0002003e951b5057c16de5a39140abcbf6e44e50
>> fatal: fetch-pack: invalid index-pack output
>>
> 

Linus, Andrew, and David,

Finally figured this out. I narrowed it down to the HAVE_GIGANTIC_FOLIOS
support that went into Linux 6.18-rc6 in this commit:

 From 39231e8d6ba7f794b566fd91ebd88c0834a23b98 Mon Sep 17 00:00:00 2001
From: "David Hildenbrand (Red Hat)" <david@kernel.org>
Date: Fri, 14 Nov 2025 22:49:20 +0100
Subject: [PATCH] mm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlb

This appears to be a larger change than its powerpc scope suggests. It broke
my workflow completely. I sent a revert so this doesn't cause problems for
others.

I can reproduce this problem on two systems. With this commit,
git fetch-pack fails when cloning large repos, and make hangs
or errors out of Makefile.build with Error: 139. These failures are
random: git clone fails after fetching 1% of the objects, and
make hangs while compiling random files.
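
A note on the "Error: 139" above: by the usual shell convention, an exit
status above 128 means the process was killed by signal (status - 128), so
139 would correspond to signal 11, which is SIGSEGV on Linux. A minimal
sketch of that arithmetic follows; the helper name describe_exit_status is
hypothetical and Linux signal numbering is assumed.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Map a shell-style exit status to a signal name where applicable.

    Assumption: the status follows the shell convention of reporting
    "killed by signal N" as 128 + N.
    """
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            # No signal with that number on this platform.
            return f"unknown signal {status - 128}"
    return f"exit code {status}"


print(describe_exit_status(139))
```

A crashing compiler process points at memory corruption or a kernel-side
problem at least as much as at the toolchain itself.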

The failures were random and confusing, which sent me down the path of
looking at the toolchain. Without this commit, I can clone and build
kernels on the two systems where I was seeing problems.

thanks,
-- Shuah





* Re: Linux 6.18 amdgpu build error
  2025-12-04  2:34           ` Linux 6.18 amdgpu build error Shuah Khan
@ 2025-12-04  6:05             ` David Hildenbrand (Red Hat)
  2025-12-04 17:40               ` Shuah Khan
  0 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-04  6:05 UTC
  To: Shuah Khan, Linus Torvalds, akpm
  Cc: Alexander Deucher, Linux Kernel Mailing List, amd-gfx, dri-devel,
	Guenter Roeck, Linux Memory Management List

On 12/4/25 03:34, Shuah Khan wrote:
> On 12/3/25 18:06, Guenter Roeck wrote:
>> On 12/3/25 14:16, Shuah Khan wrote:
> 
>>>
>>> CONFIG_RANDSTRUCT is disabled and so are the GCC_PLUGINS in my config.
>>
>> I guess that would have been too easy...
>>
>>> I am also seeing issues with cloning kernel.org repos on my system after
>>> a recent update:
>>>
>>> remote: Enumerating objects: 11177736, done.
>>> remote: Counting objects: 100% (1231/1231), done.
>>> remote: Compressing objects: 100% (624/624), done.
>>> remote: Total 11177736 (delta 855), reused 781 (delta 606), pack-reused 11176505 (from 1)
>>> Receiving objects: 100% (11177736/11177736), 3.01 GiB | 7.10 MiB/s, done.
>>> Resolving deltas: 100% (9198323/9198323), done.
>>> fatal: did not receive expected object 0002003e951b5057c16de5a39140abcbf6e44e50
>>> fatal: fetch-pack: invalid index-pack output
>>>
>>
> 
> Linus, Andrew, and David,
> 
> Finally figured this out. I narrowed it down to the HAVE_GIGANTIC_FOLIOS
> support that went into Linux 6.18-rc6 in this commit:
> 
>   From 39231e8d6ba7f794b566fd91ebd88c0834a23b98 Mon Sep 17 00:00:00 2001
> From: "David Hildenbrand (Red Hat)" <david@kernel.org>
> Date: Fri, 14 Nov 2025 22:49:20 +0100
> Subject: [PATCH] mm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlb
> 

Unsuspected and confusing :(

Let me take a look and reply on the revert.

-- 
Cheers

David



* Re: Linux 6.18 amdgpu build error
  2025-12-04  6:05             ` David Hildenbrand (Red Hat)
@ 2025-12-04 17:40               ` Shuah Khan
  2025-12-04 19:36                 ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Shuah Khan @ 2025-12-04 17:40 UTC
  To: David Hildenbrand (Red Hat), Linus Torvalds, akpm
  Cc: Alexander Deucher, Linux Kernel Mailing List, amd-gfx, dri-devel,
	Guenter Roeck, Linux Memory Management List, Shuah Khan

On 12/3/25 23:05, David Hildenbrand (Red Hat) wrote:
> On 12/4/25 03:34, Shuah Khan wrote:
>> On 12/3/25 18:06, Guenter Roeck wrote:
>>> On 12/3/25 14:16, Shuah Khan wrote:
>>
>>>>
>>>> CONFIG_RANDSTRUCT is disabled and so are the GCC_PLUGINS in my config.
>>>
>>> I guess that would have been too easy...
>>>
>>>> I am also seeing issues with cloning kernel.org repos on my system after
>>>> a recent update:
>>>>
>>>> remote: Enumerating objects: 11177736, done.
>>>> remote: Counting objects: 100% (1231/1231), done.
>>>> remote: Compressing objects: 100% (624/624), done.
>>>> remote: Total 11177736 (delta 855), reused 781 (delta 606), pack-reused 11176505 (from 1)
>>>> Receiving objects: 100% (11177736/11177736), 3.01 GiB | 7.10 MiB/s, done.
>>>> Resolving deltas: 100% (9198323/9198323), done.
>>>> fatal: did not receive expected object 0002003e951b5057c16de5a39140abcbf6e44e50
>>>> fatal: fetch-pack: invalid index-pack output
>>>>
>>>
>>
>> Linus, Andrew, and David,
>>
>> Finally figured this out. I narrowed it down to the HAVE_GIGANTIC_FOLIOS
>> support that went into Linux 6.18-rc6 in this commit:
>>
>>   From 39231e8d6ba7f794b566fd91ebd88c0834a23b98 Mon Sep 17 00:00:00 2001
>> From: "David Hildenbrand (Red Hat)" <david@kernel.org>
>> Date: Fri, 14 Nov 2025 22:49:20 +0100
>> Subject: [PATCH] mm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlb
>>
> 
> Unsuspected and confusing :(

This commit has impact on all architectures, not a narrow scoped
powerpc only thing -  it enables HAVE_GIGANTIC_FOLIOS on x86_64
and changes the common code that determines MAX_FOLIO_ORDER in
include/linux/mm.h

> 
> Let me take a look and reply on the revert.
> 

Sounds good. Reverting or finding a fix is good with me. It definitely
impacted two of my systems and the problem was introduced in
Linux 6.18-rc6 and is in Linux 6.18.

thanks,
-- Shuah




* Re: Linux 6.18 amdgpu build error
  2025-12-04 17:40               ` Shuah Khan
@ 2025-12-04 19:36                 ` Linus Torvalds
  2025-12-04 19:45                   ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2025-12-04 19:36 UTC
  To: Shuah Khan
  Cc: David Hildenbrand (Red Hat),
	akpm, Alexander Deucher, Linux Kernel Mailing List, amd-gfx,
	dri-devel, Guenter Roeck, Linux Memory Management List

On Thu, 4 Dec 2025 at 09:40, Shuah Khan <skhan@linuxfoundation.org> wrote:
>
> This commit has impact on all architectures, not a narrow scoped
> powerpc only thing -  it enables HAVE_GIGANTIC_FOLIOS on x86_64
> and changes the common code that determines MAX_FOLIO_ORDER in
> include/linux/mm.h

So I suspect your bisection might not have worked out, and there might
be two different things going on.

In particular, hugepages were broken in 6.18-rc6 due to commit
adfb6609c680 ("mm/huge_memory: initialise the tags of the huge zero
folio").

That was then fixed for rc7 (and obviously final 6.18) by commit
5bebe8de19264 ("mm/huge_memory: Fix initialization of huge zero
folio"), but the breakage up until that time was a bit random.

End result: if you ever ended up bisecting into that broken range
between those two commits, you would get failures on some loads (but
not reliably), and your bisection would end up pointing to some random
thing.

But as mentioned, that particular problem would have been fixed in rc7
and in final 6.18, so any issues you saw with the final build would
have been due to something else.

Can I ask you to try to re-do the bisection, but with that commit
5bebe8de19264 applied by hand - if it wasn't already there - every
time you build a kernel that has adfb6609c680?

That way the bisection wouldn't be affected by that other known bug.

                    Linus
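
The re-bisection requested above can be sketched roughly as follows. This
is a minimal illustration, not a tested workflow: the helper names
(is_ancestor, needs_fix, prepare_tree) are hypothetical, the two commit IDs
are the ones quoted in this thread, and the script assumes it runs inside a
kernel git tree at each bisection step, before the build.

```python
import subprocess

# Commit IDs as quoted in this thread.
BUGGY = "adfb6609c680"   # "mm/huge_memory: initialise the tags of the huge zero folio"
FIX = "5bebe8de19264"    # "mm/huge_memory: Fix initialization of huge zero folio"


def is_ancestor(commit: str, repo: str = ".") -> bool:
    """Return True if `commit` is contained in the checked-out HEAD."""
    # `git merge-base --is-ancestor A B` exits 0 when A is an ancestor of B.
    result = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", commit, "HEAD"],
        check=False,
    )
    return result.returncode == 0


def needs_fix(has_buggy: bool, has_fix: bool) -> bool:
    """The fix is needed only in the range that has the bug but not the fix."""
    return has_buggy and not has_fix


def prepare_tree(repo: str = ".") -> bool:
    """Apply the fix to the working tree if this bisection step needs it."""
    if needs_fix(is_ancestor(BUGGY, repo), is_ancestor(FIX, repo)):
        # --no-commit leaves the fix as uncommitted changes in the tree.
        subprocess.run(
            ["git", "-C", repo, "cherry-pick", "--no-commit", FIX],
            check=True,
        )
        return True
    return False
```

One would call prepare_tree() after each bisection checkout, then build and
test. Because the fix is left uncommitted, the tree likely needs a
"git reset --hard" before the step is marked with "git bisect good" or
"git bisect bad", so that the next checkout starts clean.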


* Re: Linux 6.18 amdgpu build error
  2025-12-04 19:36                 ` Linus Torvalds
@ 2025-12-04 19:45                   ` David Hildenbrand (Red Hat)
  2025-12-04 23:20                     ` Shuah Khan
  0 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-04 19:45 UTC
  To: Linus Torvalds, Shuah Khan
  Cc: akpm, Alexander Deucher, Linux Kernel Mailing List, amd-gfx,
	dri-devel, Guenter Roeck, Linux Memory Management List

On 12/4/25 20:36, Linus Torvalds wrote:
> On Thu, 4 Dec 2025 at 09:40, Shuah Khan <skhan@linuxfoundation.org> wrote:
>>
>> This commit has impact on all architectures, not a narrow scoped
>> powerpc only thing -  it enables HAVE_GIGANTIC_FOLIOS on x86_64
>> and changes the common code that determines MAX_FOLIO_ORDER in
>> include/linux/mm.h
> 
> So I suspect your bisection might not have worked out, and there might
> be two different things going on.
> 
> In particular, hugepages were broken in 6.18-rc6 due to commit
> adfb6609c680 ("mm/huge_memory: initialise the tags of the huge zero
> folio").
> 
> That was then fixed for rc7 (and obviously final 6.18) by commit
> 5bebe8de19264 ("mm/huge_memory: Fix initialization of huge zero
> folio"), but the breakage up until that time was a bit random.
> 
> End result: if you ever ended up bisecting into that broken range
> between those two commits, you would get failures on some loads (but
> not reliably), and your bisection would end up pointing to some random
> thing.
> 
> But as mentioned, that particular problem would have been fixed in rc7
> and in final 6.18, so any issues you saw with the final build would
> have been due to something else.
> 
> Can I ask you to try to re-do the bisection, but with that commit
> 5bebe8de19264 applied by hand - if it wasn't already there - every
> time you build a kernel that has adfb6609c680?

Right, that's what I also proposed in [1].

I cannot make sense of how 39231e8d6ba could possibly trigger it given 
that it only affects the value of MAX_FOLIO_ORDER --- which is primarily 
used for safety checks and snapshot_page(), nothing that could explain 
changed application behavior, really.

But while Shuah is retesting, I'll go have yet another look.

[1] https://lore.kernel.org/all/78af7da4-d213-42c6-8ca6-c2bdca81f233@linuxfoundation.org/

-- 
Cheers

David



* Re: Linux 6.18 amdgpu build error
  2025-12-04 19:45                   ` David Hildenbrand (Red Hat)
@ 2025-12-04 23:20                     ` Shuah Khan
  2025-12-04 23:23                       ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Shuah Khan @ 2025-12-04 23:20 UTC
  To: David Hildenbrand (Red Hat), Linus Torvalds
  Cc: akpm, Alexander Deucher, Linux Kernel Mailing List, amd-gfx,
	dri-devel, Guenter Roeck, Linux Memory Management List,
	Shuah Khan

On 12/4/25 12:45, David Hildenbrand (Red Hat) wrote:
> On 12/4/25 20:36, Linus Torvalds wrote:
>> On Thu, 4 Dec 2025 at 09:40, Shuah Khan <skhan@linuxfoundation.org> wrote:
>>>
>>> This commit has impact on all architectures, not a narrow scoped
>>> powerpc only thing -  it enables HAVE_GIGANTIC_FOLIOS on x86_64
>>> and changes the common code that determines MAX_FOLIO_ORDER in
>>> include/linux/mm.h
>>
>> So I suspect your bisection might not have worked out, and there might
>> be two different things going on.
>>
>> In particular, hugepages were broken in 6.18-rc6 due to commit
>> adfb6609c680 ("mm/huge_memory: initialise the tags of the huge zero
>> folio").
>>
>> That was then fixed for rc7 (and obviously final 6.18) by commit
>> 5bebe8de19264 ("mm/huge_memory: Fix initialization of huge zero
>> folio"), but the breakage up until that time was a bit random.
>>

Both my systems were running rc6 - I was stuck in a state
where I was able to rebase to rc7 and then 6.18, but could
never build either one.

>> End result: if you ever ended up bisecting into that broken range
>> between those two commits, you would get failures on some loads (but
>> not reliably), and your bisection would end up pointing to some random
>> thing.
>>
>> But as mentioned, that particular problem would have been fixed in rc7
>> and in final 6.18, so any issues you saw with the final build would
>> have been due to something else.
>>
>> Can I ask you to try to re-do the bisection, but with that commit
>> 5bebe8de19264 applied by hand - if it wasn't already there - every
>> time you build a kernel that has adfb6609c680?

When I suspected rc6 to be the problem, I booted rc5 and compiled 6.18
after reverting 39231e8d6ba based on config file changes between rc5
and rc6.
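
Comparing config files between two releases, as described above, can be
sketched like this. The function names parse_config and diff_configs are
hypothetical, and the two config fragments are made up for illustration;
they are not taken from the systems discussed in this thread.

```python
from typing import Dict, Optional, Tuple


def parse_config(text: str) -> Dict[str, str]:
    """Parse kernel .config text into {symbol: value}.

    Symbols listed as "# CONFIG_FOO is not set" are recorded as "n".
    """
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_configs(
    old: str, new: str
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {symbol: (old_value, new_value)} for every symbol that differs.

    A value of None means the symbol does not appear in that config at all.
    """
    before, after = parse_config(old), parse_config(new)
    return {
        name: (before.get(name), after.get(name))
        for name in sorted(before.keys() | after.keys())
        if before.get(name) != after.get(name)
    }


# Illustrative fragments only.
OLD = "CONFIG_HUGETLB_PAGE=y\n"
NEW = "CONFIG_HUGETLB_PAGE=y\nCONFIG_HAVE_GIGANTIC_FOLIOS=y\n"

for name, (was, now) in diff_configs(OLD, NEW).items():
    print(f"{name}: {was} -> {now}")
```

A symbol that shows up only in the newer config is a natural first suspect,
but as the rest of this thread shows, it is not proof of the cause.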

> 
> Right, that's what I also proposed in [1].
> 
> I cannot make sense of how 39231e8d6ba could possibly trigger it given
> that it only affects the value of MAX_FOLIO_ORDER --- which is primarily
> used for safety checks and snapshot_page(), nothing that could explain
> changed application behavior, really.
> 
> But while Shuah is retesting, I'll go have yet another look.

I retested on both systems on 6.18, making sure I have 5bebe8de19264
and 39231e8d6ba in there. I cloned linux-next and built it on both.

I didn't see any problems on 6.18. Having said that, it might make
sense to hold off on including 39231e8d6ba in 6.18 so there is more
time to test beyond 2 rc cycles. That is for you all to decide.

thanks,
-- Shuah



* Re: Linux 6.18 amdgpu build error
  2025-12-04 23:20                     ` Shuah Khan
@ 2025-12-04 23:23                       ` Linus Torvalds
  2025-12-04 23:28                         ` Shuah Khan
  0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2025-12-04 23:23 UTC
  To: Shuah Khan
  Cc: David Hildenbrand (Red Hat),
	akpm, Alexander Deucher, Linux Kernel Mailing List, amd-gfx,
	dri-devel, Guenter Roeck, Linux Memory Management List

On Thu, 4 Dec 2025 at 15:20, Shuah Khan <skhan@linuxfoundation.org> wrote:
>
> I didn't see any problems on 6.18.

Ahh. So it might be just that buggy commit adfb6609c680 then, and the
fix already being in rc7 (and final).

              Linus



* Re: Linux 6.18 amdgpu build error
  2025-12-04 23:23                       ` Linus Torvalds
@ 2025-12-04 23:28                         ` Shuah Khan
  0 siblings, 0 replies; 8+ messages in thread
From: Shuah Khan @ 2025-12-04 23:28 UTC
  To: Linus Torvalds
  Cc: David Hildenbrand (Red Hat),
	akpm, Alexander Deucher, Linux Kernel Mailing List, amd-gfx,
	dri-devel, Guenter Roeck, Linux Memory Management List,
	Shuah Khan

On 12/4/25 16:23, Linus Torvalds wrote:
> On Thu, 4 Dec 2025 at 15:20, Shuah Khan <skhan@linuxfoundation.org> wrote:
>>
>> I didn't see any problems on 6.18.
> 
> Ahh. So it might be just that buggy commit adfb6609c680 then, and the
> fix already being in rc7 (and final).
> 

Yes - correct.

thanks,
-- Shuah


