* get_pageblock_bitmap() sometimes returns a NULL pointer which needs to be properly handled
@ 2024-05-02 16:02 Markus Gothe
2024-05-06 9:39 ` Baolin Wang
0 siblings, 1 reply; 7+ messages in thread
From: Markus Gothe @ 2024-05-02 16:02 UTC (permalink / raw)
To: linux-mm; +Cc: akpm
[-- Attachment #1.1: Type: text/plain, Size: 2028 bytes --]
Hi,
under some rare occasion I run into the following crash:
[ 41.417606] pstate: 80400005 (Nzcv daif +PAN -UAO)
[ 41.422406] pc : set_pfnblock_flags_mask+0x50/0x94
[ 41.427193] lr : compaction_alloc+0x220/0x804
[ 41.431544] sp : ffffffc01104bb10
[ 41.434852] x29: ffffffc01104bb10 x28: ffffffc010e5b500
[ 41.440165] x27: 0000000000098000 x26: ffffffc010e5b500
[ 41.445477] x25: 0000000000000066 x24: 0000000000090800
[ 41.450789] x23: 0000000000000200 x22: 0000000000084000
[ 41.456093] x21: ffffffc010e82000 x20: ffffffc010b88000
[ 41.461396] x19: ffffffc01104bd70 x18: 0000000000000000
[ 41.466700] x17: f1f24e35df34dda4 x16: 6b3f63a0e1157268
[ 41.472004] x15: 4b3990ec2568ada0 x14: 757ebc126939cb5f
[ 41.477308] x13: 9df9488aba179ccb x12: 0000000000000000
[ 41.482612] x11: 0000000000000000 x10: ffffffc010c5fc30
[ 41.487916] x9 : ffffff801eea7c00 x8 : 000000001bf00000
[ 41.493219] x7 : 0000000000000000 x6 : 000000000000003f
[ 41.498525] x5 : 0000000000000108 x4 : 1000000000000000
[ 41.503835] x3 : 0000000000000021 x2 : 000000000000003c
[ 41.509139] x1 : 0000000000000001 x0 : 0000000000000003
[ 41.514443] Call trace:
[ 41.516887] set_pfnblock_flags_mask+0x50/0x94
[ 41.521330] migrate_pages+0x90/0x7f0
[ 41.524992] compact_zone+0x854/0x9f0
[ 41.528647] kcompactd_do_work+0x168/0x230
[ 41.532734] kcompactd+0x58/0x140
[ 41.536043] kthread+0x120/0x124
[ 41.539263] ret_from_fork+0x10/0x24
[ 41.542835] Code: d346fc43 4b0000c2 8b030ce5 9ac22084 (f86378e0)
[ 41.548925] ---[ end trace 731400a587304db3 ]---
I've pin-pointed it down to pageblock_flags pointer being initialized to NULL under certain conditions. I don't know why this happens.
Maybe it is some obscure race condition which only shows up on my system.
Nonetheless, I've made a fix for this in the attached patch. It adds sanity checking and protects the system for crashing.
I think it is better to be safe than sorry.
--
Best Regards,
Markus
[-- Attachment #1.2: Type: text/html, Size: 2320 bytes --]
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-page_alloc.c-Sanity-check-for-NULL-pointers.patch --]
[-- Type: text/x-patch; name="0001-page_alloc.c-Sanity-check-for-NULL-pointers.patch", Size: 1248 bytes --]
From 2ff8267436362532848b73b7beab7fd36015b0c5 Mon Sep 17 00:00:00 2001
From: Markus Gothe <markus.gothe@genexis.eu>
Date: Thu, 2 May 2024 15:36:36 +0200
Subject: [PATCH] page_alloc.c: Sanity check for NULL pointers.
get_pageblock_bitmap() might in very rare cicumstances
return NULL which must be handled accordingly or
otherwise we will end up with a kernel crash.
---
mm/page_alloc.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 540af9a120e6..ab230e349862 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -486,6 +486,9 @@ static __always_inline unsigned long __get_pfnblock_flags_mask(struct page *page
unsigned long word;
bitmap = get_pageblock_bitmap(page, pfn);
+ if (unlikely(bitmap == NULL))
+ return 0;
+
bitidx = pfn_to_bitidx(page, pfn);
word_bitidx = bitidx / BITS_PER_LONG;
bitidx &= (BITS_PER_LONG-1);
@@ -528,6 +531,9 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits));
bitmap = get_pageblock_bitmap(page, pfn);
+ if (unlikely(bitmap == NULL))
+ return;
+
bitidx = pfn_to_bitidx(page, pfn);
word_bitidx = bitidx / BITS_PER_LONG;
bitidx &= (BITS_PER_LONG-1);
--
2.43.2
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: get_pageblock_bitmap() sometimes returns a NULL pointer which needs to be properly handled
2024-05-02 16:02 get_pageblock_bitmap() sometimes returns a NULL pointer which needs to be properly handled Markus Gothe
@ 2024-05-06 9:39 ` Baolin Wang
2024-05-06 9:51 ` Markus Gothe
2024-05-06 9:54 ` David Hildenbrand
0 siblings, 2 replies; 7+ messages in thread
From: Baolin Wang @ 2024-05-06 9:39 UTC (permalink / raw)
To: Markus Gothe, linux-mm; +Cc: akpm, David Hildenbrand
Ccing David.
On 2024/5/3 00:02, Markus Gothe wrote:
> Hi,
> under some rare occasion I run into the following crash:
>
> [ 41.417606] pstate: 80400005 (Nzcv daif +PAN -UAO)
> [ 41.422406] pc : set_pfnblock_flags_mask+0x50/0x94
> [ 41.427193] lr : compaction_alloc+0x220/0x804
> [ 41.431544] sp : ffffffc01104bb10
> [ 41.434852] x29: ffffffc01104bb10 x28: ffffffc010e5b500
> [ 41.440165] x27: 0000000000098000 x26: ffffffc010e5b500
> [ 41.445477] x25: 0000000000000066 x24: 0000000000090800
> [ 41.450789] x23: 0000000000000200 x22: 0000000000084000
> [ 41.456093] x21: ffffffc010e82000 x20: ffffffc010b88000
> [ 41.461396] x19: ffffffc01104bd70 x18: 0000000000000000
> [ 41.466700] x17: f1f24e35df34dda4 x16: 6b3f63a0e1157268
> [ 41.472004] x15: 4b3990ec2568ada0 x14: 757ebc126939cb5f
> [ 41.477308] x13: 9df9488aba179ccb x12: 0000000000000000
> [ 41.482612] x11: 0000000000000000 x10: ffffffc010c5fc30
> [ 41.487916] x9 : ffffff801eea7c00 x8 : 000000001bf00000
> [ 41.493219] x7 : 0000000000000000 x6 : 000000000000003f
> [ 41.498525] x5 : 0000000000000108 x4 : 1000000000000000
> [ 41.503835] x3 : 0000000000000021 x2 : 000000000000003c
> [ 41.509139] x1 : 0000000000000001 x0 : 0000000000000003
> [ 41.514443] Call trace:
> [ 41.516887] set_pfnblock_flags_mask+0x50/0x94
> [ 41.521330] migrate_pages+0x90/0x7f0
> [ 41.524992] compact_zone+0x854/0x9f0
> [ 41.528647] kcompactd_do_work+0x168/0x230
> [ 41.532734] kcompactd+0x58/0x140
> [ 41.536043] kthread+0x120/0x124
> [ 41.539263] ret_from_fork+0x10/0x24
> [ 41.542835] Code: d346fc43 4b0000c2 8b030ce5 9ac22084 (f86378e0)
> [ 41.548925] ---[ end trace 731400a587304db3 ]---
>
>
> I've pin-pointed it down to pageblock_flags pointer being initialized to NULL under certain conditions. I don't know why this happens.
> Maybe it is some obscure race condition which only shows up on my system.
Is there memory hotplug in your test? It seems to be caused by the race
between memory hotplug and PFN walkers (such as compaction), which is
already a known issue.
> Nonetheless, I've made a fix for this in the attached patch. It adds sanity checking and protects the system for crashing.
> I think it is better to be safe than sorry.
>
> --
> Best Regards,
> Markus
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: get_pageblock_bitmap() sometimes returns a NULL pointer which needs to be properly handled
2024-05-06 9:39 ` Baolin Wang
@ 2024-05-06 9:51 ` Markus Gothe
2024-05-06 9:54 ` David Hildenbrand
1 sibling, 0 replies; 7+ messages in thread
From: Markus Gothe @ 2024-05-06 9:51 UTC (permalink / raw)
To: Baolin Wang, linux-mm; +Cc: akpm, David Hildenbrand
Hi Baolin,
no memory hotplug.
This happens usually during bootup when I've a WAN-link established.
BR,
Markus
On 06/05/2024 11.39, Baolin Wang wrote:
> [You don't often get email from baolin.wang@linux.alibaba.com. Learn
> why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>
> Ccing David.
>
> On 2024/5/3 00:02, Markus Gothe wrote:
>> Hi,
>> under some rare occasion I run into the following crash:
>>
>> [ 41.417606] pstate: 80400005 (Nzcv daif +PAN -UAO)
>> [ 41.422406] pc : set_pfnblock_flags_mask+0x50/0x94
>> [ 41.427193] lr : compaction_alloc+0x220/0x804
>> [ 41.431544] sp : ffffffc01104bb10
>> [ 41.434852] x29: ffffffc01104bb10 x28: ffffffc010e5b500
>> [ 41.440165] x27: 0000000000098000 x26: ffffffc010e5b500
>> [ 41.445477] x25: 0000000000000066 x24: 0000000000090800
>> [ 41.450789] x23: 0000000000000200 x22: 0000000000084000
>> [ 41.456093] x21: ffffffc010e82000 x20: ffffffc010b88000
>> [ 41.461396] x19: ffffffc01104bd70 x18: 0000000000000000
>> [ 41.466700] x17: f1f24e35df34dda4 x16: 6b3f63a0e1157268
>> [ 41.472004] x15: 4b3990ec2568ada0 x14: 757ebc126939cb5f
>> [ 41.477308] x13: 9df9488aba179ccb x12: 0000000000000000
>> [ 41.482612] x11: 0000000000000000 x10: ffffffc010c5fc30
>> [ 41.487916] x9 : ffffff801eea7c00 x8 : 000000001bf00000
>> [ 41.493219] x7 : 0000000000000000 x6 : 000000000000003f
>> [ 41.498525] x5 : 0000000000000108 x4 : 1000000000000000
>> [ 41.503835] x3 : 0000000000000021 x2 : 000000000000003c
>> [ 41.509139] x1 : 0000000000000001 x0 : 0000000000000003
>> [ 41.514443] Call trace:
>> [ 41.516887] set_pfnblock_flags_mask+0x50/0x94
>> [ 41.521330] migrate_pages+0x90/0x7f0
>> [ 41.524992] compact_zone+0x854/0x9f0
>> [ 41.528647] kcompactd_do_work+0x168/0x230
>> [ 41.532734] kcompactd+0x58/0x140
>> [ 41.536043] kthread+0x120/0x124
>> [ 41.539263] ret_from_fork+0x10/0x24
>> [ 41.542835] Code: d346fc43 4b0000c2 8b030ce5 9ac22084 (f86378e0)
>> [ 41.548925] ---[ end trace 731400a587304db3 ]---
>>
>>
>> I've pin-pointed it down to pageblock_flags pointer being initialized
>> to NULL under certain conditions. I don't know why this happens.
>> Maybe it is some obscure race condition which only shows up on my
>> system.
>
> Is there memory hotplug in your test? It seems to be caused by the race
> between memory hotplug and PFN walkers (such as compaction), which is
> already a known issue.
>
>> Nonetheless, I've made a fix for this in the attached patch. It adds
>> sanity checking and protects the system for crashing.
>> I think it is better to be safe than sorry.
>>
>> --
>> Best Regards,
>> Markus
>>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: get_pageblock_bitmap() sometimes returns a NULL pointer which needs to be properly handled
2024-05-06 9:39 ` Baolin Wang
2024-05-06 9:51 ` Markus Gothe
@ 2024-05-06 9:54 ` David Hildenbrand
2024-05-06 10:01 ` Markus Gothe
2024-05-06 10:42 ` Oscar Salvador
1 sibling, 2 replies; 7+ messages in thread
From: David Hildenbrand @ 2024-05-06 9:54 UTC (permalink / raw)
To: Baolin Wang, Markus Gothe, linux-mm; +Cc: akpm, Zi Yan
On 06.05.24 11:39, Baolin Wang wrote:
> Ccing David.
>
> On 2024/5/3 00:02, Markus Gothe wrote:
>> Hi,
>> under some rare occasion I run into the following crash:
>>
>> [ 41.417606] pstate: 80400005 (Nzcv daif +PAN -UAO)
>> [ 41.422406] pc : set_pfnblock_flags_mask+0x50/0x94
>> [ 41.427193] lr : compaction_alloc+0x220/0x804
>> [ 41.431544] sp : ffffffc01104bb10
>> [ 41.434852] x29: ffffffc01104bb10 x28: ffffffc010e5b500
>> [ 41.440165] x27: 0000000000098000 x26: ffffffc010e5b500
>> [ 41.445477] x25: 0000000000000066 x24: 0000000000090800
>> [ 41.450789] x23: 0000000000000200 x22: 0000000000084000
>> [ 41.456093] x21: ffffffc010e82000 x20: ffffffc010b88000
>> [ 41.461396] x19: ffffffc01104bd70 x18: 0000000000000000
>> [ 41.466700] x17: f1f24e35df34dda4 x16: 6b3f63a0e1157268
>> [ 41.472004] x15: 4b3990ec2568ada0 x14: 757ebc126939cb5f
>> [ 41.477308] x13: 9df9488aba179ccb x12: 0000000000000000
>> [ 41.482612] x11: 0000000000000000 x10: ffffffc010c5fc30
>> [ 41.487916] x9 : ffffff801eea7c00 x8 : 000000001bf00000
>> [ 41.493219] x7 : 0000000000000000 x6 : 000000000000003f
>> [ 41.498525] x5 : 0000000000000108 x4 : 1000000000000000
>> [ 41.503835] x3 : 0000000000000021 x2 : 000000000000003c
>> [ 41.509139] x1 : 0000000000000001 x0 : 0000000000000003
>> [ 41.514443] Call trace:
>> [ 41.516887] set_pfnblock_flags_mask+0x50/0x94
>> [ 41.521330] migrate_pages+0x90/0x7f0
>> [ 41.524992] compact_zone+0x854/0x9f0
>> [ 41.528647] kcompactd_do_work+0x168/0x230
>> [ 41.532734] kcompactd+0x58/0x140
>> [ 41.536043] kthread+0x120/0x124
>> [ 41.539263] ret_from_fork+0x10/0x24
>> [ 41.542835] Code: d346fc43 4b0000c2 8b030ce5 9ac22084 (f86378e0)
>> [ 41.548925] ---[ end trace 731400a587304db3 ]---
>>
>>
>> I've pin-pointed it down to pageblock_flags pointer being initialized to NULL under certain conditions. I don't know why this happens.
>> Maybe it is some obscure race condition which only shows up on my system.
>
> Is there memory hotplug in your test? It seems to be caused by the race
> between memory hotplug and PFN walkers (such as compaction), which is
> already a known issue.
I think I've never seen races with access to pageblocks but only with
access to the memmap.
Further, I'd not expect races during migrate_pages()? We're holding a
reference do all folios when calling migrate_pages(). So memory
offlining+removal would not be able to succeed until we drop these
references.
But, could it be that we failing during compaction_alloc() [lr :
compaction_alloc+0x220/0x804] and have an issue during
set_pfnblock_flags_mask() on a page that sits on the isolated freelist?
Similarly, memory hotunplug should not be able to mess up here.
[again, racing with memory hotunplug is unlikely]
On which kernel did we start seeing this issue?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: get_pageblock_bitmap() sometimes returns a NULL pointer which needs to be properly handled
2024-05-06 9:54 ` David Hildenbrand
@ 2024-05-06 10:01 ` Markus Gothe
2024-05-06 10:45 ` David Hildenbrand
2024-05-06 10:42 ` Oscar Salvador
1 sibling, 1 reply; 7+ messages in thread
From: Markus Gothe @ 2024-05-06 10:01 UTC (permalink / raw)
To: David Hildenbrand, Baolin Wang, linux-mm; +Cc: akpm, Zi Yan
Hi David,
it is from a vendor SDK for an emedded system using Linux 5.4.55.
BR,
Markus
On 06/05/2024 11.54, David Hildenbrand wrote:
> On 06.05.24 11:39, Baolin Wang wrote:
>> Ccing David.
>>
>> On 2024/5/3 00:02, Markus Gothe wrote:
>>> Hi,
>>> under some rare occasion I run into the following crash:
>>>
>>> [ 41.417606] pstate: 80400005 (Nzcv daif +PAN -UAO)
>>> [ 41.422406] pc : set_pfnblock_flags_mask+0x50/0x94
>>> [ 41.427193] lr : compaction_alloc+0x220/0x804
>>> [ 41.431544] sp : ffffffc01104bb10
>>> [ 41.434852] x29: ffffffc01104bb10 x28: ffffffc010e5b500
>>> [ 41.440165] x27: 0000000000098000 x26: ffffffc010e5b500
>>> [ 41.445477] x25: 0000000000000066 x24: 0000000000090800
>>> [ 41.450789] x23: 0000000000000200 x22: 0000000000084000
>>> [ 41.456093] x21: ffffffc010e82000 x20: ffffffc010b88000
>>> [ 41.461396] x19: ffffffc01104bd70 x18: 0000000000000000
>>> [ 41.466700] x17: f1f24e35df34dda4 x16: 6b3f63a0e1157268
>>> [ 41.472004] x15: 4b3990ec2568ada0 x14: 757ebc126939cb5f
>>> [ 41.477308] x13: 9df9488aba179ccb x12: 0000000000000000
>>> [ 41.482612] x11: 0000000000000000 x10: ffffffc010c5fc30
>>> [ 41.487916] x9 : ffffff801eea7c00 x8 : 000000001bf00000
>>> [ 41.493219] x7 : 0000000000000000 x6 : 000000000000003f
>>> [ 41.498525] x5 : 0000000000000108 x4 : 1000000000000000
>>> [ 41.503835] x3 : 0000000000000021 x2 : 000000000000003c
>>> [ 41.509139] x1 : 0000000000000001 x0 : 0000000000000003
>>> [ 41.514443] Call trace:
>>> [ 41.516887] set_pfnblock_flags_mask+0x50/0x94
>>> [ 41.521330] migrate_pages+0x90/0x7f0
>>> [ 41.524992] compact_zone+0x854/0x9f0
>>> [ 41.528647] kcompactd_do_work+0x168/0x230
>>> [ 41.532734] kcompactd+0x58/0x140
>>> [ 41.536043] kthread+0x120/0x124
>>> [ 41.539263] ret_from_fork+0x10/0x24
>>> [ 41.542835] Code: d346fc43 4b0000c2 8b030ce5 9ac22084 (f86378e0)
>>> [ 41.548925] ---[ end trace 731400a587304db3 ]---
>>>
>>>
>>> I've pin-pointed it down to pageblock_flags pointer being
>>> initialized to NULL under certain conditions. I don't know why this
>>> happens.
>>> Maybe it is some obscure race condition which only shows up on my
>>> system.
>>
>> Is there memory hotplug in your test? It seems to be caused by the race
>> between memory hotplug and PFN walkers (such as compaction), which is
>> already a known issue.
>
> I think I've never seen races with access to pageblocks but only with
> access to the memmap.
>
> Further, I'd not expect races during migrate_pages()? We're holding a
> reference do all folios when calling migrate_pages(). So memory
> offlining+removal would not be able to succeed until we drop these
> references.
>
> But, could it be that we failing during compaction_alloc() [lr :
> compaction_alloc+0x220/0x804] and have an issue during
> set_pfnblock_flags_mask() on a page that sits on the isolated
> freelist? Similarly, memory hotunplug should not be able to mess up here.
>
> [again, racing with memory hotunplug is unlikely]
>
> On which kernel did we start seeing this issue?
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: get_pageblock_bitmap() sometimes returns a NULL pointer which needs to be properly handled
2024-05-06 9:54 ` David Hildenbrand
2024-05-06 10:01 ` Markus Gothe
@ 2024-05-06 10:42 ` Oscar Salvador
1 sibling, 0 replies; 7+ messages in thread
From: Oscar Salvador @ 2024-05-06 10:42 UTC (permalink / raw)
To: David Hildenbrand; +Cc: Baolin Wang, Markus Gothe, linux-mm, akpm, Zi Yan
On Mon, May 06, 2024 at 11:54:49AM +0200, David Hildenbrand wrote:
> But, could it be that we failing during compaction_alloc() [lr :
> compaction_alloc+0x220/0x804] and have an issue during
> set_pfnblock_flags_mask() on a page that sits on the isolated freelist?
> Similarly, memory hotunplug should not be able to mess up here.
But pageblock_flags is something that gets initialized at boot time and
never touched again (for already present sections).
There are two paths to get it:
section_to_usemap(__pfn_to_section(pfn))
or
page_zone(page)->pageblock_flags
Assuming the we somehow fail in compaction_alloc() and get a bad page or
NULL, we should crash before reaching to the point where we ask for
pageblock_flags.
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: get_pageblock_bitmap() sometimes returns a NULL pointer which needs to be properly handled
2024-05-06 10:01 ` Markus Gothe
@ 2024-05-06 10:45 ` David Hildenbrand
0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand @ 2024-05-06 10:45 UTC (permalink / raw)
To: Markus Gothe, Baolin Wang, linux-mm; +Cc: akpm, Zi Yan
On 06.05.24 12:01, Markus Gothe wrote:
> Hi David,
Hi!
> it is from a vendor SDK for an emedded system using Linux 5.4.55.
The current stable version of 5.4 is 5.4.275 ... can you reproduce on
more recent (unmodified upstream/stable) kernel as well?
If not, I'm afraid we cannot really help and you should contact the vendor!
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2024-05-06 10:45 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-02 16:02 get_pageblock_bitmap() sometimes returns a NULL pointer which needs to be properly handled Markus Gothe
2024-05-06 9:39 ` Baolin Wang
2024-05-06 9:51 ` Markus Gothe
2024-05-06 9:54 ` David Hildenbrand
2024-05-06 10:01 ` Markus Gothe
2024-05-06 10:45 ` David Hildenbrand
2024-05-06 10:42 ` Oscar Salvador
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox