* page type is 3, passed migratetype is 1 (nr=512)
From: Christoph Hellwig @ 2024-05-27 8:58 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andy Shevchenko, Baolin Wang, linux-mm
Hi all,
when running xfstests on nfs against a local server I see warnings like
the ones above, which appear to have been added in commit
e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting").
Here is a typical backtrace:
generic/154 109s ... [ 617.027401] run fstests generic/154 at 2024-05-27 08:53:22
[ 628.242905] ------------[ cut here ]------------
[ 628.243233] page type is 3, passed migratetype is 1 (nr=512)
[ 628.243595] WARNING: CPU: 1 PID: 3608 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
[ 628.244017] Modules linked in: crc32_pclmul i2c_piix4
[ 628.244321] CPU: 1 PID: 3608 Comm: nfsd Not tainted 6.10.0-rc1+ #2435
[ 628.244711] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 628.245270] RIP: 0010:expand+0x1c5/0x1f0
[ 628.245514] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 00 00 d3 e0 83
[ 628.246610] RSP: 0018:ffffc9000151f6f8 EFLAGS: 00010082
[ 628.246920] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
[ 628.247337] RDX: 0000000000000003 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 628.247760] RBP: 0000000000174200 R08: 00000000fffeffff R09: 0000000000000001
[ 628.248175] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
[ 628.248603] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0005d08000
[ 628.249022] FS: 0000000000000000(0000) GS:ffff888237d00000(0000) knlGS:0000000000000000
[ 628.249499] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 628.249841] CR2: 00005643a0160004 CR3: 000000000365a001 CR4: 0000000000770ef0
[ 628.250265] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 628.250697] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 628.251121] PKRU: 55555554
[ 628.251285] Call Trace:
[ 628.251436] <TASK>
[ 628.251572] ? __warn+0x7b/0x120
[ 628.251769] ? expand+0x1c5/0x1f0
[ 628.251969] ? report_bug+0x191/0x1c0
[ 628.252202] ? handle_bug+0x3c/0x80
[ 628.252413] ? exc_invalid_op+0x17/0x70
[ 628.252646] ? asm_exc_invalid_op+0x1a/0x20
[ 628.252899] ? expand+0x1c5/0x1f0
[ 628.253100] ? expand+0x1c5/0x1f0
[ 628.253299] ? __free_one_page+0x5e4/0x750
[ 628.253549] get_page_from_freelist+0x305/0xf00
[ 628.253822] __alloc_pages_slowpath.constprop.0+0x42b/0xca0
[ 628.254148] __alloc_pages_noprof+0x2b9/0x2e0
[ 628.254408] __folio_alloc_noprof+0x10/0xa0
[ 628.254661] __filemap_get_folio+0x16b/0x370
[ 628.254915] iomap_write_begin+0x496/0x680
[ 628.255159] iomap_file_buffered_write+0x17f/0x440
[ 628.255441] ? __break_lease+0x469/0x710
[ 628.255682] xfs_file_buffered_write+0x7e/0x2a0
[ 628.255952] do_iter_readv_writev+0x107/0x200
[ 628.256213] vfs_iter_write+0x9e/0x240
[ 628.256436] nfsd_vfs_write+0x188/0x5a0
[ 628.256673] nfsd4_write+0x133/0x1c0
[ 628.256888] nfsd4_proc_compound+0x2e3/0x5f0
[ 628.257144] nfsd_dispatch+0xc8/0x210
[ 628.257363] svc_process_common+0x4cf/0x700
[ 628.257619] ? __pfx_nfsd_dispatch+0x10/0x10
[ 628.257873] svc_process+0x13e/0x190
[ 628.258086] svc_recv+0x838/0xa00
[ 628.258286] ? __pfx_nfsd+0x10/0x10
[ 628.258500] nfsd+0x82/0xf0
[ 628.258668] kthread+0xd9/0x110
[ 628.258900] ? __pfx_kthread+0x10/0x10
[ 628.259192] ret_from_fork+0x2c/0x50
[ 628.259477] ? __pfx_kthread+0x10/0x10
[ 628.259763] ret_from_fork_asm+0x1a/0x30
[ 628.260070] </TASK>
[ 628.260244] ---[ end trace 0000000000000000 ]---
* Re: page type is 3, passed migratetype is 1 (nr=512)
From: Christoph Hellwig @ 2024-05-27 13:14 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andy Shevchenko, Baolin Wang, linux-mm
On Mon, May 27, 2024 at 01:58:25AM -0700, Christoph Hellwig wrote:
> Hi all,
>
> when running xfstests on nfs against a local server I see warnings like
> the ones above, which appear to have been added in commit
> e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting").
I've also reproduced this with xfstests on local xfs and no nfs in the
loop:
generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
[ 1204.969286] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
[ 1204.993621] XFS (nvme0n1): Ending clean mount
[ 1205.387032] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[ 1205.412322] XFS (nvme1n1): Ending clean mount
[ 1205.440388] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[ 1205.808063] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
[ 1205.827290] XFS (nvme1n1): Ending clean mount
[ 1208.058931] ------------[ cut here ]------------
[ 1208.059613] page type is 3, passed migratetype is 1 (nr=512)
[ 1208.060402] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
[ 1208.061352] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
[ 1208.062344] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
[ 1208.063150] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 1208.064204] RIP: 0010:expand+0x1c5/0x1f0
[ 1208.064625] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
[ 1208.066555] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
[ 1208.067111] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
[ 1208.067872] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 1208.068629] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
[ 1208.069336] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
[ 1208.070038] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
[ 1208.070750] FS: 00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
[ 1208.071552] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1208.072121] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
[ 1208.072829] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1208.073527] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 1208.074225] PKRU: 55555554
[ 1208.074507] Call Trace:
[ 1208.074758] <TASK>
[ 1208.074977] ? __warn+0x7b/0x120
[ 1208.075308] ? expand+0x1c5/0x1f0
[ 1208.075652] ? report_bug+0x191/0x1c0
[ 1208.076043] ? handle_bug+0x3c/0x80
[ 1208.076400] ? exc_invalid_op+0x17/0x70
[ 1208.076782] ? asm_exc_invalid_op+0x1a/0x20
[ 1208.077203] ? expand+0x1c5/0x1f0
[ 1208.077543] ? expand+0x1c5/0x1f0
[ 1208.077878] __rmqueue_pcplist+0x3a9/0x730
[ 1208.078285] get_page_from_freelist+0x7a0/0xf00
[ 1208.078745] __alloc_pages_noprof+0x153/0x2e0
[ 1208.079181] __folio_alloc_noprof+0x10/0xa0
[ 1208.079603] __filemap_get_folio+0x16b/0x370
[ 1208.080030] iomap_write_begin+0x496/0x680
[ 1208.080441] iomap_file_buffered_write+0x17f/0x440
[ 1208.080916] xfs_file_buffered_write+0x7e/0x2a0
[ 1208.081374] vfs_write+0x262/0x440
[ 1208.081717] __x64_sys_pwrite64+0x8f/0xc0
[ 1208.082112] do_syscall_64+0x4f/0x120
[ 1208.082487] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1208.082982] RIP: 0033:0x7f72ca4ce2b7
[ 1208.083350] Code: 08 89 3c 24 48 89 4c 24 18 e8 15 f4 f8 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 b
[ 1208.085126] RSP: 002b:00007ffe56d1a930 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
[ 1208.085867] RAX: ffffffffffffffda RBX: 0000000154400000 RCX: 00007f72ca4ce2b7
[ 1208.086560] RDX: 0000000000400000 RSI: 00007f72c9401000 RDI: 0000000000000003
[ 1208.087248] RBP: 0000000154400000 R08: 0000000000000000 R09: 00007ffe56d1a9d0
[ 1208.087946] R10: 0000000154400000 R11: 0000000000000293 R12: 00000000ffffffff
[ 1208.088639] R13: 00000000abc00000 R14: 0000000000000000 R15: 0000000000000551
[ 1208.089340] </TASK>
[ 1208.089565] ---[ end trace 0000000000000000 ]---
* Re: page type is 3, passed migratetype is 1 (nr=512)
From: Johannes Weiner @ 2024-05-28 16:47 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Andy Shevchenko, Baolin Wang, linux-mm
Hello,
On Mon, May 27, 2024 at 06:14:12AM -0700, Christoph Hellwig wrote:
> On Mon, May 27, 2024 at 01:58:25AM -0700, Christoph Hellwig wrote:
> > Hi all,
> >
> > when running xfstests on nfs against a local server I see warnings like
> > the ones above, which appear to have been added in commit
> > e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting").
>
> I've also reproduced this with xfstests on local xfs and no nfs in the
> loop:
>
> generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
> [ 1204.969286] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
> [ 1204.993621] XFS (nvme0n1): Ending clean mount
> [ 1205.387032] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [ 1205.412322] XFS (nvme1n1): Ending clean mount
> [ 1205.440388] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [ 1205.808063] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
> [ 1205.827290] XFS (nvme1n1): Ending clean mount
> [ 1208.058931] ------------[ cut here ]------------
> [ 1208.059613] page type is 3, passed migratetype is 1 (nr=512)
> [ 1208.060402] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
> [ 1208.061352] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
> [ 1208.062344] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
> [ 1208.063150] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Thanks for the report.
Could you please send me your .config? I'll try to reproduce it
locally.
> [ 1208.064204] RIP: 0010:expand+0x1c5/0x1f0
> [ 1208.064625] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
> [ 1208.066555] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
> [ 1208.067111] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
> [ 1208.067872] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
> [ 1208.068629] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
> [ 1208.069336] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
> [ 1208.070038] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
> [ 1208.070750] FS: 00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
> [ 1208.071552] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1208.072121] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
> [ 1208.072829] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1208.073527] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [ 1208.074225] PKRU: 55555554
> [ 1208.074507] Call Trace:
> [ 1208.074758] <TASK>
> [ 1208.074977] ? __warn+0x7b/0x120
> [ 1208.075308] ? expand+0x1c5/0x1f0
> [ 1208.075652] ? report_bug+0x191/0x1c0
> [ 1208.076043] ? handle_bug+0x3c/0x80
> [ 1208.076400] ? exc_invalid_op+0x17/0x70
> [ 1208.076782] ? asm_exc_invalid_op+0x1a/0x20
> [ 1208.077203] ? expand+0x1c5/0x1f0
> [ 1208.077543] ? expand+0x1c5/0x1f0
> [ 1208.077878] __rmqueue_pcplist+0x3a9/0x730
OK, so the allocator takes a larger buddy off the freelist to satisfy
a smaller request, then puts the remainder back on the list. There is
no warning from del_page_from_free_list(), so the buddy's type and the
type of the list it was taken from are consistent.
The warning happens when it expands the remainder of the buddy and
finds the tail block to be of a different type.
Specifically, it takes a movable buddy (type 1) off the movable list,
but finds a tail block of it marked highatomic (type 3).
I don't see how we could have merged those during freeing, because the
highatomic buddy would have failed migratetype_is_mergeable().
Ah, but there DOES seem to be an issue with how we reserve
highatomics: reserving and unreserving happens one pageblock at a
time, but MAX_ORDER is usually bigger. If we rmqueue() an order-10
request, reserve_highatomic_pageblock() will only convert the first
order-9 block in it; the tail will remain the original type, which
will produce a buddy of mixed type blocks upon freeing.
This doesn't fully explain the warning here. We'd expect to see it the
other way round - passing an assumed type of 3 (HIGHATOMIC) for the
remainder that is actually 1 (MOVABLE). But the pageblock-based
reservations look fishy. I'll cook up a patch to make this
range-based. It might just fix it in a way I'm not seeing just yet.
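To make that sequence concrete, here is a minimal userspace model (a
sketch, not kernel code; the pfn is hypothetical and the type values
mirror the warning text, 1 = MOVABLE, 3 = HIGHATOMIC):

/*
 * Minimal userspace model of the suspected sequence: a two-pageblock
 * buddy whose reservation retyped only the first block.
 * pageblock_order = 9, buddy order = 10.
 */
#include <stdio.h>

#define PAGEBLOCK_ORDER	9
#define MOVABLE		1
#define HIGHATOMIC	3

static int block_type[64];	/* one migratetype per pageblock */

/* rough stand-in for the consistency check behind mm/page_alloc.c:645 */
static void check_type(unsigned long pfn, unsigned long nr, int migratetype)
{
	int type = block_type[pfn >> PAGEBLOCK_ORDER];

	if (type != migratetype)
		printf("%#lx: page type is %d, passed migratetype is %d (nr=%lu)\n",
		       pfn, type, migratetype, nr);
}

int main(void)
{
	unsigned long pfn = 0x4000;	/* hypothetical order-10 buddy start */
	unsigned long nr = 1UL << PAGEBLOCK_ORDER;

	/* both pageblocks start out movable */
	block_type[pfn >> PAGEBLOCK_ORDER] = MOVABLE;
	block_type[(pfn + nr) >> PAGEBLOCK_ORDER] = MOVABLE;

	/* the reservation retypes only the first order-9 block */
	block_type[pfn >> PAGEBLOCK_ORDER] = HIGHATOMIC;

	/*
	 * After freeing, the two blocks merge into one order-10 buddy on
	 * the highatomic list; splitting it for a later request checks
	 * the tail half against the list it was taken from.
	 */
	check_type(pfn + nr, nr, HIGHATOMIC);
	return 0;
}

It prints the "other way round" variant (page type 1, passed
migratetype 3), i.e. what this path alone would be expected to
produce.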
> [ 1208.078285] get_page_from_freelist+0x7a0/0xf00
> [ 1208.078745] __alloc_pages_noprof+0x153/0x2e0
> [ 1208.079181] __folio_alloc_noprof+0x10/0xa0
> [ 1208.079603] __filemap_get_folio+0x16b/0x370
> [ 1208.080030] iomap_write_begin+0x496/0x680
> [ 1208.080441] iomap_file_buffered_write+0x17f/0x440
> [ 1208.080916] xfs_file_buffered_write+0x7e/0x2a0
> [ 1208.081374] vfs_write+0x262/0x440
> [ 1208.081717] __x64_sys_pwrite64+0x8f/0xc0
> [ 1208.082112] do_syscall_64+0x4f/0x120
> [ 1208.082487] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1208.082982] RIP: 0033:0x7f72ca4ce2b7
> [ 1208.083350] Code: 08 89 3c 24 48 89 4c 24 18 e8 15 f4 f8 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 b
> [ 1208.085126] RSP: 002b:00007ffe56d1a930 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
> [ 1208.085867] RAX: ffffffffffffffda RBX: 0000000154400000 RCX: 00007f72ca4ce2b7
> [ 1208.086560] RDX: 0000000000400000 RSI: 00007f72c9401000 RDI: 0000000000000003
> [ 1208.087248] RBP: 0000000154400000 R08: 0000000000000000 R09: 00007ffe56d1a9d0
> [ 1208.087946] R10: 0000000154400000 R11: 0000000000000293 R12: 00000000ffffffff
> [ 1208.088639] R13: 00000000abc00000 R14: 0000000000000000 R15: 0000000000000551
> [ 1208.089340] </TASK>
> [ 1208.089565] ---[ end trace 0000000000000000 ]---
* Re: page type is 3, passed migratetype is 1 (nr=512)
From: Christoph Hellwig @ 2024-05-29 5:43 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Christoph Hellwig, Andy Shevchenko, Baolin Wang, linux-mm
[-- Attachment #1: Type: text/plain, Size: 154 bytes --]
On Tue, May 28, 2024 at 12:47:56PM -0400, Johannes Weiner wrote:
> Could you please send me your .config? I'll try to reproduce it
> locally.
Attached.
[-- Attachment #2: config.hch.gz --]
[-- Type: application/gzip, Size: 39546 bytes --]
* Re: page type is 3, passed migratetype is 1 (nr=512)
From: Johannes Weiner @ 2024-05-29 16:28 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Andy Shevchenko, Baolin Wang, linux-mm
Thanks for the config, I was able to reproduce the issue with it.
On Tue, May 28, 2024 at 12:48:05PM -0400, Johannes Weiner wrote:
> Ah, but there DOES seem to be an issue with how we reserve
> highatomics: reserving and unreserving happens one pageblock at a
> time, but MAX_ORDER is usually bigger. If we rmqueue() an order-10
> request, reserve_highatomic_block() will only convert the first
> order-9 block in it; the tail will remain the original type, which
> will produce a buddy of mixed type blocks upon freeing.
>
> This doesn't fully explain the warning here. We'd expect to see it the
> other way round - passing an assumed type of 3 (HIGHATOMIC) for the
> remainder that is actually 1 (MOVABLE). But the pageblock-based
> reservations look fishy. I'll cook up a patch to make this
> range-based. It might just fix it in a way I'm not seeing just yet.
tl;dr: With some debugging printks, I was able to see the
issue. Should be a straight-forward fix.
No order-10 allocations are necessary. Instead, smaller allocations
grab blocks for the highatomic pool. Unreserving is lazy, so as those
allocations get freed, they have a chance to merge together. Two
adjacent highatomic blocks can merge (MAX_ORDER > pageblock_order). On
unreserve, we now have an order-10 highatomic buddy, but only clear
the type on the first order-9 pageblock. A subsequent alloc + expand
will warn about this type inconsistency.
[ 411.188518] UNRESERVE: pfn=26000 order=10
[ 411.188739] ------------[ cut here ]------------
[ 411.188881] 26200: page type is 3, passed migratetype is 1 (nr=512)
[ 411.189097] WARNING: CPU: 1 PID: 10152 at mm/page_alloc.c:645 expand+0x1c8/0x1f0
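The numbers are consistent with that picture; a quick userspace check,
assuming the pfns in the printk above are hex: an order-10 buddy at
pfn 0x26000 has exactly one tail pageblock, at 0x26200, which is the
pfn in the warning, and nr=512 matches pageblock_nr_pages.

#include <stdio.h>

int main(void)
{
	unsigned long pfn = 0x26000, order = 10, pageblock_order = 9;
	unsigned long pageblock_nr_pages = 1UL << pageblock_order;

	/* walk the tail pageblocks of the unreserved order-10 buddy */
	for (unsigned long p = pfn + pageblock_nr_pages;
	     p < pfn + (1UL << order); p += pageblock_nr_pages)
		printf("tail pageblock at pfn %lx (nr=%lu)\n",
		       p, pageblock_nr_pages);
	return 0;
}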
I have a draft patch to make the highatomic reservations update all
blocks inside the range, not just the first one. I'll send it out as
soon as I have tested it properly.
Thanks
* Re: page type is 3, passed migratetype is 1 (nr=512)
From: Johannes Weiner @ 2024-05-30 1:04 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andy Shevchenko, Baolin Wang, linux-mm, Andrew Morton,
Vlastimil Babka, Zi Yan, Mel Gorman
On Wed, May 29, 2024 at 12:28:35PM -0400, Johannes Weiner wrote:
> Thanks for the config, I was able to reproduce it with.
>
> On Tue, May 28, 2024 at 12:48:05PM -0400, Johannes Weiner wrote:
> > Ah, but there DOES seem to be an issue with how we reserve
> > highatomics: reserving and unreserving happens one pageblock at a
> > time, but MAX_ORDER is usually bigger. If we rmqueue() an order-10
> > request, reserve_highatomic_block() will only convert the first
> > order-9 block in it; the tail will remain the original type, which
> > will produce a buddy of mixed type blocks upon freeing.
> >
> > This doesn't fully explain the warning here. We'd expect to see it the
> > other way round - passing an assumed type of 3 (HIGHATOMIC) for the
> > remainder that is actually 1 (MOVABLE). But the pageblock-based
> > reservations look fishy. I'll cook up a patch to make this
> > range-based. It might just fix it in a way I'm not seeing just yet.
>
> tl;dr: With some debugging printks, I was able to see the
> issue. Should be a straight-forward fix.
>
> No order-10 allocations are necessary. Instead, smaller allocations
> grab blocks for the highatomic pool. Unreserving is lazy, so as those
> allocations get freed, they have a chance to merge together. Two
> adjacent highatomic blocks can merge (MAX_ORDER > pageblock_order). On
> unreserve, we now have an order-10 highatomic buddy, but only clear
> the type on the first order-9 pageblock. A subsequent alloc + expand
> will warn about this type inconsistency.
>
> [ 411.188518] UNRESERVE: pfn=26000 order=10
> [ 411.188739] ------------[ cut here ]------------
> [ 411.188881] 26200: page type is 3, passed migratetype is 1 (nr=512)
> [ 411.189097] WARNING: CPU: 1 PID: 10152 at mm/page_alloc.c:645 expand+0x1c8/0x1f0
>
> I have a draft patch to make the highatomic reservations update all
> blocks inside the range, not just the first one. I'll send it out as
> soon as I have tested it properly.
The patch below fixes the issue for me and survives a few hours of
xfstests generic/176.
Christoph, can you please confirm it addresses what you saw?
Vlastimil, Zi, Mel, can you please take a look?
Thanks!
---
From a0ba36bb9cd7404c07c7a56139fd52efc6981ced Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 29 May 2024 18:18:12 -0400
Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
Christoph reports a page allocator splat triggered by xfstests:
generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
[] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
[] XFS (nvme0n1): Ending clean mount
[] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[] XFS (nvme1n1): Ending clean mount
[] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
[] XFS (nvme1n1): Ending clean mount
[] ------------[ cut here ]------------
[] page type is 3, passed migratetype is 1 (nr=512)
[] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
[] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
[] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
[] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[] RIP: 0010:expand+0x1c5/0x1f0
[] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
[] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
[] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
[] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
[] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
[] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
[] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
[] FS: 00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
[] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
[] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[] PKRU: 55555554
[] Call Trace:
[] <TASK>
[] ? __warn+0x7b/0x120
[] ? expand+0x1c5/0x1f0
[] ? report_bug+0x191/0x1c0
[] ? handle_bug+0x3c/0x80
[] ? exc_invalid_op+0x17/0x70
[] ? asm_exc_invalid_op+0x1a/0x20
[] ? expand+0x1c5/0x1f0
[] ? expand+0x1c5/0x1f0
[] __rmqueue_pcplist+0x3a9/0x730
[] get_page_from_freelist+0x7a0/0xf00
[] __alloc_pages_noprof+0x153/0x2e0
[] __folio_alloc_noprof+0x10/0xa0
[] __filemap_get_folio+0x16b/0x370
[] iomap_write_begin+0x496/0x680
While trying to service a movable allocation (page type 1), the page
allocator runs into a two-pageblock buddy on the movable freelist
whose second block is typed as highatomic (page type 3).
This inconsistency is caused by the highatomic reservation system
operating on single pageblocks, while MAX_ORDER can be bigger than
that - in this configuration, pageblock_order is 9 while
MAX_PAGE_ORDER is 10. The test case is observed to make several
adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
marks the surrounding block as highatomic. Upon freeing, the blocks
merge into an order-10 buddy. When the highatomic pool is drained
later on, this order-10 buddy gets moved back to the movable list, but
only the first pageblock is marked movable again. A subsequent
expand() of this buddy warns about the tail being of a different type.
A similar warning could trigger on an order-10 request that only marks
the first pageblock as highatomic and leaves the second one movable.
This is a long-standing bug that's surfaced by the recent block type
warnings added to the allocator. The consequences seem mostly benign;
it just results in odd behavior: the highatomic tail blocks are not
properly drained; instead they end up on the movable list first, then
go back to the highatomic list after an alloc-free cycle.
To fix this, make the highatomic reservation code aware that
allocations/buddies can be larger than a pageblock.
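A minimal userspace sketch of the idea (not the kernel implementation;
pageblock_order = 9 and a hypothetical order-10 buddy are assumed):
reserving or unreserving such a buddy has to retype all
1 << (order - pageblock_order) pageblocks it spans, which is what the
change_pageblock_range() calls in the patch below handle for the
order >= pageblock_order case.

#include <stdio.h>

#define PAGEBLOCK_ORDER	9

static int block_type[64];	/* one migratetype per pageblock */

/* retype every pageblock covered by a buddy of the given order */
static void set_range_type(unsigned long pfn, unsigned int order, int mt)
{
	unsigned long nr_blocks = 1UL << (order - PAGEBLOCK_ORDER);

	for (unsigned long i = 0; i < nr_blocks; i++)
		block_type[(pfn >> PAGEBLOCK_ORDER) + i] = mt;
}

int main(void)
{
	unsigned long pfn = 0x4000;	/* hypothetical order-10 buddy */

	set_range_type(pfn, 10, 3);	/* reserve: both blocks highatomic */
	set_range_type(pfn, 10, 1);	/* unreserve: both blocks movable */
	printf("blocks at %#lx/%#lx typed %d/%d\n",
	       pfn, pfn + (1UL << PAGEBLOCK_ORDER),
	       block_type[pfn >> PAGEBLOCK_ORDER],
	       block_type[(pfn >> PAGEBLOCK_ORDER) + 1]);
	return 0;
}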
While it's an old quirk, the recently added type consistency warnings
seem to be the most prominent consequence of it. Set the Fixes: tag
accordingly to highlight this backporting dependency.
Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/page_alloc.c | 48 +++++++++++++++++++++++++++++++++---------------
1 file changed, 33 insertions(+), 15 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5675ca..754ca5821266 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1955,10 +1955,12 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
}
/*
- * Reserve a pageblock for exclusive use of high-order atomic allocations if
- * there are no empty page blocks that contain a page with a suitable order
+ * Reserve the pageblock(s) surrounding an allocation request for
+ * exclusive use of high-order atomic allocations if there are no
+ * empty page blocks that contain a page with a suitable order
*/
-static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
+static void reserve_highatomic_pageblock(struct page *page, int order,
+ struct zone *zone)
{
int mt;
unsigned long max_managed, flags;
@@ -1984,10 +1986,17 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
/* Yoink! */
mt = get_pageblock_migratetype(page);
/* Only reserve normal pageblocks (i.e., they can merge with others) */
- if (migratetype_is_mergeable(mt))
- if (move_freepages_block(zone, page, mt,
- MIGRATE_HIGHATOMIC) != -1)
- zone->nr_reserved_highatomic += pageblock_nr_pages;
+ if (!migratetype_is_mergeable(mt))
+ goto out_unlock;
+
+ if (order < pageblock_order) {
+ if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
+ goto out_unlock;
+ zone->nr_reserved_highatomic += pageblock_nr_pages;
+ } else {
+ change_pageblock_range(page, order, MIGRATE_HIGHATOMIC);
+ zone->nr_reserved_highatomic += 1 << order;
+ }
out_unlock:
spin_unlock_irqrestore(&zone->lock, flags);
@@ -1999,7 +2008,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
* intense memory pressure but failed atomic allocations should be easier
* to recover from than an OOM.
*
- * If @force is true, try to unreserve a pageblock even though highatomic
+ * If @force is true, try to unreserve pageblocks even though highatomic
* pageblock is exhausted.
*/
static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
@@ -2041,6 +2050,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* adjust the count once.
*/
if (is_migrate_highatomic(mt)) {
+ unsigned long size;
/*
* It should never happen but changes to
* locking could inadvertently allow a per-cpu
@@ -2048,9 +2058,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* while unreserving so be safe and watch for
* underflows.
*/
- zone->nr_reserved_highatomic -= min(
- pageblock_nr_pages,
- zone->nr_reserved_highatomic);
+ size = max(pageblock_nr_pages, 1UL << order);
+ size = min(size, zone->nr_reserved_highatomic);
+ zone->nr_reserved_highatomic -= size;
}
/*
@@ -2062,11 +2072,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* of pageblocks that cannot be completely freed
* may increase.
*/
- ret = move_freepages_block(zone, page, mt,
- ac->migratetype);
+ if (order < pageblock_order)
+ ret = move_freepages_block(zone, page, mt,
+ ac->migratetype);
+ else {
+ move_to_free_list(page, zone, order, mt,
+ ac->migratetype);
+ change_pageblock_range(page, order,
+ ac->migratetype);
+ ret = 1;
+ }
/*
- * Reserving this block already succeeded, so this should
- * not fail on zone boundaries.
+ * Reserving the block(s) already succeeded,
+ * so this should not fail on zone boundaries.
*/
WARN_ON_ONCE(ret == -1);
if (ret > 0) {
--
2.45.1
* Re: page type is 3, passed migratetype is 1 (nr=512)
From: Zi Yan @ 2024-05-30 1:51 UTC (permalink / raw)
To: Johannes Weiner
Cc: Christoph Hellwig, Andy Shevchenko, Baolin Wang, linux-mm,
Andrew Morton, Vlastimil Babka, Mel Gorman
[-- Attachment #1: Type: text/plain, Size: 11349 bytes --]
On 29 May 2024, at 21:04, Johannes Weiner wrote:
> On Wed, May 29, 2024 at 12:28:35PM -0400, Johannes Weiner wrote:
>> Thanks for the config, I was able to reproduce it with.
>>
>> On Tue, May 28, 2024 at 12:48:05PM -0400, Johannes Weiner wrote:
>>> Ah, but there DOES seem to be an issue with how we reserve
>>> highatomics: reserving and unreserving happens one pageblock at a
>>> time, but MAX_ORDER is usually bigger. If we rmqueue() an order-10
>>> request, reserve_highatomic_block() will only convert the first
>>> order-9 block in it; the tail will remain the original type, which
>>> will produce a buddy of mixed type blocks upon freeing.
>>>
>>> This doesn't fully explain the warning here. We'd expect to see it the
>>> other way round - passing an assumed type of 3 (HIGHATOMIC) for the
>>> remainder that is actually 1 (MOVABLE). But the pageblock-based
>>> reservations look fishy. I'll cook up a patch to make this
>>> range-based. It might just fix it in a way I'm not seeing just yet.
>>
>> tl;dr: With some debugging printks, I was able to see the
>> issue. Should be a straight-forward fix.
>>
>> No order-10 allocations are necessary. Instead, smaller allocations
>> grab blocks for the highatomic pool. Unreserving is lazy, so as those
>> allocations get freed, they have a chance to merge together. Two
>> adjacent highatomic blocks can merge (MAX_ORDER > pageblock_order). On
>> unreserve, we now have an order-10 highatomic buddy, but only clear
>> the type on the first order-9 pageblock. A subsequent alloc + expand
>> will warn about this type inconsistency.
>>
>> [ 411.188518] UNRESERVE: pfn=26000 order=10
>> [ 411.188739] ------------[ cut here ]------------
>> [ 411.188881] 26200: page type is 3, passed migratetype is 1 (nr=512)
>> [ 411.189097] WARNING: CPU: 1 PID: 10152 at mm/page_alloc.c:645 expand+0x1c8/0x1f0
>>
>> I have a draft patch to make the highatomic reservations update all
>> blocks inside the range, not just the first one. I'll send it out as
>> soon as I have tested it properly.
>
> The below fixes the issue for me and survives a few hours of
> xfstest/generic/176 for me.
>
> Christoph, can you please confirm it addresses what you saw?
>
> Vlastimil, Zi, Mel, can you please take a look?
>
> Thanks!
>
> ---
>
> From a0ba36bb9cd7404c07c7a56139fd52efc6981ced Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Wed, 29 May 2024 18:18:12 -0400
> Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
>
> Christoph reports a page allocator splat triggered by xfstests:
>
> generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
> [] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
> [] XFS (nvme0n1): Ending clean mount
> [] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [] XFS (nvme1n1): Ending clean mount
> [] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
> [] XFS (nvme1n1): Ending clean mount
> [] ------------[ cut here ]------------
> [] page type is 3, passed migratetype is 1 (nr=512)
> [] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
> [] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
> [] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
> [] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [] RIP: 0010:expand+0x1c5/0x1f0
> [] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
> [] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
> [] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
> [] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
> [] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
> [] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
> [] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
> [] FS: 00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
> [] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
> [] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [] PKRU: 55555554
> [] Call Trace:
> [] <TASK>
> [] ? __warn+0x7b/0x120
> [] ? expand+0x1c5/0x1f0
> [] ? report_bug+0x191/0x1c0
> [] ? handle_bug+0x3c/0x80
> [] ? exc_invalid_op+0x17/0x70
> [] ? asm_exc_invalid_op+0x1a/0x20
> [] ? expand+0x1c5/0x1f0
> [] ? expand+0x1c5/0x1f0
> [] __rmqueue_pcplist+0x3a9/0x730
> [] get_page_from_freelist+0x7a0/0xf00
> [] __alloc_pages_noprof+0x153/0x2e0
> [] __folio_alloc_noprof+0x10/0xa0
> [] __filemap_get_folio+0x16b/0x370
> [] iomap_write_begin+0x496/0x680
>
> While trying to service a movable allocation (page type 1), the page
> allocator runs into a two-pageblock buddy on the movable freelist
> whose second block is typed as highatomic (page type 3).
>
> This inconsistency is caused by the highatomic reservation system
> operating on single pageblocks, while MAX_ORDER can be bigger than
> that - in this configuration, pageblock_order is 9 while
> MAX_PAGE_ORDER is 10. The test case is observed to make several
> adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
> marks the surrounding block as highatomic. Upon freeing, the blocks
> merge into an order-10 buddy. When the highatomic pool is drained
> later on, this order-10 buddy gets moved back to the movable list, but
> only the first pageblock is marked movable again. A subsequent
> expand() of this buddy warns about the tail being of a different type.
>
> A similar warning could trigger on an order-10 request that only marks
> the first pageblock as highatomic and leaves the second one movable.
>
> This is a long-standing bug that's surfaced by the recent block type
> warnings added to the allocator. The consequences seem mostly benign,
> it just results in odd behavior: the highatomic tail blocks are not
> properly drained, instead they end up on the movable list first, then
> go back to the highatomic list after an alloc-free cycle.
>
> To fix this, make the highatomic reservation code aware that
> allocations/buddies can be larger than a pageblock.
>
> While it's an old quirk, the recently added type consistency warnings
> seem to be the most prominent consequence of it. Set the Fixes: tag
> accordingly to highlight this backporting dependency.
>
> Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
> Reported-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> mm/page_alloc.c | 48 +++++++++++++++++++++++++++++++++---------------
> 1 file changed, 33 insertions(+), 15 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2e22ce5675ca..754ca5821266 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1955,10 +1955,12 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
> }
>
> /*
> - * Reserve a pageblock for exclusive use of high-order atomic allocations if
> - * there are no empty page blocks that contain a page with a suitable order
> + * Reserve the pageblock(s) surrounding an allocation request for
> + * exclusive use of high-order atomic allocations if there are no
> + * empty page blocks that contain a page with a suitable order
> */
> -static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> +static void reserve_highatomic_pageblock(struct page *page, int order,
> + struct zone *zone)
> {
> int mt;
> unsigned long max_managed, flags;
> @@ -1984,10 +1986,17 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> /* Yoink! */
> mt = get_pageblock_migratetype(page);
> /* Only reserve normal pageblocks (i.e., they can merge with others) */
> - if (migratetype_is_mergeable(mt))
> - if (move_freepages_block(zone, page, mt,
> - MIGRATE_HIGHATOMIC) != -1)
> - zone->nr_reserved_highatomic += pageblock_nr_pages;
> + if (!migratetype_is_mergeable(mt))
> + goto out_unlock;
> +
> + if (order < pageblock_order) {
> + if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
> + goto out_unlock;
> + zone->nr_reserved_highatomic += pageblock_nr_pages;
> + } else {
> + change_pageblock_range(page, order, MIGRATE_HIGHATOMIC);
Don't you also need to move (page, order) to the MIGRATE_HIGHATOMIC free list here?
> + zone->nr_reserved_highatomic += 1 << order;
> + }
You could do:
zone->nr_reserved_highatomic += max_t(unsigned long, pageblock_nr_pages, 1 << order);
instead. But I have no strong opinion on it.
>
> out_unlock:
> spin_unlock_irqrestore(&zone->lock, flags);
> @@ -1999,7 +2008,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> * intense memory pressure but failed atomic allocations should be easier
> * to recover from than an OOM.
> *
> - * If @force is true, try to unreserve a pageblock even though highatomic
> + * If @force is true, try to unreserve pageblocks even though highatomic
> * pageblock is exhausted.
> */
> static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> @@ -2041,6 +2050,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> * adjust the count once.
> */
> if (is_migrate_highatomic(mt)) {
> + unsigned long size;
> /*
> * It should never happen but changes to
> * locking could inadvertently allow a per-cpu
> @@ -2048,9 +2058,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> * while unreserving so be safe and watch for
> * underflows.
> */
> - zone->nr_reserved_highatomic -= min(
> - pageblock_nr_pages,
> - zone->nr_reserved_highatomic);
> + size = max(pageblock_nr_pages, 1UL << order);
> + size = min(size, zone->nr_reserved_highatomic);
> + zone->nr_reserved_highatomic -= size;
> }
>
> /*
> @@ -2062,11 +2072,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> * of pageblocks that cannot be completely freed
> * may increase.
> */
> - ret = move_freepages_block(zone, page, mt,
> - ac->migratetype);
> + if (order < pageblock_order)
> + ret = move_freepages_block(zone, page, mt,
> + ac->migratetype);
> + else {
> + move_to_free_list(page, zone, order, mt,
> + ac->migratetype);
> + change_pageblock_range(page, order,
> + ac->migratetype);
> + ret = 1;
> + }
> /*
> - * Reserving this block already succeeded, so this should
> - * not fail on zone boundaries.
> + * Reserving the block(s) already succeeded,
> + * so this should not fail on zone boundaries.
> */
> WARN_ON_ONCE(ret == -1);
> if (ret > 0) {
> --
> 2.45.1
This part looks good to me.
--
Best Regards,
Yan, Zi
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]
* Re: page type is 3, passed migratetype is 1 (nr=512)
From: Johannes Weiner @ 2024-05-30 3:22 UTC (permalink / raw)
To: Zi Yan
Cc: Christoph Hellwig, Andy Shevchenko, Baolin Wang, linux-mm,
Andrew Morton, Vlastimil Babka, Mel Gorman
On Wed, May 29, 2024 at 09:51:24PM -0400, Zi Yan wrote:
> On 29 May 2024, at 21:04, Johannes Weiner wrote:
>
> > On Wed, May 29, 2024 at 12:28:35PM -0400, Johannes Weiner wrote:
> >> Thanks for the config, I was able to reproduce it with.
> >>
> >> On Tue, May 28, 2024 at 12:48:05PM -0400, Johannes Weiner wrote:
> >>> Ah, but there DOES seem to be an issue with how we reserve
> >>> highatomics: reserving and unreserving happens one pageblock at a
> >>> time, but MAX_ORDER is usually bigger. If we rmqueue() an order-10
> >>> request, reserve_highatomic_block() will only convert the first
> >>> order-9 block in it; the tail will remain the original type, which
> >>> will produce a buddy of mixed type blocks upon freeing.
> >>>
> >>> This doesn't fully explain the warning here. We'd expect to see it the
> >>> other way round - passing an assumed type of 3 (HIGHATOMIC) for the
> >>> remainder that is actually 1 (MOVABLE). But the pageblock-based
> >>> reservations look fishy. I'll cook up a patch to make this
> >>> range-based. It might just fix it in a way I'm not seeing just yet.
> >>
> >> tl;dr: With some debugging printks, I was able to see the
> >> issue. Should be a straight-forward fix.
> >>
> >> No order-10 allocations are necessary. Instead, smaller allocations
> >> grab blocks for the highatomic pool. Unreserving is lazy, so as those
> >> allocations get freed, they have a chance to merge together. Two
> >> adjacent highatomic blocks can merge (MAX_ORDER > pageblock_order). On
> >> unreserve, we now have an order-10 highatomic buddy, but only clear
> >> the type on the first order-9 pageblock. A subsequent alloc + expand
> >> will warn about this type inconsistency.
> >>
> >> [ 411.188518] UNRESERVE: pfn=26000 order=10
> >> [ 411.188739] ------------[ cut here ]------------
> >> [ 411.188881] 26200: page type is 3, passed migratetype is 1 (nr=512)
> >> [ 411.189097] WARNING: CPU: 1 PID: 10152 at mm/page_alloc.c:645 expand+0x1c8/0x1f0
> >>
> >> I have a draft patch to make the highatomic reservations update all
> >> blocks inside the range, not just the first one. I'll send it out as
> >> soon as I have tested it properly.
> >
> > The below fixes the issue for me and survives a few hours of
> > xfstest/generic/176 for me.
> >
> > Christoph, can you please confirm it addresses what you saw?
> >
> > Vlastimil, Zi, Mel, can you please take a look?
> >
> > Thanks!
> >
> > ---
> >
> > From a0ba36bb9cd7404c07c7a56139fd52efc6981ced Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Wed, 29 May 2024 18:18:12 -0400
> > Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
> >
> > Christoph reports a page allocator splat triggered by xfstests:
> >
> > generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
> > [] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
> > [] XFS (nvme0n1): Ending clean mount
> > [] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> > [] XFS (nvme1n1): Ending clean mount
> > [] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> > [] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
> > [] XFS (nvme1n1): Ending clean mount
> > [] ------------[ cut here ]------------
> > [] page type is 3, passed migratetype is 1 (nr=512)
> > [] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
> > [] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
> > [] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
> > [] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> > [] RIP: 0010:expand+0x1c5/0x1f0
> > [] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
> > [] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
> > [] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
> > [] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
> > [] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
> > [] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
> > [] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
> > [] FS: 00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
> > [] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
> > [] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> > [] PKRU: 55555554
> > [] Call Trace:
> > [] <TASK>
> > [] ? __warn+0x7b/0x120
> > [] ? expand+0x1c5/0x1f0
> > [] ? report_bug+0x191/0x1c0
> > [] ? handle_bug+0x3c/0x80
> > [] ? exc_invalid_op+0x17/0x70
> > [] ? asm_exc_invalid_op+0x1a/0x20
> > [] ? expand+0x1c5/0x1f0
> > [] ? expand+0x1c5/0x1f0
> > [] __rmqueue_pcplist+0x3a9/0x730
> > [] get_page_from_freelist+0x7a0/0xf00
> > [] __alloc_pages_noprof+0x153/0x2e0
> > [] __folio_alloc_noprof+0x10/0xa0
> > [] __filemap_get_folio+0x16b/0x370
> > [] iomap_write_begin+0x496/0x680
> >
> > While trying to service a movable allocation (page type 1), the page
> > allocator runs into a two-pageblock buddy on the movable freelist
> > whose second block is typed as highatomic (page type 3).
> >
> > This inconsistency is caused by the highatomic reservation system
> > operating on single pageblocks, while MAX_ORDER can be bigger than
> > that - in this configuration, pageblock_order is 9 while
> > MAX_PAGE_ORDER is 10. The test case is observed to make several
> > adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
> > marks the surrounding block as highatomic. Upon freeing, the blocks
> > merge into an order-10 buddy. When the highatomic pool is drained
> > later on, this order-10 buddy gets moved back to the movable list, but
> > only the first pageblock is marked movable again. A subsequent
> > expand() of this buddy warns about the tail being of a different type.
> >
> > A similar warning could trigger on an order-10 request that only marks
> > the first pageblock as highatomic and leaves the second one movable.
> >
> > This is a long-standing bug that's surfaced by the recent block type
> > warnings added to the allocator. The consequences seem mostly benign,
> > it just results in odd behavior: the highatomic tail blocks are not
> > properly drained, instead they end up on the movable list first, then
> > go back to the highatomic list after an alloc-free cycle.
> >
> > To fix this, make the highatomic reservation code aware that
> > allocations/buddies can be larger than a pageblock.
> >
> > While it's an old quirk, the recently added type consistency warnings
> > seem to be the most prominent consequence of it. Set the Fixes: tag
> > accordingly to highlight this backporting dependency.
> >
> > Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
> > Reported-by: Christoph Hellwig <hch@infradead.org>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> > mm/page_alloc.c | 48 +++++++++++++++++++++++++++++++++---------------
> > 1 file changed, 33 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2e22ce5675ca..754ca5821266 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1955,10 +1955,12 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
> > }
> >
> > /*
> > - * Reserve a pageblock for exclusive use of high-order atomic allocations if
> > - * there are no empty page blocks that contain a page with a suitable order
> > + * Reserve the pageblock(s) surrounding an allocation request for
> > + * exclusive use of high-order atomic allocations if there are no
> > + * empty page blocks that contain a page with a suitable order
> > */
> > -static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> > +static void reserve_highatomic_pageblock(struct page *page, int order,
> > + struct zone *zone)
> > {
> > int mt;
> > unsigned long max_managed, flags;
> > @@ -1984,10 +1986,17 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> > /* Yoink! */
> > mt = get_pageblock_migratetype(page);
> > /* Only reserve normal pageblocks (i.e., they can merge with others) */
> > - if (migratetype_is_mergeable(mt))
> > - if (move_freepages_block(zone, page, mt,
> > - MIGRATE_HIGHATOMIC) != -1)
> > - zone->nr_reserved_highatomic += pageblock_nr_pages;
> > + if (!migratetype_is_mergeable(mt))
> > + goto out_unlock;
> > +
> > + if (order < pageblock_order) {
> > + if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
> > + goto out_unlock;
> > + zone->nr_reserved_highatomic += pageblock_nr_pages;
> > + } else {
> > + change_pageblock_range(page, order, MIGRATE_HIGHATOMIC);
>
> Don't you also need to move (page, order) to MIGRATE_HIGHATOMIC free list here?
Reserving happens after rmqueue(), so the page is already removed from
the freelist. If the page is smaller than a pageblock, there might be
other buddies in the block that need to be moved, but if the page is
one or more whole pageblocks, there are no other buddies in the range.
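A small userspace check of that argument, reusing the pfn/order pair
from the earlier debug output; the key property is simply the buddy
allocator's natural alignment of a free chunk to its own order:

#include <assert.h>
#include <stdio.h>

#define PAGEBLOCK_ORDER 9UL

int main(void)
{
	unsigned long pfn = 0x26000, order = 10;

	/* a free buddy of order n is aligned to 1 << n ... */
	assert((pfn & ((1UL << order) - 1)) == 0);
	/* ... so for order >= pageblock_order it is pageblock-aligned
	 * and covers nothing but whole pageblocks */
	assert((pfn & ((1UL << PAGEBLOCK_ORDER) - 1)) == 0);
	printf("pfn %#lx, order %lu: %lu whole pageblock(s), no other buddies inside\n",
	       pfn, order, 1UL << (order - PAGEBLOCK_ORDER));
	return 0;
}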
> > + zone->nr_reserved_highatomic += 1 << order;
> > + }
>
> You could do:
>
> zone->nr_reserved_highatomic += max_t(unsigned long, pageblock_nr_pages, 1 << order);
>
> instead. But I have no strong opinion on it.
Hm, I played around with it, but since we have the if-else already, I
figure it's a bit more readable to split the updates rather than have
a more complicated max() expression before/after.
I would prefer it in the unreserve_highatomic_pageblock() path as
well, but it still has that "migratetype change is racy" check, which
makes that impractical. That check seems no longer needed, though, now
that we take zone->lock for every type test/set, so I was planning to
send a follow-up to remove that.
> > out_unlock:
> > spin_unlock_irqrestore(&zone->lock, flags);
> > @@ -1999,7 +2008,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> > * intense memory pressure but failed atomic allocations should be easier
> > * to recover from than an OOM.
> > *
> > - * If @force is true, try to unreserve a pageblock even though highatomic
> > + * If @force is true, try to unreserve pageblocks even though highatomic
> > * pageblock is exhausted.
> > */
> > static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> > @@ -2041,6 +2050,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> > * adjust the count once.
> > */
> > if (is_migrate_highatomic(mt)) {
> > + unsigned long size;
> > /*
> > * It should never happen but changes to
> > * locking could inadvertently allow a per-cpu
> > @@ -2048,9 +2058,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> > * while unreserving so be safe and watch for
> > * underflows.
> > */
> > - zone->nr_reserved_highatomic -= min(
> > - pageblock_nr_pages,
> > - zone->nr_reserved_highatomic);
> > + size = max(pageblock_nr_pages, 1UL << order);
> > + size = min(size, zone->nr_reserved_highatomic);
> > + zone->nr_reserved_highatomic -= size;
> > }
> >
> > /*
> > @@ -2062,11 +2072,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> > * of pageblocks that cannot be completely freed
> > * may increase.
> > */
> > - ret = move_freepages_block(zone, page, mt,
> > - ac->migratetype);
> > + if (order < pageblock_order)
> > + ret = move_freepages_block(zone, page, mt,
> > + ac->migratetype);
> > + else {
> > + move_to_free_list(page, zone, order, mt,
> > + ac->migratetype);
> > + change_pageblock_range(page, order,
> > + ac->migratetype);
> > + ret = 1;
> > + }
> > /*
> > - * Reserving this block already succeeded, so this should
> > - * not fail on zone boundaries.
> > + * Reserving the block(s) already succeeded,
> > + * so this should not fail on zone boundaries.
> > */
> > WARN_ON_ONCE(ret == -1);
> > if (ret > 0) {
> > --
> > 2.45.1
>
> This part looks good to me.
Thanks for taking a look!
* Re: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
From: kernel test robot @ 2024-05-30 4:06 UTC (permalink / raw)
To: Johannes Weiner, Christoph Hellwig
Cc: oe-kbuild-all, Andy Shevchenko, Baolin Wang, linux-mm,
Andrew Morton, Vlastimil Babka, Zi Yan, Mel Gorman
Hi Johannes,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Johannes-Weiner/mm-page_alloc-fix-highatomic-typing-in-multi-block-buddies/20240530-090639
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240530010419.GA1132939%40cmpxchg.org
patch subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
config: i386-buildonly-randconfig-001-20240530 (https://download.01.org/0day-ci/archive/20240530/202405301134.V8IUApym-lkp@intel.com/config)
compiler: gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240530/202405301134.V8IUApym-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202405301134.V8IUApym-lkp@intel.com/
All error/warnings (new ones prefixed by >>):
mm/page_alloc.c: In function 'get_page_from_freelist':
>> mm/page_alloc.c:3464:68: warning: passing argument 2 of 'reserve_highatomic_pageblock' makes integer from pointer without a cast [-Wint-conversion]
3464 | reserve_highatomic_pageblock(page, zone);
| ^~~~
| |
| struct zone *
mm/page_alloc.c:1964:65: note: expected 'int' but argument is of type 'struct zone *'
1964 | static void reserve_highatomic_pageblock(struct page *page, int order,
| ~~~~^~~~~
>> mm/page_alloc.c:3464:33: error: too few arguments to function 'reserve_highatomic_pageblock'
3464 | reserve_highatomic_pageblock(page, zone);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/page_alloc.c:1964:13: note: declared here
1964 | static void reserve_highatomic_pageblock(struct page *page, int order,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
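Presumably the call site in get_page_from_freelist() just needs to
pass the request order through as well; this is not part of the posted
patch, only the obvious adjustment suggested by the new prototype
(order is already a parameter of the function, see the context below):

-			reserve_highatomic_pageblock(page, zone);
+			reserve_highatomic_pageblock(page, order, zone);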
vim +/reserve_highatomic_pageblock +3464 mm/page_alloc.c
8510e69c8efef8 Joonsoo Kim 2020-08-06 3310
7fb1d9fca5c6e3 Rohit Seth 2005-11-13 3311 /*
0798e5193cd70f Paul Jackson 2006-12-06 3312 * get_page_from_freelist goes through the zonelist trying to allocate
7fb1d9fca5c6e3 Rohit Seth 2005-11-13 3313 * a page.
7fb1d9fca5c6e3 Rohit Seth 2005-11-13 3314 */
7fb1d9fca5c6e3 Rohit Seth 2005-11-13 3315 static struct page *
a9263751e11a07 Vlastimil Babka 2015-02-11 3316 get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
a9263751e11a07 Vlastimil Babka 2015-02-11 3317 const struct alloc_context *ac)
753ee728964e5a Martin Hicks 2005-06-21 3318 {
6bb154504f8b49 Mel Gorman 2018-12-28 3319 struct zoneref *z;
5117f45d11a9ee Mel Gorman 2009-06-16 3320 struct zone *zone;
8a87d6959f0d81 Wonhyuk Yang 2022-05-12 3321 struct pglist_data *last_pgdat = NULL;
8a87d6959f0d81 Wonhyuk Yang 2022-05-12 3322 bool last_pgdat_dirty_ok = false;
6bb154504f8b49 Mel Gorman 2018-12-28 3323 bool no_fallback;
3b8c0be43cb844 Mel Gorman 2016-07-28 3324
6bb154504f8b49 Mel Gorman 2018-12-28 3325 retry:
7fb1d9fca5c6e3 Rohit Seth 2005-11-13 3326 /*
9276b1bc96a132 Paul Jackson 2006-12-06 3327 * Scan zonelist, looking for a zone with enough free.
8e4645226b4931 Haifeng Xu 2023-02-28 3328 * See also cpuset_node_allowed() comment in kernel/cgroup/cpuset.c.
7fb1d9fca5c6e3 Rohit Seth 2005-11-13 3329 */
6bb154504f8b49 Mel Gorman 2018-12-28 3330 no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
6bb154504f8b49 Mel Gorman 2018-12-28 3331 z = ac->preferred_zoneref;
30d8ec73e8772b Mateusz Nosek 2020-10-13 3332 for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,
30d8ec73e8772b Mateusz Nosek 2020-10-13 3333 ac->nodemask) {
be06af002f6d50 Mel Gorman 2016-05-19 3334 struct page *page;
e085dbc52fad8d Johannes Weiner 2013-09-11 3335 unsigned long mark;
e085dbc52fad8d Johannes Weiner 2013-09-11 3336
664eeddeef6539 Mel Gorman 2014-06-04 3337 if (cpusets_enabled() &&
664eeddeef6539 Mel Gorman 2014-06-04 3338 (alloc_flags & ALLOC_CPUSET) &&
002f290627c270 Vlastimil Babka 2016-05-19 3339 !__cpuset_zone_allowed(zone, gfp_mask))
cd38b115d5ad79 Mel Gorman 2011-07-25 3340 continue;
a756cf5908530e Johannes Weiner 2012-01-10 3341 /*
a756cf5908530e Johannes Weiner 2012-01-10 3342 * When allocating a page cache page for writing, we
281e37265f2826 Mel Gorman 2016-07-28 3343 * want to get it from a node that is within its dirty
281e37265f2826 Mel Gorman 2016-07-28 3344 * limit, such that no single node holds more than its
a756cf5908530e Johannes Weiner 2012-01-10 3345 * proportional share of globally allowed dirty pages.
281e37265f2826 Mel Gorman 2016-07-28 3346 * The dirty limits take into account the node's
a756cf5908530e Johannes Weiner 2012-01-10 3347 * lowmem reserves and high watermark so that kswapd
a756cf5908530e Johannes Weiner 2012-01-10 3348 * should be able to balance it without having to
a756cf5908530e Johannes Weiner 2012-01-10 3349 * write pages from its LRU list.
a756cf5908530e Johannes Weiner 2012-01-10 3350 *
a756cf5908530e Johannes Weiner 2012-01-10 3351 * XXX: For now, allow allocations to potentially
281e37265f2826 Mel Gorman 2016-07-28 3352 * exceed the per-node dirty limit in the slowpath
c9ab0c4fbeb020 Mel Gorman 2015-11-06 3353 * (spread_dirty_pages unset) before going into reclaim,
a756cf5908530e Johannes Weiner 2012-01-10 3354 * which is important when on a NUMA setup the allowed
281e37265f2826 Mel Gorman 2016-07-28 3355 * nodes are together not big enough to reach the
a756cf5908530e Johannes Weiner 2012-01-10 3356 * global limit. The proper fix for these situations
281e37265f2826 Mel Gorman 2016-07-28 3357 * will require awareness of nodes in the
a756cf5908530e Johannes Weiner 2012-01-10 3358 * dirty-throttling and the flusher threads.
a756cf5908530e Johannes Weiner 2012-01-10 3359 */
3b8c0be43cb844 Mel Gorman 2016-07-28 3360 if (ac->spread_dirty_pages) {
8a87d6959f0d81 Wonhyuk Yang 2022-05-12 3361 if (last_pgdat != zone->zone_pgdat) {
8a87d6959f0d81 Wonhyuk Yang 2022-05-12 3362 last_pgdat = zone->zone_pgdat;
8a87d6959f0d81 Wonhyuk Yang 2022-05-12 3363 last_pgdat_dirty_ok = node_dirty_ok(zone->zone_pgdat);
8a87d6959f0d81 Wonhyuk Yang 2022-05-12 3364 }
3b8c0be43cb844 Mel Gorman 2016-07-28 3365
8a87d6959f0d81 Wonhyuk Yang 2022-05-12 3366 if (!last_pgdat_dirty_ok)
800a1e750c7b04 Mel Gorman 2014-06-04 3367 continue;
3b8c0be43cb844 Mel Gorman 2016-07-28 3368 }
7fb1d9fca5c6e3 Rohit Seth 2005-11-13 3369
6bb154504f8b49 Mel Gorman 2018-12-28 3370 if (no_fallback && nr_online_nodes > 1 &&
6bb154504f8b49 Mel Gorman 2018-12-28 3371 zone != ac->preferred_zoneref->zone) {
6bb154504f8b49 Mel Gorman 2018-12-28 3372 int local_nid;
6bb154504f8b49 Mel Gorman 2018-12-28 3373
6bb154504f8b49 Mel Gorman 2018-12-28 3374 /*
6bb154504f8b49 Mel Gorman 2018-12-28 3375 * If moving to a remote node, retry but allow
6bb154504f8b49 Mel Gorman 2018-12-28 3376 * fragmenting fallbacks. Locality is more important
6bb154504f8b49 Mel Gorman 2018-12-28 3377 * than fragmentation avoidance.
6bb154504f8b49 Mel Gorman 2018-12-28 3378 */
6bb154504f8b49 Mel Gorman 2018-12-28 3379 local_nid = zone_to_nid(ac->preferred_zoneref->zone);
6bb154504f8b49 Mel Gorman 2018-12-28 3380 if (zone_to_nid(zone) != local_nid) {
6bb154504f8b49 Mel Gorman 2018-12-28 3381 alloc_flags &= ~ALLOC_NOFRAGMENT;
6bb154504f8b49 Mel Gorman 2018-12-28 3382 goto retry;
6bb154504f8b49 Mel Gorman 2018-12-28 3383 }
6bb154504f8b49 Mel Gorman 2018-12-28 3384 }
6bb154504f8b49 Mel Gorman 2018-12-28 3385
57c0419c5f0ea2 Huang Ying 2023-10-16 3386 /*
57c0419c5f0ea2 Huang Ying 2023-10-16 3387 * Detect whether the number of free pages is below high
57c0419c5f0ea2 Huang Ying 2023-10-16 3388 * watermark. If so, we will decrease pcp->high and free
57c0419c5f0ea2 Huang Ying 2023-10-16 3389 * PCP pages in free path to reduce the possibility of
57c0419c5f0ea2 Huang Ying 2023-10-16 3390 * premature page reclaiming. Detection is done here to
57c0419c5f0ea2 Huang Ying 2023-10-16 3391 * avoid to do that in hotter free path.
57c0419c5f0ea2 Huang Ying 2023-10-16 3392 */
57c0419c5f0ea2 Huang Ying 2023-10-16 3393 if (test_bit(ZONE_BELOW_HIGH, &zone->flags))
57c0419c5f0ea2 Huang Ying 2023-10-16 3394 goto check_alloc_wmark;
57c0419c5f0ea2 Huang Ying 2023-10-16 3395
57c0419c5f0ea2 Huang Ying 2023-10-16 3396 mark = high_wmark_pages(zone);
57c0419c5f0ea2 Huang Ying 2023-10-16 3397 if (zone_watermark_fast(zone, order, mark,
57c0419c5f0ea2 Huang Ying 2023-10-16 3398 ac->highest_zoneidx, alloc_flags,
57c0419c5f0ea2 Huang Ying 2023-10-16 3399 gfp_mask))
57c0419c5f0ea2 Huang Ying 2023-10-16 3400 goto try_this_zone;
57c0419c5f0ea2 Huang Ying 2023-10-16 3401 else
57c0419c5f0ea2 Huang Ying 2023-10-16 3402 set_bit(ZONE_BELOW_HIGH, &zone->flags);
57c0419c5f0ea2 Huang Ying 2023-10-16 3403
57c0419c5f0ea2 Huang Ying 2023-10-16 3404 check_alloc_wmark:
a921444382b49c Mel Gorman 2018-12-28 3405 mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
48ee5f3696f624 Mel Gorman 2016-05-19 3406 if (!zone_watermark_fast(zone, order, mark,
f80b08fc44536a Charan Teja Reddy 2020-08-06 3407 ac->highest_zoneidx, alloc_flags,
f80b08fc44536a Charan Teja Reddy 2020-08-06 3408 gfp_mask)) {
e085dbc52fad8d Johannes Weiner 2013-09-11 3409 int ret;
fa5e084e43eb14 Mel Gorman 2009-06-16 3410
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06 3411 if (has_unaccepted_memory()) {
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06 3412 if (try_to_accept_memory(zone, order))
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06 3413 goto try_this_zone;
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06 3414 }
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06 3415
c9e97a1997fbf3 Pavel Tatashin 2018-04-05 3416 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
c9e97a1997fbf3 Pavel Tatashin 2018-04-05 3417 /*
c9e97a1997fbf3 Pavel Tatashin 2018-04-05 3418 * Watermark failed for this zone, but see if we can
c9e97a1997fbf3 Pavel Tatashin 2018-04-05 3419 * grow this zone if it contains deferred pages.
c9e97a1997fbf3 Pavel Tatashin 2018-04-05 3420 */
076cf7ea67010d Anshuman Khandual 2023-01-05 3421 if (deferred_pages_enabled()) {
c9e97a1997fbf3 Pavel Tatashin 2018-04-05 3422 if (_deferred_grow_zone(zone, order))
c9e97a1997fbf3 Pavel Tatashin 2018-04-05 3423 goto try_this_zone;
c9e97a1997fbf3 Pavel Tatashin 2018-04-05 3424 }
c9e97a1997fbf3 Pavel Tatashin 2018-04-05 3425 #endif
5dab29113ca563 Mel Gorman 2014-06-04 3426 /* Checked here to keep the fast path fast */
5dab29113ca563 Mel Gorman 2014-06-04 3427 BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
5dab29113ca563 Mel Gorman 2014-06-04 3428 if (alloc_flags & ALLOC_NO_WATERMARKS)
5dab29113ca563 Mel Gorman 2014-06-04 3429 goto try_this_zone;
5dab29113ca563 Mel Gorman 2014-06-04 3430
202e35db5e719e Dave Hansen 2021-05-04 3431 if (!node_reclaim_enabled() ||
c33d6c06f60f71 Mel Gorman 2016-05-19 3432 !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
cd38b115d5ad79 Mel Gorman 2011-07-25 3433 continue;
cd38b115d5ad79 Mel Gorman 2011-07-25 3434
a5f5f91da6ad64 Mel Gorman 2016-07-28 3435 ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
fa5e084e43eb14 Mel Gorman 2009-06-16 3436 switch (ret) {
a5f5f91da6ad64 Mel Gorman 2016-07-28 3437 case NODE_RECLAIM_NOSCAN:
fa5e084e43eb14 Mel Gorman 2009-06-16 3438 /* did not scan */
cd38b115d5ad79 Mel Gorman 2011-07-25 3439 continue;
a5f5f91da6ad64 Mel Gorman 2016-07-28 3440 case NODE_RECLAIM_FULL:
fa5e084e43eb14 Mel Gorman 2009-06-16 3441 /* scanned but unreclaimable */
cd38b115d5ad79 Mel Gorman 2011-07-25 3442 continue;
fa5e084e43eb14 Mel Gorman 2009-06-16 3443 default:
fa5e084e43eb14 Mel Gorman 2009-06-16 3444 /* did we reclaim enough */
fed2719e7a8612 Mel Gorman 2013-04-29 3445 if (zone_watermark_ok(zone, order, mark,
97a225e69a1f88 Joonsoo Kim 2020-06-03 3446 ac->highest_zoneidx, alloc_flags))
fed2719e7a8612 Mel Gorman 2013-04-29 3447 goto try_this_zone;
fed2719e7a8612 Mel Gorman 2013-04-29 3448
fed2719e7a8612 Mel Gorman 2013-04-29 3449 continue;
7fb1d9fca5c6e3 Rohit Seth 2005-11-13 3450 }
0798e5193cd70f Paul Jackson 2006-12-06 3451 }
7fb1d9fca5c6e3 Rohit Seth 2005-11-13 3452
fa5e084e43eb14 Mel Gorman 2009-06-16 3453 try_this_zone:
066b23935578d3 Mel Gorman 2017-02-24 3454 page = rmqueue(ac->preferred_zoneref->zone, zone, order,
0aaa29a56e4fb0 Mel Gorman 2015-11-06 3455 gfp_mask, alloc_flags, ac->migratetype);
753791910e23a9 Vlastimil Babka 2015-02-11 3456 if (page) {
479f854a207ce2 Mel Gorman 2016-05-19 3457 prep_new_page(page, order, gfp_mask, alloc_flags);
0aaa29a56e4fb0 Mel Gorman 2015-11-06 3458
0aaa29a56e4fb0 Mel Gorman 2015-11-06 3459 /*
0aaa29a56e4fb0 Mel Gorman 2015-11-06 3460 * If this is a high-order atomic allocation then check
0aaa29a56e4fb0 Mel Gorman 2015-11-06 3461 * if the pageblock should be reserved for the future
0aaa29a56e4fb0 Mel Gorman 2015-11-06 3462 */
eb2e2b425c6984 Mel Gorman 2023-01-13 3463 if (unlikely(alloc_flags & ALLOC_HIGHATOMIC))
368d983b985572 ZhangPeng 2023-08-09 @3464 reserve_highatomic_pageblock(page, zone);
0aaa29a56e4fb0 Mel Gorman 2015-11-06 3465
753791910e23a9 Vlastimil Babka 2015-02-11 3466 return page;
c9e97a1997fbf3 Pavel Tatashin 2018-04-05 3467 } else {
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06 3468 if (has_unaccepted_memory()) {
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06 3469 if (try_to_accept_memory(zone, order))
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06 3470 goto try_this_zone;
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06 3471 }
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06 3472
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: page type is 3, passed migratetype is 1 (nr=512)
2024-05-30 1:04 ` Johannes Weiner
2024-05-30 1:51 ` Zi Yan
2024-05-30 4:06 ` [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies kernel test robot
@ 2024-05-30 11:42 ` Johannes Weiner
2024-05-30 14:34 ` Zi Yan
2024-05-31 13:43 ` Vlastimil Babka
2024-05-31 5:41 ` Christoph Hellwig
3 siblings, 2 replies; 13+ messages in thread
From: Johannes Weiner @ 2024-05-30 11:42 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andy Shevchenko, Baolin Wang, linux-mm, Andrew Morton,
Vlastimil Babka, Zi Yan, Mel Gorman
On Wed, May 29, 2024 at 09:04:25PM -0400, Johannes Weiner wrote:
> Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
Argh, I dropped the reserve_highatomic_pageblock() caller update when
removing the printks right before sending out. My apologies. Here is
the fixed version:
---
From 6aa9498ee0d7161b0605251116d16b18cd448552 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 29 May 2024 18:18:12 -0400
Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
Christoph reports a page allocator splat triggered by xfstests:
generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
[] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
[] XFS (nvme0n1): Ending clean mount
[] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[] XFS (nvme1n1): Ending clean mount
[] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
[] XFS (nvme1n1): Ending clean mount
[] ------------[ cut here ]------------
[] page type is 3, passed migratetype is 1 (nr=512)
[] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
[] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
[] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
[] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[] RIP: 0010:expand+0x1c5/0x1f0
[] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
[] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
[] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
[] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
[] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
[] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
[] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
[] FS: 00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
[] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
[] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[] PKRU: 55555554
[] Call Trace:
[] <TASK>
[] ? __warn+0x7b/0x120
[] ? expand+0x1c5/0x1f0
[] ? report_bug+0x191/0x1c0
[] ? handle_bug+0x3c/0x80
[] ? exc_invalid_op+0x17/0x70
[] ? asm_exc_invalid_op+0x1a/0x20
[] ? expand+0x1c5/0x1f0
[] ? expand+0x1c5/0x1f0
[] __rmqueue_pcplist+0x3a9/0x730
[] get_page_from_freelist+0x7a0/0xf00
[] __alloc_pages_noprof+0x153/0x2e0
[] __folio_alloc_noprof+0x10/0xa0
[] __filemap_get_folio+0x16b/0x370
[] iomap_write_begin+0x496/0x680
While trying to service a movable allocation (page type 1), the page
allocator runs into a two-pageblock buddy on the movable freelist
whose second block is typed as highatomic (page type 3).
This inconsistency is caused by the highatomic reservation system
operating on single pageblocks, while MAX_ORDER can be bigger than
that - in this configuration, pageblock_order is 9 while
MAX_PAGE_ORDER is 10. The test case is observed to make several
adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
marks the surrounding block as highatomic. Upon freeing, the blocks
merge into an order-10 buddy. When the highatomic pool is drained
later on, this order-10 buddy gets moved back to the movable list, but
only the first pageblock is marked movable again. A subsequent
expand() of this buddy warns about the tail being of a different type.
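For illustration, a minimal userspace sketch of the arithmetic above - not
kernel code; it just hardcodes the pageblock_order = 9 / MAX_PAGE_ORDER = 10
configuration mentioned in this report:

#include <stdio.h>

int main(void)
{
	const unsigned int pageblock_order = 9;	/* from the config above */
	const unsigned int max_page_order = 10;	/* MAX_PAGE_ORDER */
	const unsigned long pageblock_nr_pages = 1UL << pageblock_order;
	const unsigned long max_buddy_pages = 1UL << max_page_order;

	/* An order-10 buddy spans two order-9 pageblocks. */
	printf("pageblock: %lu pages, max-order buddy: %lu pages (%lu pageblocks)\n",
	       pageblock_nr_pages, max_buddy_pages,
	       max_buddy_pages / pageblock_nr_pages);

	/*
	 * Only the first pageblock of the merged buddy gets retyped when the
	 * highatomic pool is drained; the 512-page tail keeps the old type,
	 * which lines up with the nr=512 in the warning above.
	 */
	printf("tail pages left behind: %lu\n",
	       max_buddy_pages - pageblock_nr_pages);
	return 0;
}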
This is a long-standing bug that's surfaced by the recent block type
warnings added to the allocator. The consequences seem mostly benign;
it just results in odd behavior: the highatomic tail blocks are not
properly drained; instead, they end up on the movable list first, then
go back to the highatomic list after an alloc-free cycle.
To fix this, make the highatomic reservation code aware that
allocations/buddies can be larger than a pageblock.
While it's an old quirk, the recently added type consistency warnings
seem to be the most prominent consequence of it. Set the Fixes: tag
accordingly to highlight this backporting dependency.
Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/page_alloc.c | 50 +++++++++++++++++++++++++++++++++----------------
1 file changed, 34 insertions(+), 16 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5675ca..222299b5c0e6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1955,10 +1955,12 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
}
/*
- * Reserve a pageblock for exclusive use of high-order atomic allocations if
- * there are no empty page blocks that contain a page with a suitable order
+ * Reserve the pageblock(s) surrounding an allocation request for
+ * exclusive use of high-order atomic allocations if there are no
+ * empty page blocks that contain a page with a suitable order
*/
-static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
+static void reserve_highatomic_pageblock(struct page *page, int order,
+ struct zone *zone)
{
int mt;
unsigned long max_managed, flags;
@@ -1984,10 +1986,17 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
/* Yoink! */
mt = get_pageblock_migratetype(page);
/* Only reserve normal pageblocks (i.e., they can merge with others) */
- if (migratetype_is_mergeable(mt))
- if (move_freepages_block(zone, page, mt,
- MIGRATE_HIGHATOMIC) != -1)
- zone->nr_reserved_highatomic += pageblock_nr_pages;
+ if (!migratetype_is_mergeable(mt))
+ goto out_unlock;
+
+ if (order < pageblock_order) {
+ if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
+ goto out_unlock;
+ zone->nr_reserved_highatomic += pageblock_nr_pages;
+ } else {
+ change_pageblock_range(page, order, MIGRATE_HIGHATOMIC);
+ zone->nr_reserved_highatomic += 1 << order;
+ }
out_unlock:
spin_unlock_irqrestore(&zone->lock, flags);
@@ -1999,7 +2008,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
* intense memory pressure but failed atomic allocations should be easier
* to recover from than an OOM.
*
- * If @force is true, try to unreserve a pageblock even though highatomic
+ * If @force is true, try to unreserve pageblocks even though highatomic
* pageblock is exhausted.
*/
static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
@@ -2041,6 +2050,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* adjust the count once.
*/
if (is_migrate_highatomic(mt)) {
+ unsigned long size;
/*
* It should never happen but changes to
* locking could inadvertently allow a per-cpu
@@ -2048,9 +2058,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* while unreserving so be safe and watch for
* underflows.
*/
- zone->nr_reserved_highatomic -= min(
- pageblock_nr_pages,
- zone->nr_reserved_highatomic);
+ size = max(pageblock_nr_pages, 1UL << order);
+ size = min(size, zone->nr_reserved_highatomic);
+ zone->nr_reserved_highatomic -= size;
}
/*
@@ -2062,11 +2072,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* of pageblocks that cannot be completely freed
* may increase.
*/
- ret = move_freepages_block(zone, page, mt,
- ac->migratetype);
+ if (order < pageblock_order)
+ ret = move_freepages_block(zone, page, mt,
+ ac->migratetype);
+ else {
+ move_to_free_list(page, zone, order, mt,
+ ac->migratetype);
+ change_pageblock_range(page, order,
+ ac->migratetype);
+ ret = 1;
+ }
/*
- * Reserving this block already succeeded, so this should
- * not fail on zone boundaries.
+ * Reserving the block(s) already succeeded,
+ * so this should not fail on zone boundaries.
*/
WARN_ON_ONCE(ret == -1);
if (ret > 0) {
@@ -3406,7 +3424,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
* if the pageblock should be reserved for the future
*/
if (unlikely(alloc_flags & ALLOC_HIGHATOMIC))
- reserve_highatomic_pageblock(page, zone);
+ reserve_highatomic_pageblock(page, order, zone);
return page;
} else {
--
2.45.1
* Re: page type is 3, passed migratetype is 1 (nr=512)
2024-05-30 11:42 ` page type is 3, passed migratetype is 1 (nr=512) Johannes Weiner
@ 2024-05-30 14:34 ` Zi Yan
2024-05-31 13:43 ` Vlastimil Babka
1 sibling, 0 replies; 13+ messages in thread
From: Zi Yan @ 2024-05-30 14:34 UTC (permalink / raw)
To: Johannes Weiner
Cc: Christoph Hellwig, Andy Shevchenko, Baolin Wang, linux-mm,
Andrew Morton, Vlastimil Babka, Mel Gorman
On 30 May 2024, at 7:42, Johannes Weiner wrote:
> On Wed, May 29, 2024 at 09:04:25PM -0400, Johannes Weiner wrote:
>> Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
>
> Argh, I dropped the reserve_highatomic_pageblock() caller update when
> removing the printks right before sending out. My apologies. Here is
> the fixed version:
>
> ---
>
> From 6aa9498ee0d7161b0605251116d16b18cd448552 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Wed, 29 May 2024 18:18:12 -0400
> Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
>
> Christoph reports a page allocator splat triggered by xfstests:
>
> generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
> [] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
> [] XFS (nvme0n1): Ending clean mount
> [] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [] XFS (nvme1n1): Ending clean mount
> [] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
> [] XFS (nvme1n1): Ending clean mount
> [] ------------[ cut here ]------------
> [] page type is 3, passed migratetype is 1 (nr=512)
> [] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
> [] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
> [] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
> [] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [] RIP: 0010:expand+0x1c5/0x1f0
> [] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
> [] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
> [] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
> [] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
> [] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
> [] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
> [] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
> [] FS: 00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
> [] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
> [] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [] PKRU: 55555554
> [] Call Trace:
> [] <TASK>
> [] ? __warn+0x7b/0x120
> [] ? expand+0x1c5/0x1f0
> [] ? report_bug+0x191/0x1c0
> [] ? handle_bug+0x3c/0x80
> [] ? exc_invalid_op+0x17/0x70
> [] ? asm_exc_invalid_op+0x1a/0x20
> [] ? expand+0x1c5/0x1f0
> [] ? expand+0x1c5/0x1f0
> [] __rmqueue_pcplist+0x3a9/0x730
> [] get_page_from_freelist+0x7a0/0xf00
> [] __alloc_pages_noprof+0x153/0x2e0
> [] __folio_alloc_noprof+0x10/0xa0
> [] __filemap_get_folio+0x16b/0x370
> [] iomap_write_begin+0x496/0x680
>
> While trying to service a movable allocation (page type 1), the page
> allocator runs into a two-pageblock buddy on the movable freelist
> whose second block is typed as highatomic (page type 3).
>
> This inconsistency is caused by the highatomic reservation system
> operating on single pageblocks, while MAX_ORDER can be bigger than
> that - in this configuration, pageblock_order is 9 while
> MAX_PAGE_ORDER is 10. The test case is observed to make several
> adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
> marks the surrounding block as highatomic. Upon freeing, the blocks
> merge into an order-10 buddy. When the highatomic pool is drained
> later on, this order-10 buddy gets moved back to the movable list, but
> only the first pageblock is marked movable again. A subsequent
> expand() of this buddy warns about the tail being of a different type.
>
> This is a long-standing bug that's surfaced by the recent block type
> warnings added to the allocator. The consequences seem mostly benign;
> it just results in odd behavior: the highatomic tail blocks are not
> properly drained; instead, they end up on the movable list first, then
> go back to the highatomic list after an alloc-free cycle.
>
> To fix this, make the highatomic reservation code aware that
> allocations/buddies can be larger than a pageblock.
>
> While it's an old quirk, the recently added type consistency warnings
> seem to be the most prominent consequence of it. Set the Fixes: tag
> accordingly to highlight this backporting dependency.
>
> Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
> Reported-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> mm/page_alloc.c | 50 +++++++++++++++++++++++++++++++++----------------
> 1 file changed, 34 insertions(+), 16 deletions(-)
>
The changes look good to me. Thank you for answering my question.
Reviewed-by: Zi Yan <ziy@nvidia.com>
--
Best Regards,
Yan, Zi
* Re: page type is 3, passed migratetype is 1 (nr=512)
2024-05-30 1:04 ` Johannes Weiner
` (2 preceding siblings ...)
2024-05-30 11:42 ` page type is 3, passed migratetype is 1 (nr=512) Johannes Weiner
@ 2024-05-31 5:41 ` Christoph Hellwig
3 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2024-05-31 5:41 UTC (permalink / raw)
To: Johannes Weiner
Cc: Christoph Hellwig, Andy Shevchenko, Baolin Wang, linux-mm,
Andrew Morton, Vlastimil Babka, Zi Yan, Mel Gorman
On Wed, May 29, 2024 at 09:04:19PM -0400, Johannes Weiner wrote:
> The below fixes the issue for me and survives a few hours of
> xfstest/generic/176 for me.
>
> Christoph, can you please confirm it addresses what you saw?
I had to pass order in one more place for it to compile, but with
that the issue has gone away over extensive testing:
Tested-by: Christoph Hellwig <hch@lst.de>
* Re: page type is 3, passed migratetype is 1 (nr=512)
2024-05-30 11:42 ` page type is 3, passed migratetype is 1 (nr=512) Johannes Weiner
2024-05-30 14:34 ` Zi Yan
@ 2024-05-31 13:43 ` Vlastimil Babka
1 sibling, 0 replies; 13+ messages in thread
From: Vlastimil Babka @ 2024-05-31 13:43 UTC (permalink / raw)
To: Johannes Weiner, Christoph Hellwig
Cc: Andy Shevchenko, Baolin Wang, linux-mm, Andrew Morton, Zi Yan,
Mel Gorman
On 5/30/24 1:42 PM, Johannes Weiner wrote:
> On Wed, May 29, 2024 at 09:04:25PM -0400, Johannes Weiner wrote:
>> Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
>
> Argh, I dropped the reserve_highatomic_pageblock() caller update when
> removing the printks right before sending out. My apologies. Here is
> the fixed version:
>
> ---
>
> From 6aa9498ee0d7161b0605251116d16b18cd448552 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Wed, 29 May 2024 18:18:12 -0400
> Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
>
> Christoph reports a page allocator splat triggered by xfstests:
>
> generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
> [] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
> [] XFS (nvme0n1): Ending clean mount
> [] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [] XFS (nvme1n1): Ending clean mount
> [] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
> [] XFS (nvme1n1): Ending clean mount
> [] ------------[ cut here ]------------
> [] page type is 3, passed migratetype is 1 (nr=512)
> [] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
> [] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
> [] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
> [] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [] RIP: 0010:expand+0x1c5/0x1f0
> [] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
> [] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
> [] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
> [] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
> [] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
> [] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
> [] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
> [] FS: 00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
> [] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
> [] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [] PKRU: 55555554
> [] Call Trace:
> [] <TASK>
> [] ? __warn+0x7b/0x120
> [] ? expand+0x1c5/0x1f0
> [] ? report_bug+0x191/0x1c0
> [] ? handle_bug+0x3c/0x80
> [] ? exc_invalid_op+0x17/0x70
> [] ? asm_exc_invalid_op+0x1a/0x20
> [] ? expand+0x1c5/0x1f0
> [] ? expand+0x1c5/0x1f0
> [] __rmqueue_pcplist+0x3a9/0x730
> [] get_page_from_freelist+0x7a0/0xf00
> [] __alloc_pages_noprof+0x153/0x2e0
> [] __folio_alloc_noprof+0x10/0xa0
> [] __filemap_get_folio+0x16b/0x370
> [] iomap_write_begin+0x496/0x680
>
> While trying to service a movable allocation (page type 1), the page
> allocator runs into a two-pageblock buddy on the movable freelist
> whose second block is typed as highatomic (page type 3).
>
> This inconsistency is caused by the highatomic reservation system
> operating on single pageblocks, while MAX_ORDER can be bigger than
> that - in this configuration, pageblock_order is 9 while
> MAX_PAGE_ORDER is 10. The test case is observed to make several
> adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
> marks the surrounding block as highatomic. Upon freeing, the blocks
> merge into an order-10 buddy. When the highatomic pool is drained
> later on, this order-10 buddy gets moved back to the movable list, but
> only the first pageblock is marked movable again. A subsequent
> expand() of this buddy warns about the tail being of a different type.
>
> This is a long-standing bug that's surfaced by the recent block type
> warnings added to the allocator. The consequences seem mostly benign;
> it just results in odd behavior: the highatomic tail blocks are not
> properly drained; instead, they end up on the movable list first, then
> go back to the highatomic list after an alloc-free cycle.
>
> To fix this, make the highatomic reservation code aware that
> allocations/buddies can be larger than a pageblock.
>
> While it's an old quirk, the recently added type consistency warnings
> seem to be the most prominent consequence of it. Set the Fixes: tag
> accordingly to highlight this backporting dependency.
>
> Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
> Reported-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Looks good, thanks.
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>