linux-mm.kvack.org archive mirror
* page type is 3, passed migratetype is 1 (nr=512)
@ 2024-05-27  8:58 Christoph Hellwig
  2024-05-27 13:14 ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2024-05-27  8:58 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andy Shevchenko, Baolin Wang, linux-mm

Hi all,

when running xfstests on nfs against a local server I see warnings like
the ones above, which appear to have been added in commit
e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting").

Here is a typical backtrace:

generic/154 109s ... [  617.027401] run fstests generic/154 at
2024-05-27 08:53:22
[  628.242905] ------------[ cut here ]------------
[  628.243233] page type is 3, passed migratetype is 1 (nr=512)
[  628.243595] WARNING: CPU: 1 PID: 3608 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
[  628.244017] Modules linked in: crc32_pclmul i2c_piix4
[  628.244321] CPU: 1 PID: 3608 Comm: nfsd Not tainted 6.10.0-rc1+ #2435
[  628.244711] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[  628.245270] RIP: 0010:expand+0x1c5/0x1f0
[  628.245514] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 00 00 d3 e0 83
[  628.246610] RSP: 0018:ffffc9000151f6f8 EFLAGS: 00010082
[  628.246920] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
[  628.247337] RDX: 0000000000000003 RSI: 0000000000000027 RDI: 00000000ffffffff
[  628.247760] RBP: 0000000000174200 R08: 00000000fffeffff R09: 0000000000000001
[  628.248175] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
[  628.248603] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0005d08000
[  628.249022] FS:  0000000000000000(0000) GS:ffff888237d00000(0000) knlGS:0000000000000000
[  628.249499] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  628.249841] CR2: 00005643a0160004 CR3: 000000000365a001 CR4: 0000000000770ef0
[  628.250265] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  628.250697] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[  628.251121] PKRU: 55555554
[  628.251285] Call Trace:
[  628.251436]  <TASK>
[  628.251572]  ? __warn+0x7b/0x120
[  628.251769]  ? expand+0x1c5/0x1f0
[  628.251969]  ? report_bug+0x191/0x1c0
[  628.252202]  ? handle_bug+0x3c/0x80
[  628.252413]  ? exc_invalid_op+0x17/0x70
[  628.252646]  ? asm_exc_invalid_op+0x1a/0x20
[  628.252899]  ? expand+0x1c5/0x1f0
[  628.253100]  ? expand+0x1c5/0x1f0
[  628.253299]  ? __free_one_page+0x5e4/0x750
[  628.253549]  get_page_from_freelist+0x305/0xf00
[  628.253822]  __alloc_pages_slowpath.constprop.0+0x42b/0xca0
[  628.254148]  __alloc_pages_noprof+0x2b9/0x2e0
[  628.254408]  __folio_alloc_noprof+0x10/0xa0
[  628.254661]  __filemap_get_folio+0x16b/0x370
[  628.254915]  iomap_write_begin+0x496/0x680
[  628.255159]  iomap_file_buffered_write+0x17f/0x440
[  628.255441]  ? __break_lease+0x469/0x710
[  628.255682]  xfs_file_buffered_write+0x7e/0x2a0
[  628.255952]  do_iter_readv_writev+0x107/0x200
[  628.256213]  vfs_iter_write+0x9e/0x240
[  628.256436]  nfsd_vfs_write+0x188/0x5a0
[  628.256673]  nfsd4_write+0x133/0x1c0
[  628.256888]  nfsd4_proc_compound+0x2e3/0x5f0
[  628.257144]  nfsd_dispatch+0xc8/0x210
[  628.257363]  svc_process_common+0x4cf/0x700
[  628.257619]  ? __pfx_nfsd_dispatch+0x10/0x10
[  628.257873]  svc_process+0x13e/0x190
[  628.258086]  svc_recv+0x838/0xa00
[  628.258286]  ? __pfx_nfsd+0x10/0x10
[  628.258500]  nfsd+0x82/0xf0
[  628.258668]  kthread+0xd9/0x110
[  628.258900]  ? __pfx_kthread+0x10/0x10
[  628.259192]  ret_from_fork+0x2c/0x50
[  628.259477]  ? __pfx_kthread+0x10/0x10
[  628.259763]  ret_from_fork_asm+0x1a/0x30
[  628.260070]  </TASK>
[  628.260244] ---[ end trace 0000000000000000 ]---



* Re: page type is 3, passed migratetype is 1 (nr=512)
  2024-05-27  8:58 page type is 3, passed migratetype is 1 (nr=512) Christoph Hellwig
@ 2024-05-27 13:14 ` Christoph Hellwig
  2024-05-28 16:47   ` Johannes Weiner
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2024-05-27 13:14 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andy Shevchenko, Baolin Wang, linux-mm

On Mon, May 27, 2024 at 01:58:25AM -0700, Christoph Hellwig wrote:
> Hi all,
> 
> when running xfstests on nfs against a local server I see warnings like
> the ones above, which appear to have been added in commit
> e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting").

I've also reproduced this with xfstests on local xfs and no nfs in the
loop:

generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
[ 1204.969286] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
[ 1204.993621] XFS (nvme0n1): Ending clean mount
[ 1205.387032] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[ 1205.412322] XFS (nvme1n1): Ending clean mount
[ 1205.440388] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[ 1205.808063] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
[ 1205.827290] XFS (nvme1n1): Ending clean mount
[ 1208.058931] ------------[ cut here ]------------
[ 1208.059613] page type is 3, passed migratetype is 1 (nr=512)
[ 1208.060402] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
[ 1208.061352] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
[ 1208.062344] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
[ 1208.063150] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 1208.064204] RIP: 0010:expand+0x1c5/0x1f0
[ 1208.064625] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
[ 1208.066555] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
[ 1208.067111] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
[ 1208.067872] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 1208.068629] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
[ 1208.069336] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
[ 1208.070038] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
[ 1208.070750] FS:  00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
[ 1208.071552] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1208.072121] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
[ 1208.072829] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1208.073527] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 1208.074225] PKRU: 55555554
[ 1208.074507] Call Trace:
[ 1208.074758]  <TASK>
[ 1208.074977]  ? __warn+0x7b/0x120
[ 1208.075308]  ? expand+0x1c5/0x1f0
[ 1208.075652]  ? report_bug+0x191/0x1c0
[ 1208.076043]  ? handle_bug+0x3c/0x80
[ 1208.076400]  ? exc_invalid_op+0x17/0x70
[ 1208.076782]  ? asm_exc_invalid_op+0x1a/0x20
[ 1208.077203]  ? expand+0x1c5/0x1f0
[ 1208.077543]  ? expand+0x1c5/0x1f0
[ 1208.077878]  __rmqueue_pcplist+0x3a9/0x730
[ 1208.078285]  get_page_from_freelist+0x7a0/0xf00
[ 1208.078745]  __alloc_pages_noprof+0x153/0x2e0
[ 1208.079181]  __folio_alloc_noprof+0x10/0xa0
[ 1208.079603]  __filemap_get_folio+0x16b/0x370
[ 1208.080030]  iomap_write_begin+0x496/0x680
[ 1208.080441]  iomap_file_buffered_write+0x17f/0x440
[ 1208.080916]  xfs_file_buffered_write+0x7e/0x2a0
[ 1208.081374]  vfs_write+0x262/0x440
[ 1208.081717]  __x64_sys_pwrite64+0x8f/0xc0
[ 1208.082112]  do_syscall_64+0x4f/0x120
[ 1208.082487]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1208.082982] RIP: 0033:0x7f72ca4ce2b7
[ 1208.083350] Code: 08 89 3c 24 48 89 4c 24 18 e8 15 f4 f8 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 b
[ 1208.085126] RSP: 002b:00007ffe56d1a930 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
[ 1208.085867] RAX: ffffffffffffffda RBX: 0000000154400000 RCX: 00007f72ca4ce2b7
[ 1208.086560] RDX: 0000000000400000 RSI: 00007f72c9401000 RDI: 0000000000000003
[ 1208.087248] RBP: 0000000154400000 R08: 0000000000000000 R09: 00007ffe56d1a9d0
[ 1208.087946] R10: 0000000154400000 R11: 0000000000000293 R12: 00000000ffffffff
[ 1208.088639] R13: 00000000abc00000 R14: 0000000000000000 R15: 0000000000000551
[ 1208.089340]  </TASK>
[ 1208.089565] ---[ end trace 0000000000000000 ]---




* Re: page type is 3, passed migratetype is 1 (nr=512)
  2024-05-27 13:14 ` Christoph Hellwig
@ 2024-05-28 16:47   ` Johannes Weiner
  2024-05-29  5:43     ` Christoph Hellwig
  2024-05-29 16:28     ` Johannes Weiner
  0 siblings, 2 replies; 13+ messages in thread
From: Johannes Weiner @ 2024-05-28 16:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Andy Shevchenko, Baolin Wang, linux-mm

Hello,

On Mon, May 27, 2024 at 06:14:12AM -0700, Christoph Hellwig wrote:
> On Mon, May 27, 2024 at 01:58:25AM -0700, Christoph Hellwig wrote:
> > Hi all,
> > 
> > when running xfstests on nfs against a local server I see warnings like
> > the ones above, which appear to have been added in commit
> > e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting").
> 
> I've also reproduced this with xfstests on local xfs and no nfs in the
> loop:
> 
> generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
> [ 1204.969286] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
> [ 1204.993621] XFS (nvme0n1): Ending clean mount
> [ 1205.387032] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [ 1205.412322] XFS (nvme1n1): Ending clean mount
> [ 1205.440388] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [ 1205.808063] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
> [ 1205.827290] XFS (nvme1n1): Ending clean mount
> [ 1208.058931] ------------[ cut here ]------------
> [ 1208.059613] page type is 3, passed migratetype is 1 (nr=512)
> [ 1208.060402] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
> [ 1208.061352] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
> [ 1208.062344] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
> [ 1208.063150] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014

Thanks for the report.

Could you please send me your .config? I'll try to reproduce it
locally.

> [ 1208.064204] RIP: 0010:expand+0x1c5/0x1f0
> [ 1208.064625] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
> [ 1208.066555] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
> [ 1208.067111] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
> [ 1208.067872] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
> [ 1208.068629] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
> [ 1208.069336] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
> [ 1208.070038] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
> [ 1208.070750] FS:  00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
> [ 1208.071552] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1208.072121] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
> [ 1208.072829] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1208.073527] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [ 1208.074225] PKRU: 55555554
> [ 1208.074507] Call Trace:
> [ 1208.074758]  <TASK>
> [ 1208.074977]  ? __warn+0x7b/0x120
> [ 1208.075308]  ? expand+0x1c5/0x1f0
> [ 1208.075652]  ? report_bug+0x191/0x1c0
> [ 1208.076043]  ? handle_bug+0x3c/0x80
> [ 1208.076400]  ? exc_invalid_op+0x17/0x70
> [ 1208.076782]  ? asm_exc_invalid_op+0x1a/0x20
> [ 1208.077203]  ? expand+0x1c5/0x1f0
> [ 1208.077543]  ? expand+0x1c5/0x1f0
> [ 1208.077878]  __rmqueue_pcplist+0x3a9/0x730

Ok so the allocator is taking a larger buddy off the freelist to
satisfy a smaller request, then puts the remainder back on the list.

There is no warning from the del_page_from_free_list(), so the buddy
type and the type of the list it was taken from are coherent.

The warning happens when it expands the remainder of the buddy and
finds the tail block to be of a different type.

Specifically, it takes a movable buddy (type 1) off the movable list,
but finds a tail block of it marked highatomic (type 3).

I don't see how we could have merged those during freeing, because the
highatomic buddy would have failed migratetype_is_mergeable().

Ah, but there DOES seem to be an issue with how we reserve
highatomics: reserving and unreserving happens one pageblock at a
time, but MAX_ORDER is usually bigger. If we rmqueue() an order-10
request, reserve_highatomic_block() will only convert the first
order-9 block in it; the tail will remain the original type, which
will produce a buddy of mixed type blocks upon freeing.

This doesn't fully explain the warning here. We'd expect to see it the
other way round - passing an assumed type of 3 (HIGHATOMIC) for the
remainder that is actually 1 (MOVABLE). But the pageblock-based
reservations look fishy. I'll cook up a patch to make this
range-based. It might just fix it in a way I'm not seeing just yet.

> [ 1208.078285]  get_page_from_freelist+0x7a0/0xf00
> [ 1208.078745]  __alloc_pages_noprof+0x153/0x2e0
> [ 1208.079181]  __folio_alloc_noprof+0x10/0xa0
> [ 1208.079603]  __filemap_get_folio+0x16b/0x370
> [ 1208.080030]  iomap_write_begin+0x496/0x680
> [ 1208.080441]  iomap_file_buffered_write+0x17f/0x440
> [ 1208.080916]  xfs_file_buffered_write+0x7e/0x2a0
> [ 1208.081374]  vfs_write+0x262/0x440
> [ 1208.081717]  __x64_sys_pwrite64+0x8f/0xc0
> [ 1208.082112]  do_syscall_64+0x4f/0x120
> [ 1208.082487]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1208.082982] RIP: 0033:0x7f72ca4ce2b7
> [ 1208.083350] Code: 08 89 3c 24 48 89 4c 24 18 e8 15 f4 f8 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 b
> [ 1208.085126] RSP: 002b:00007ffe56d1a930 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
> [ 1208.085867] RAX: ffffffffffffffda RBX: 0000000154400000 RCX: 00007f72ca4ce2b7
> [ 1208.086560] RDX: 0000000000400000 RSI: 00007f72c9401000 RDI: 0000000000000003
> [ 1208.087248] RBP: 0000000154400000 R08: 0000000000000000 R09: 00007ffe56d1a9d0
> [ 1208.087946] R10: 0000000154400000 R11: 0000000000000293 R12: 00000000ffffffff
> [ 1208.088639] R13: 00000000abc00000 R14: 0000000000000000 R15: 0000000000000551
> [ 1208.089340]  </TASK>
> [ 1208.089565] ---[ end trace 0000000000000000 ]---



* Re: page type is 3, passed migratetype is 1 (nr=512)
  2024-05-28 16:47   ` Johannes Weiner
@ 2024-05-29  5:43     ` Christoph Hellwig
  2024-05-29 16:28     ` Johannes Weiner
  1 sibling, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2024-05-29  5:43 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Christoph Hellwig, Andy Shevchenko, Baolin Wang, linux-mm


On Tue, May 28, 2024 at 12:47:56PM -0400, Johannes Weiner wrote:
> Could you please send me your .config? I'll try to reproduce it
> locally.

Attached.


[-- Attachment #2: config.hch.gz --]
[-- Type: application/gzip, Size: 39546 bytes --]


* Re: page type is 3, passed migratetype is 1 (nr=512)
  2024-05-28 16:47   ` Johannes Weiner
  2024-05-29  5:43     ` Christoph Hellwig
@ 2024-05-29 16:28     ` Johannes Weiner
  2024-05-30  1:04       ` Johannes Weiner
  1 sibling, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2024-05-29 16:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Andy Shevchenko, Baolin Wang, linux-mm

Thanks for the config, I was able to reproduce the issue with it.

On Tue, May 28, 2024 at 12:48:05PM -0400, Johannes Weiner wrote:
> Ah, but there DOES seem to be an issue with how we reserve
> highatomics: reserving and unreserving happens one pageblock at a
> time, but MAX_ORDER is usually bigger. If we rmqueue() an order-10
> request, reserve_highatomic_block() will only convert the first
> order-9 block in it; the tail will remain the original type, which
> will produce a buddy of mixed type blocks upon freeing.
> 
> This doesn't fully explain the warning here. We'd expect to see it the
> other way round - passing an assumed type of 3 (HIGHATOMIC) for the
> remainder that is actually 1 (MOVABLE). But the pageblock-based
> reservations look fishy. I'll cook up a patch to make this
> range-based. It might just fix it in a way I'm not seeing just yet.

tl;dr: With some debugging printks, I was able to see the
issue. It should be a straightforward fix.

No order-10 allocations are necessary. Instead, smaller allocations
grab blocks for the highatomic pool. Unreserving is lazy, so as those
allocations get freed, they have a chance to merge together. Two
adjacent highatomic blocks can merge (MAX_ORDER > pageblock_order). On
unreserve, we now have an order-10 highatomic buddy, but only clear
the type on the first order-9 pageblock. A subsequent alloc + expand
will warn about this type inconsistency.

[  411.188518] UNRESERVE: pfn=26000 order=10
[  411.188739] ------------[ cut here ]------------
[  411.188881] 26200: page type is 3, passed migratetype is 1 (nr=512)
[  411.189097] WARNING: CPU: 1 PID: 10152 at mm/page_alloc.c:645 expand+0x1c8/0x1f0

I have a draft patch to make the highatomic reservations update all
blocks inside the range, not just the first one. I'll send it out as
soon as I have tested it properly.

Thanks



* Re: page type is 3, passed migratetype is 1 (nr=512)
  2024-05-29 16:28     ` Johannes Weiner
@ 2024-05-30  1:04       ` Johannes Weiner
  2024-05-30  1:51         ` Zi Yan
                           ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Johannes Weiner @ 2024-05-30  1:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andy Shevchenko, Baolin Wang, linux-mm, Andrew Morton,
	Vlastimil Babka, Zi Yan, Mel Gorman

On Wed, May 29, 2024 at 12:28:35PM -0400, Johannes Weiner wrote:
> Thanks for the config, I was able to reproduce the issue with it.
> 
> On Tue, May 28, 2024 at 12:48:05PM -0400, Johannes Weiner wrote:
> > Ah, but there DOES seem to be an issue with how we reserve
> > highatomics: reserving and unreserving happens one pageblock at a
> > time, but MAX_ORDER is usually bigger. If we rmqueue() an order-10
> > request, reserve_highatomic_block() will only convert the first
> > order-9 block in it; the tail will remain the original type, which
> > will produce a buddy of mixed type blocks upon freeing.
> > 
> > This doesn't fully explain the warning here. We'd expect to see it the
> > other way round - passing an assumed type of 3 (HIGHATOMIC) for the
> > remainder that is actually 1 (MOVABLE). But the pageblock-based
> > reservations look fishy. I'll cook up a patch to make this
> > range-based. It might just fix it in a way I'm not seeing just yet.
> 
> tl;dr: With some debugging printks, I was able to see the
> issue. It should be a straightforward fix.
> 
> No order-10 allocations are necessary. Instead, smaller allocations
> grab blocks for the highatomic pool. Unreserving is lazy, so as those
> allocations get freed, they have a chance to merge together. Two
> adjacent highatomic blocks can merge (MAX_ORDER > pageblock_order). On
> unreserve, we now have an order-10 highatomic buddy, but only clear
> the type on the first order-9 pageblock. A subsequent alloc + expand
> will warn about this type inconsistency.
> 
> [  411.188518] UNRESERVE: pfn=26000 order=10
> [  411.188739] ------------[ cut here ]------------
> [  411.188881] 26200: page type is 3, passed migratetype is 1 (nr=512)
> [  411.189097] WARNING: CPU: 1 PID: 10152 at mm/page_alloc.c:645 expand+0x1c8/0x1f0
> 
> I have a draft patch to make the highatomic reservations update all
> blocks inside the range, not just the first one. I'll send it out as
> soon as I have tested it properly.

The patch below fixes the issue and survives a few hours of
xfstests generic/176 for me.

Christoph, can you please confirm it addresses what you saw?

Vlastimil, Zi, Mel, can you please take a look?

Thanks!

---

From a0ba36bb9cd7404c07c7a56139fd52efc6981ced Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 29 May 2024 18:18:12 -0400
Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies

Christoph reports a page allocator splat triggered by xfstests:

generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
[] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
[] XFS (nvme0n1): Ending clean mount
[] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[] XFS (nvme1n1): Ending clean mount
[] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
[] XFS (nvme1n1): Ending clean mount
[] ------------[ cut here ]------------
[] page type is 3, passed migratetype is 1 (nr=512)
[] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
[] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
[] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
[] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[] RIP: 0010:expand+0x1c5/0x1f0
[] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
[] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
[] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
[] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
[] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
[] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
[] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
[] FS:  00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
[] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
[] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[] PKRU: 55555554
[] Call Trace:
[]  <TASK>
[]  ? __warn+0x7b/0x120
[]  ? expand+0x1c5/0x1f0
[]  ? report_bug+0x191/0x1c0
[]  ? handle_bug+0x3c/0x80
[]  ? exc_invalid_op+0x17/0x70
[]  ? asm_exc_invalid_op+0x1a/0x20
[]  ? expand+0x1c5/0x1f0
[]  ? expand+0x1c5/0x1f0
[]  __rmqueue_pcplist+0x3a9/0x730
[]  get_page_from_freelist+0x7a0/0xf00
[]  __alloc_pages_noprof+0x153/0x2e0
[]  __folio_alloc_noprof+0x10/0xa0
[]  __filemap_get_folio+0x16b/0x370
[]  iomap_write_begin+0x496/0x680

While trying to service a movable allocation (page type 1), the page
allocator runs into a two-pageblock buddy on the movable freelist
whose second block is typed as highatomic (page type 3).

This inconsistency is caused by the highatomic reservation system
operating on single pageblocks, while MAX_ORDER can be bigger than
that - in this configuration, pageblock_order is 9 while
MAX_PAGE_ORDER is 10. The test case is observed to make several
adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
marks the surrounding block as highatomic. Upon freeing, the blocks
merge into an order-10 buddy. When the highatomic pool is drained
later on, this order-10 buddy gets moved back to the movable list, but
only the first pageblock is marked movable again. A subsequent
expand() of this buddy warns about the tail being of a different type.

A similar warning could trigger on an order-10 request that only marks
the first pageblock as highatomic and leaves the second one movable.

This is a long-standing bug that's surfaced by the recent block type
warnings added to the allocator. The consequences seem mostly benign;
it just results in odd behavior: the highatomic tail blocks are not
properly drained; instead, they end up on the movable list first, then
go back to the highatomic list after an alloc-free cycle.

To fix this, make the highatomic reservation code aware that
allocations/buddies can be larger than a pageblock.

While it's an old quirk, the recently added type consistency warnings
seem to be the most prominent consequence of it. Set the Fixes: tag
accordingly to highlight this backporting dependency.

Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 48 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5675ca..754ca5821266 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1955,10 +1955,12 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 }
 
 /*
- * Reserve a pageblock for exclusive use of high-order atomic allocations if
- * there are no empty page blocks that contain a page with a suitable order
+ * Reserve the pageblock(s) surrounding an allocation request for
+ * exclusive use of high-order atomic allocations if there are no
+ * empty page blocks that contain a page with a suitable order
  */
-static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
+static void reserve_highatomic_pageblock(struct page *page, int order,
+					 struct zone *zone)
 {
 	int mt;
 	unsigned long max_managed, flags;
@@ -1984,10 +1986,17 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
 	/* Yoink! */
 	mt = get_pageblock_migratetype(page);
 	/* Only reserve normal pageblocks (i.e., they can merge with others) */
-	if (migratetype_is_mergeable(mt))
-		if (move_freepages_block(zone, page, mt,
-					 MIGRATE_HIGHATOMIC) != -1)
-			zone->nr_reserved_highatomic += pageblock_nr_pages;
+	if (!migratetype_is_mergeable(mt))
+		goto out_unlock;
+
+	if (order < pageblock_order) {
+		if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
+			goto out_unlock;
+		zone->nr_reserved_highatomic += pageblock_nr_pages;
+	} else {
+		change_pageblock_range(page, order, MIGRATE_HIGHATOMIC);
+		zone->nr_reserved_highatomic += 1 << order;
+	}
 
 out_unlock:
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -1999,7 +2008,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
  * intense memory pressure but failed atomic allocations should be easier
  * to recover from than an OOM.
  *
- * If @force is true, try to unreserve a pageblock even though highatomic
+ * If @force is true, try to unreserve pageblocks even though highatomic
  * pageblock is exhausted.
  */
 static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
@@ -2041,6 +2050,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 * adjust the count once.
 			 */
 			if (is_migrate_highatomic(mt)) {
+				unsigned long size;
 				/*
 				 * It should never happen but changes to
 				 * locking could inadvertently allow a per-cpu
@@ -2048,9 +2058,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 				 * while unreserving so be safe and watch for
 				 * underflows.
 				 */
-				zone->nr_reserved_highatomic -= min(
-						pageblock_nr_pages,
-						zone->nr_reserved_highatomic);
+				size = max(pageblock_nr_pages, 1UL << order);
+				size = min(size, zone->nr_reserved_highatomic);
+				zone->nr_reserved_highatomic -= size;
 			}
 
 			/*
@@ -2062,11 +2072,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 * of pageblocks that cannot be completely freed
 			 * may increase.
 			 */
-			ret = move_freepages_block(zone, page, mt,
-						   ac->migratetype);
+			if (order < pageblock_order)
+				ret = move_freepages_block(zone, page, mt,
+							   ac->migratetype);
+			else {
+				move_to_free_list(page, zone, order, mt,
+						  ac->migratetype);
+				change_pageblock_range(page, order,
+						       ac->migratetype);
+				ret = 1;
+			}
 			/*
-			 * Reserving this block already succeeded, so this should
-			 * not fail on zone boundaries.
+			 * Reserving the block(s) already succeeded,
+			 * so this should not fail on zone boundaries.
 			 */
 			WARN_ON_ONCE(ret == -1);
 			if (ret > 0) {
-- 
2.45.1




* Re: page type is 3, passed migratetype is 1 (nr=512)
  2024-05-30  1:04       ` Johannes Weiner
@ 2024-05-30  1:51         ` Zi Yan
  2024-05-30  3:22           ` Johannes Weiner
  2024-05-30  4:06         ` [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies kernel test robot
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Zi Yan @ 2024-05-30  1:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Christoph Hellwig, Andy Shevchenko, Baolin Wang, linux-mm,
	Andrew Morton, Vlastimil Babka, Mel Gorman


On 29 May 2024, at 21:04, Johannes Weiner wrote:

> On Wed, May 29, 2024 at 12:28:35PM -0400, Johannes Weiner wrote:
>> Thanks for the config, I was able to reproduce the issue with it.
>>
>> On Tue, May 28, 2024 at 12:48:05PM -0400, Johannes Weiner wrote:
>>> Ah, but there DOES seem to be an issue with how we reserve
>>> highatomics: reserving and unreserving happens one pageblock at a
>>> time, but MAX_ORDER is usually bigger. If we rmqueue() an order-10
>>> request, reserve_highatomic_block() will only convert the first
>>> order-9 block in it; the tail will remain the original type, which
>>> will produce a buddy of mixed type blocks upon freeing.
>>>
>>> This doesn't fully explain the warning here. We'd expect to see it the
>>> other way round - passing an assumed type of 3 (HIGHATOMIC) for the
>>> remainder that is actually 1 (MOVABLE). But the pageblock-based
>>> reservations look fishy. I'll cook up a patch to make this
>>> range-based. It might just fix it in a way I'm not seeing just yet.
>>
>> tl;dr: With some debugging printks, I was able to see the
>> issue. It should be a straightforward fix.
>>
>> No order-10 allocations are necessary. Instead, smaller allocations
>> grab blocks for the highatomic pool. Unreserving is lazy, so as those
>> allocations get freed, they have a chance to merge together. Two
>> adjacent highatomic blocks can merge (MAX_ORDER > pageblock_order). On
>> unreserve, we now have an order-10 highatomic buddy, but only clear
>> the type on the first order-9 pageblock. A subsequent alloc + expand
>> will warn about this type inconsistency.
>>
>> [  411.188518] UNRESERVE: pfn=26000 order=10
>> [  411.188739] ------------[ cut here ]------------
>> [  411.188881] 26200: page type is 3, passed migratetype is 1 (nr=512)
>> [  411.189097] WARNING: CPU: 1 PID: 10152 at mm/page_alloc.c:645 expand+0x1c8/0x1f0
>>
>> I have a draft patch to make the highatomic reservations update all
>> blocks inside the range, not just the first one. I'll send it out as
>> soon as I have tested it properly.
>
> The patch below fixes the issue and survives a few hours of
> xfstests generic/176 for me.
>
> Christoph, can you please confirm it addresses what you saw?
>
> Vlastimil, Zi, Mel, can you please take a look?
>
> Thanks!
>
> ---
>
> From a0ba36bb9cd7404c07c7a56139fd52efc6981ced Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Wed, 29 May 2024 18:18:12 -0400
> Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
>
> Christoph reports a page allocator splat triggered by xfstests:
>
> generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
> [] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
> [] XFS (nvme0n1): Ending clean mount
> [] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [] XFS (nvme1n1): Ending clean mount
> [] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> [] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
> [] XFS (nvme1n1): Ending clean mount
> [] ------------[ cut here ]------------
> [] page type is 3, passed migratetype is 1 (nr=512)
> [] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
> [] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
> [] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
> [] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [] RIP: 0010:expand+0x1c5/0x1f0
> [] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
> [] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
> [] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
> [] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
> [] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
> [] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
> [] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
> [] FS:  00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
> [] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
> [] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [] PKRU: 55555554
> [] Call Trace:
> []  <TASK>
> []  ? __warn+0x7b/0x120
> []  ? expand+0x1c5/0x1f0
> []  ? report_bug+0x191/0x1c0
> []  ? handle_bug+0x3c/0x80
> []  ? exc_invalid_op+0x17/0x70
> []  ? asm_exc_invalid_op+0x1a/0x20
> []  ? expand+0x1c5/0x1f0
> []  ? expand+0x1c5/0x1f0
> []  __rmqueue_pcplist+0x3a9/0x730
> []  get_page_from_freelist+0x7a0/0xf00
> []  __alloc_pages_noprof+0x153/0x2e0
> []  __folio_alloc_noprof+0x10/0xa0
> []  __filemap_get_folio+0x16b/0x370
> []  iomap_write_begin+0x496/0x680
>
> While trying to service a movable allocation (page type 1), the page
> allocator runs into a two-pageblock buddy on the movable freelist
> whose second block is typed as highatomic (page type 3).
>
> This inconsistency is caused by the highatomic reservation system
> operating on single pageblocks, while MAX_ORDER can be bigger than
> that - in this configuration, pageblock_order is 9 while
> MAX_PAGE_ORDER is 10. The test case is observed to make several
> adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
> marks the surrounding block as highatomic. Upon freeing, the blocks
> merge into an order-10 buddy. When the highatomic pool is drained
> later on, this order-10 buddy gets moved back to the movable list, but
> only the first pageblock is marked movable again. A subsequent
> expand() of this buddy warns about the tail being of a different type.
>
> A similar warning could trigger on an order-10 request that only marks
> the first pageblock as highatomic and leaves the second one movable.
>
> This is a long-standing bug that's surfaced by the recent block type
> warnings added to the allocator. The consequences seem mostly benign;
> it just results in odd behavior: the highatomic tail blocks are not
> properly drained, instead they end up on the movable list first, then
> go back to the highatomic list after an alloc-free cycle.
>
> To fix this, make the highatomic reservation code aware that
> allocations/buddies can be larger than a pageblock.
>
> While it's an old quirk, the recently added type consistency warnings
> seem to be the most prominent consequence of it. Set the Fixes: tag
> accordingly to highlight this backporting dependency.
>
> Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
> Reported-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 48 +++++++++++++++++++++++++++++++++---------------
>  1 file changed, 33 insertions(+), 15 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2e22ce5675ca..754ca5821266 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1955,10 +1955,12 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
>  }
>
>  /*
> - * Reserve a pageblock for exclusive use of high-order atomic allocations if
> - * there are no empty page blocks that contain a page with a suitable order
> + * Reserve the pageblock(s) surrounding an allocation request for
> + * exclusive use of high-order atomic allocations if there are no
> + * empty page blocks that contain a page with a suitable order
>   */
> -static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> +static void reserve_highatomic_pageblock(struct page *page, int order,
> +					 struct zone *zone)
>  {
>  	int mt;
>  	unsigned long max_managed, flags;
> @@ -1984,10 +1986,17 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
>  	/* Yoink! */
>  	mt = get_pageblock_migratetype(page);
>  	/* Only reserve normal pageblocks (i.e., they can merge with others) */
> -	if (migratetype_is_mergeable(mt))
> -		if (move_freepages_block(zone, page, mt,
> -					 MIGRATE_HIGHATOMIC) != -1)
> -			zone->nr_reserved_highatomic += pageblock_nr_pages;
> +	if (!migratetype_is_mergeable(mt))
> +		goto out_unlock;
> +
> +	if (order < pageblock_order) {
> +		if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
> +			goto out_unlock;
> +		zone->nr_reserved_highatomic += pageblock_nr_pages;
> +	} else {
> +		change_pageblock_range(page, order, MIGRATE_HIGHATOMIC);

Don't you also need to move (page, order) to the MIGRATE_HIGHATOMIC free list here?

> +		zone->nr_reserved_highatomic += 1 << order;
> +	}

You could do:

zone->nr_reserved_highatomic += max_t(unsigned long, pageblock_nr_pages, 1 << order);

instead. But I have no strong opinion on it.

>
>  out_unlock:
>  	spin_unlock_irqrestore(&zone->lock, flags);
> @@ -1999,7 +2008,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
>   * intense memory pressure but failed atomic allocations should be easier
>   * to recover from than an OOM.
>   *
> - * If @force is true, try to unreserve a pageblock even though highatomic
> + * If @force is true, try to unreserve pageblocks even though highatomic
>   * pageblock is exhausted.
>   */
>  static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> @@ -2041,6 +2050,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>  			 * adjust the count once.
>  			 */
>  			if (is_migrate_highatomic(mt)) {
> +				unsigned long size;
>  				/*
>  				 * It should never happen but changes to
>  				 * locking could inadvertently allow a per-cpu
> @@ -2048,9 +2058,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>  				 * while unreserving so be safe and watch for
>  				 * underflows.
>  				 */
> -				zone->nr_reserved_highatomic -= min(
> -						pageblock_nr_pages,
> -						zone->nr_reserved_highatomic);
> +				size = max(pageblock_nr_pages, 1UL << order);
> +				size = min(size, zone->nr_reserved_highatomic);
> +				zone->nr_reserved_highatomic -= size;
>  			}
>
>  			/*
> @@ -2062,11 +2072,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>  			 * of pageblocks that cannot be completely freed
>  			 * may increase.
>  			 */
> -			ret = move_freepages_block(zone, page, mt,
> -						   ac->migratetype);
> +			if (order < pageblock_order)
> +				ret = move_freepages_block(zone, page, mt,
> +							   ac->migratetype);
> +			else {
> +				move_to_free_list(page, zone, order, mt,
> +						  ac->migratetype);
> +				change_pageblock_range(page, order,
> +						       ac->migratetype);
> +				ret = 1;
> +			}
>  			/*
> -			 * Reserving this block already succeeded, so this should
> -			 * not fail on zone boundaries.
> +			 * Reserving the block(s) already succeeded,
> +			 * so this should not fail on zone boundaries.
>  			 */
>  			WARN_ON_ONCE(ret == -1);
>  			if (ret > 0) {
> -- 
> 2.45.1

This part looks good to me.

--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: page type is 3, passed migratetype is 1 (nr=512)
  2024-05-30  1:51         ` Zi Yan
@ 2024-05-30  3:22           ` Johannes Weiner
  0 siblings, 0 replies; 13+ messages in thread
From: Johannes Weiner @ 2024-05-30  3:22 UTC (permalink / raw)
  To: Zi Yan
  Cc: Christoph Hellwig, Andy Shevchenko, Baolin Wang, linux-mm,
	Andrew Morton, Vlastimil Babka, Mel Gorman

On Wed, May 29, 2024 at 09:51:24PM -0400, Zi Yan wrote:
> On 29 May 2024, at 21:04, Johannes Weiner wrote:
> 
> > On Wed, May 29, 2024 at 12:28:35PM -0400, Johannes Weiner wrote:
> >> Thanks for the config, I was able to reproduce it with.
> >>
> >> On Tue, May 28, 2024 at 12:48:05PM -0400, Johannes Weiner wrote:
> >>> Ah, but there DOES seem to be an issue with how we reserve
> >>> highatomics: reserving and unreserving happens one pageblock at a
> >>> time, but MAX_ORDER is usually bigger. If we rmqueue() an order-10
> >>> request, reserve_highatomic_block() will only convert the first
> >>> order-9 block in it; the tail will remain the original type, which
> >>> will produce a buddy of mixed type blocks upon freeing.
> >>>
> >>> This doesn't fully explain the warning here. We'd expect to see it the
> >>> other way round - passing an assumed type of 3 (HIGHATOMIC) for the
> >>> remainder that is actually 1 (MOVABLE). But the pageblock-based
> >>> reservations look fishy. I'll cook up a patch to make this
> >>> range-based. It might just fix it in a way I'm not seeing just yet.
> >>
> >> tl;dr: With some debugging printks, I was able to see the
> >> issue. Should be a straight-forward fix.
> >>
> >> No order-10 allocations are necessary. Instead, smaller allocations
> >> grab blocks for the highatomic pool. Unreserving is lazy, so as those
> >> allocations get freed, they have a chance to merge together. Two
> >> adjacent highatomic blocks can merge (MAX_ORDER > pageblock_order). On
> >> unreserve, we now have an order-10 highatomic buddy, but only clear
> >> the type on the first order-9 pageblock. A subsequent alloc + expand
> >> will warn about this type inconsistency.
> >>
> >> [  411.188518] UNRESERVE: pfn=26000 order=10
> >> [  411.188739] ------------[ cut here ]------------
> >> [  411.188881] 26200: page type is 3, passed migratetype is 1 (nr=512)
> >> [  411.189097] WARNING: CPU: 1 PID: 10152 at mm/page_alloc.c:645 expand+0x1c8/0x1f0
> >>
> >> I have a draft patch to make the highatomic reservations update all
> >> blocks inside the range, not just the first one. I'll send it out as
> >> soon as I have tested it properly.
> >
> > The patch below fixes the issue for me and survives a few hours of
> > xfstests generic/176.
> >
> > Christoph, can you please confirm it addresses what you saw?
> >
> > Vlastimil, Zi, Mel, can you please take a look?
> >
> > Thanks!
> >
> > ---
> >
> > From a0ba36bb9cd7404c07c7a56139fd52efc6981ced Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Wed, 29 May 2024 18:18:12 -0400
> > Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
> >
> > Christoph reports a page allocator splat triggered by xfstests:
> >
> > generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
> > [] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
> > [] XFS (nvme0n1): Ending clean mount
> > [] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> > [] XFS (nvme1n1): Ending clean mount
> > [] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
> > [] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
> > [] XFS (nvme1n1): Ending clean mount
> > [] ------------[ cut here ]------------
> > [] page type is 3, passed migratetype is 1 (nr=512)
> > [] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
> > [] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
> > [] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
> > [] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> > [] RIP: 0010:expand+0x1c5/0x1f0
> > [] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
> > [] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
> > [] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
> > [] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
> > [] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
> > [] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
> > [] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
> > [] FS:  00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
> > [] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
> > [] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> > [] PKRU: 55555554
> > [] Call Trace:
> > []  <TASK>
> > []  ? __warn+0x7b/0x120
> > []  ? expand+0x1c5/0x1f0
> > []  ? report_bug+0x191/0x1c0
> > []  ? handle_bug+0x3c/0x80
> > []  ? exc_invalid_op+0x17/0x70
> > []  ? asm_exc_invalid_op+0x1a/0x20
> > []  ? expand+0x1c5/0x1f0
> > []  ? expand+0x1c5/0x1f0
> > []  __rmqueue_pcplist+0x3a9/0x730
> > []  get_page_from_freelist+0x7a0/0xf00
> > []  __alloc_pages_noprof+0x153/0x2e0
> > []  __folio_alloc_noprof+0x10/0xa0
> > []  __filemap_get_folio+0x16b/0x370
> > []  iomap_write_begin+0x496/0x680
> >
> > While trying to service a movable allocation (page type 1), the page
> > allocator runs into a two-pageblock buddy on the movable freelist
> > whose second block is typed as highatomic (page type 3).
> >
> > This inconsistency is caused by the highatomic reservation system
> > operating on single pageblocks, while MAX_ORDER can be bigger than
> > that - in this configuration, pageblock_order is 9 while
> > MAX_PAGE_ORDER is 10. The test case is observed to make several
> > adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
> > marks the surrounding block as highatomic. Upon freeing, the blocks
> > merge into an order-10 buddy. When the highatomic pool is drained
> > later on, this order-10 buddy gets moved back to the movable list, but
> > only the first pageblock is marked movable again. A subsequent
> > expand() of this buddy warns about the tail being of a different type.
> >
> > A similar warning could trigger on an order-10 request that only marks
> > the first pageblock as highatomic and leaves the second one movable.
> >
> > This is a long-standing bug that's surfaced by the recent block type
> > warnings added to the allocator. The consequences seem mostly benign;
> > it just results in odd behavior: the highatomic tail blocks are not
> > properly drained, instead they end up on the movable list first, then
> > go back to the highatomic list after an alloc-free cycle.
> >
> > To fix this, make the highatomic reservation code aware that
> > allocations/buddies can be larger than a pageblock.
> >
> > While it's an old quirk, the recently added type consistency warnings
> > seem to be the most prominent consequence of it. Set the Fixes: tag
> > accordingly to highlight this backporting dependency.
> >
> > Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
> > Reported-by: Christoph Hellwig <hch@infradead.org>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/page_alloc.c | 48 +++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 33 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2e22ce5675ca..754ca5821266 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1955,10 +1955,12 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
> >  }
> >
> >  /*
> > - * Reserve a pageblock for exclusive use of high-order atomic allocations if
> > - * there are no empty page blocks that contain a page with a suitable order
> > + * Reserve the pageblock(s) surrounding an allocation request for
> > + * exclusive use of high-order atomic allocations if there are no
> > + * empty page blocks that contain a page with a suitable order
> >   */
> > -static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> > +static void reserve_highatomic_pageblock(struct page *page, int order,
> > +					 struct zone *zone)
> >  {
> >  	int mt;
> >  	unsigned long max_managed, flags;
> > @@ -1984,10 +1986,17 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> >  	/* Yoink! */
> >  	mt = get_pageblock_migratetype(page);
> >  	/* Only reserve normal pageblocks (i.e., they can merge with others) */
> > -	if (migratetype_is_mergeable(mt))
> > -		if (move_freepages_block(zone, page, mt,
> > -					 MIGRATE_HIGHATOMIC) != -1)
> > -			zone->nr_reserved_highatomic += pageblock_nr_pages;
> > +	if (!migratetype_is_mergeable(mt))
> > +		goto out_unlock;
> > +
> > +	if (order < pageblock_order) {
> > +		if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
> > +			goto out_unlock;
> > +		zone->nr_reserved_highatomic += pageblock_nr_pages;
> > +	} else {
> > +		change_pageblock_range(page, order, MIGRATE_HIGHATOMIC);
> 
> Don't you also need to move (page, order) to the MIGRATE_HIGHATOMIC free list here?

Reserving happens after rmqueue(), so the page is already removed from
the freelist. If the page is smaller than a pageblock, there might be
other buddies in the block that need to be moved, but if the page is
one or more whole pageblocks, there are no other buddies in the range.

> > +		zone->nr_reserved_highatomic += 1 << order;
> > +	}
> 
> You could do:
> 
> zone->nr_reserved_highatomic += max_t(unsigned long, pageblock_nr_pages, 1 << order);
> 
> instead. But I have no strong opinion on it.

Hm, I played around with it, but since we have the if-else already, I
figure it's a bit more readable to split the updates rather than to
have a more complicated max() expression before/after the branch.

I would prefer it in the unreserve_highatomic_pageblock() path as
well, but it still has that "migratetype change is racy" check, which
makes that impractical. That check seems no longer needed, though, now
that we take zone->lock for every type test/set, so I was planning to
send a follow-up to remove that.

> >  out_unlock:
> >  	spin_unlock_irqrestore(&zone->lock, flags);
> > @@ -1999,7 +2008,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
> >   * intense memory pressure but failed atomic allocations should be easier
> >   * to recover from than an OOM.
> >   *
> > - * If @force is true, try to unreserve a pageblock even though highatomic
> > + * If @force is true, try to unreserve pageblocks even though highatomic
> >   * pageblock is exhausted.
> >   */
> >  static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> > @@ -2041,6 +2050,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> >  			 * adjust the count once.
> >  			 */
> >  			if (is_migrate_highatomic(mt)) {
> > +				unsigned long size;
> >  				/*
> >  				 * It should never happen but changes to
> >  				 * locking could inadvertently allow a per-cpu
> > @@ -2048,9 +2058,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> >  				 * while unreserving so be safe and watch for
> >  				 * underflows.
> >  				 */
> > -				zone->nr_reserved_highatomic -= min(
> > -						pageblock_nr_pages,
> > -						zone->nr_reserved_highatomic);
> > +				size = max(pageblock_nr_pages, 1UL << order);
> > +				size = min(size, zone->nr_reserved_highatomic);
> > +				zone->nr_reserved_highatomic -= size;
> >  			}
> >
> >  			/*
> > @@ -2062,11 +2072,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> >  			 * of pageblocks that cannot be completely freed
> >  			 * may increase.
> >  			 */
> > -			ret = move_freepages_block(zone, page, mt,
> > -						   ac->migratetype);
> > +			if (order < pageblock_order)
> > +				ret = move_freepages_block(zone, page, mt,
> > +							   ac->migratetype);
> > +			else {
> > +				move_to_free_list(page, zone, order, mt,
> > +						  ac->migratetype);
> > +				change_pageblock_range(page, order,
> > +						       ac->migratetype);
> > +				ret = 1;
> > +			}
> >  			/*
> > -			 * Reserving this block already succeeded, so this should
> > -			 * not fail on zone boundaries.
> > +			 * Reserving the block(s) already succeeded,
> > +			 * so this should not fail on zone boundaries.
> >  			 */
> >  			WARN_ON_ONCE(ret == -1);
> >  			if (ret > 0) {
> > -- 
> > 2.45.1
> 
> This part looks good to me.

Thanks for taking a look!


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
  2024-05-30  1:04       ` Johannes Weiner
  2024-05-30  1:51         ` Zi Yan
@ 2024-05-30  4:06         ` kernel test robot
  2024-05-30 11:42         ` page type is 3, passed migratetype is 1 (nr=512) Johannes Weiner
  2024-05-31  5:41         ` Christoph Hellwig
  3 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2024-05-30  4:06 UTC (permalink / raw)
  To: Johannes Weiner, Christoph Hellwig
  Cc: oe-kbuild-all, Andy Shevchenko, Baolin Wang, linux-mm,
	Andrew Morton, Vlastimil Babka, Zi Yan, Mel Gorman

Hi Johannes,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Johannes-Weiner/mm-page_alloc-fix-highatomic-typing-in-multi-block-buddies/20240530-090639
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240530010419.GA1132939%40cmpxchg.org
patch subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
config: i386-buildonly-randconfig-001-20240530 (https://download.01.org/0day-ci/archive/20240530/202405301134.V8IUApym-lkp@intel.com/config)
compiler: gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240530/202405301134.V8IUApym-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202405301134.V8IUApym-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   mm/page_alloc.c: In function 'get_page_from_freelist':
>> mm/page_alloc.c:3464:68: warning: passing argument 2 of 'reserve_highatomic_pageblock' makes integer from pointer without a cast [-Wint-conversion]
    3464 |                                 reserve_highatomic_pageblock(page, zone);
         |                                                                    ^~~~
         |                                                                    |
         |                                                                    struct zone *
   mm/page_alloc.c:1964:65: note: expected 'int' but argument is of type 'struct zone *'
    1964 | static void reserve_highatomic_pageblock(struct page *page, int order,
         |                                                             ~~~~^~~~~
>> mm/page_alloc.c:3464:33: error: too few arguments to function 'reserve_highatomic_pageblock'
    3464 |                                 reserve_highatomic_pageblock(page, zone);
         |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/page_alloc.c:1964:13: note: declared here
    1964 | static void reserve_highatomic_pageblock(struct page *page, int order,
         |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/reserve_highatomic_pageblock +3464 mm/page_alloc.c

8510e69c8efef8 Joonsoo Kim        2020-08-06  3310  
7fb1d9fca5c6e3 Rohit Seth         2005-11-13  3311  /*
0798e5193cd70f Paul Jackson       2006-12-06  3312   * get_page_from_freelist goes through the zonelist trying to allocate
7fb1d9fca5c6e3 Rohit Seth         2005-11-13  3313   * a page.
7fb1d9fca5c6e3 Rohit Seth         2005-11-13  3314   */
7fb1d9fca5c6e3 Rohit Seth         2005-11-13  3315  static struct page *
a9263751e11a07 Vlastimil Babka    2015-02-11  3316  get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
a9263751e11a07 Vlastimil Babka    2015-02-11  3317  						const struct alloc_context *ac)
753ee728964e5a Martin Hicks       2005-06-21  3318  {
6bb154504f8b49 Mel Gorman         2018-12-28  3319  	struct zoneref *z;
5117f45d11a9ee Mel Gorman         2009-06-16  3320  	struct zone *zone;
8a87d6959f0d81 Wonhyuk Yang       2022-05-12  3321  	struct pglist_data *last_pgdat = NULL;
8a87d6959f0d81 Wonhyuk Yang       2022-05-12  3322  	bool last_pgdat_dirty_ok = false;
6bb154504f8b49 Mel Gorman         2018-12-28  3323  	bool no_fallback;
3b8c0be43cb844 Mel Gorman         2016-07-28  3324  
6bb154504f8b49 Mel Gorman         2018-12-28  3325  retry:
7fb1d9fca5c6e3 Rohit Seth         2005-11-13  3326  	/*
9276b1bc96a132 Paul Jackson       2006-12-06  3327  	 * Scan zonelist, looking for a zone with enough free.
8e4645226b4931 Haifeng Xu         2023-02-28  3328  	 * See also cpuset_node_allowed() comment in kernel/cgroup/cpuset.c.
7fb1d9fca5c6e3 Rohit Seth         2005-11-13  3329  	 */
6bb154504f8b49 Mel Gorman         2018-12-28  3330  	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
6bb154504f8b49 Mel Gorman         2018-12-28  3331  	z = ac->preferred_zoneref;
30d8ec73e8772b Mateusz Nosek      2020-10-13  3332  	for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,
30d8ec73e8772b Mateusz Nosek      2020-10-13  3333  					ac->nodemask) {
be06af002f6d50 Mel Gorman         2016-05-19  3334  		struct page *page;
e085dbc52fad8d Johannes Weiner    2013-09-11  3335  		unsigned long mark;
e085dbc52fad8d Johannes Weiner    2013-09-11  3336  
664eeddeef6539 Mel Gorman         2014-06-04  3337  		if (cpusets_enabled() &&
664eeddeef6539 Mel Gorman         2014-06-04  3338  			(alloc_flags & ALLOC_CPUSET) &&
002f290627c270 Vlastimil Babka    2016-05-19  3339  			!__cpuset_zone_allowed(zone, gfp_mask))
cd38b115d5ad79 Mel Gorman         2011-07-25  3340  				continue;
a756cf5908530e Johannes Weiner    2012-01-10  3341  		/*
a756cf5908530e Johannes Weiner    2012-01-10  3342  		 * When allocating a page cache page for writing, we
281e37265f2826 Mel Gorman         2016-07-28  3343  		 * want to get it from a node that is within its dirty
281e37265f2826 Mel Gorman         2016-07-28  3344  		 * limit, such that no single node holds more than its
a756cf5908530e Johannes Weiner    2012-01-10  3345  		 * proportional share of globally allowed dirty pages.
281e37265f2826 Mel Gorman         2016-07-28  3346  		 * The dirty limits take into account the node's
a756cf5908530e Johannes Weiner    2012-01-10  3347  		 * lowmem reserves and high watermark so that kswapd
a756cf5908530e Johannes Weiner    2012-01-10  3348  		 * should be able to balance it without having to
a756cf5908530e Johannes Weiner    2012-01-10  3349  		 * write pages from its LRU list.
a756cf5908530e Johannes Weiner    2012-01-10  3350  		 *
a756cf5908530e Johannes Weiner    2012-01-10  3351  		 * XXX: For now, allow allocations to potentially
281e37265f2826 Mel Gorman         2016-07-28  3352  		 * exceed the per-node dirty limit in the slowpath
c9ab0c4fbeb020 Mel Gorman         2015-11-06  3353  		 * (spread_dirty_pages unset) before going into reclaim,
a756cf5908530e Johannes Weiner    2012-01-10  3354  		 * which is important when on a NUMA setup the allowed
281e37265f2826 Mel Gorman         2016-07-28  3355  		 * nodes are together not big enough to reach the
a756cf5908530e Johannes Weiner    2012-01-10  3356  		 * global limit.  The proper fix for these situations
281e37265f2826 Mel Gorman         2016-07-28  3357  		 * will require awareness of nodes in the
a756cf5908530e Johannes Weiner    2012-01-10  3358  		 * dirty-throttling and the flusher threads.
a756cf5908530e Johannes Weiner    2012-01-10  3359  		 */
3b8c0be43cb844 Mel Gorman         2016-07-28  3360  		if (ac->spread_dirty_pages) {
8a87d6959f0d81 Wonhyuk Yang       2022-05-12  3361  			if (last_pgdat != zone->zone_pgdat) {
8a87d6959f0d81 Wonhyuk Yang       2022-05-12  3362  				last_pgdat = zone->zone_pgdat;
8a87d6959f0d81 Wonhyuk Yang       2022-05-12  3363  				last_pgdat_dirty_ok = node_dirty_ok(zone->zone_pgdat);
8a87d6959f0d81 Wonhyuk Yang       2022-05-12  3364  			}
3b8c0be43cb844 Mel Gorman         2016-07-28  3365  
8a87d6959f0d81 Wonhyuk Yang       2022-05-12  3366  			if (!last_pgdat_dirty_ok)
800a1e750c7b04 Mel Gorman         2014-06-04  3367  				continue;
3b8c0be43cb844 Mel Gorman         2016-07-28  3368  		}
7fb1d9fca5c6e3 Rohit Seth         2005-11-13  3369  
6bb154504f8b49 Mel Gorman         2018-12-28  3370  		if (no_fallback && nr_online_nodes > 1 &&
6bb154504f8b49 Mel Gorman         2018-12-28  3371  		    zone != ac->preferred_zoneref->zone) {
6bb154504f8b49 Mel Gorman         2018-12-28  3372  			int local_nid;
6bb154504f8b49 Mel Gorman         2018-12-28  3373  
6bb154504f8b49 Mel Gorman         2018-12-28  3374  			/*
6bb154504f8b49 Mel Gorman         2018-12-28  3375  			 * If moving to a remote node, retry but allow
6bb154504f8b49 Mel Gorman         2018-12-28  3376  			 * fragmenting fallbacks. Locality is more important
6bb154504f8b49 Mel Gorman         2018-12-28  3377  			 * than fragmentation avoidance.
6bb154504f8b49 Mel Gorman         2018-12-28  3378  			 */
6bb154504f8b49 Mel Gorman         2018-12-28  3379  			local_nid = zone_to_nid(ac->preferred_zoneref->zone);
6bb154504f8b49 Mel Gorman         2018-12-28  3380  			if (zone_to_nid(zone) != local_nid) {
6bb154504f8b49 Mel Gorman         2018-12-28  3381  				alloc_flags &= ~ALLOC_NOFRAGMENT;
6bb154504f8b49 Mel Gorman         2018-12-28  3382  				goto retry;
6bb154504f8b49 Mel Gorman         2018-12-28  3383  			}
6bb154504f8b49 Mel Gorman         2018-12-28  3384  		}
6bb154504f8b49 Mel Gorman         2018-12-28  3385  
57c0419c5f0ea2 Huang Ying         2023-10-16  3386  		/*
57c0419c5f0ea2 Huang Ying         2023-10-16  3387  		 * Detect whether the number of free pages is below high
57c0419c5f0ea2 Huang Ying         2023-10-16  3388  		 * watermark.  If so, we will decrease pcp->high and free
57c0419c5f0ea2 Huang Ying         2023-10-16  3389  		 * PCP pages in free path to reduce the possibility of
57c0419c5f0ea2 Huang Ying         2023-10-16  3390  		 * premature page reclaiming.  Detection is done here to
57c0419c5f0ea2 Huang Ying         2023-10-16  3391  		 * avoid to do that in hotter free path.
57c0419c5f0ea2 Huang Ying         2023-10-16  3392  		 */
57c0419c5f0ea2 Huang Ying         2023-10-16  3393  		if (test_bit(ZONE_BELOW_HIGH, &zone->flags))
57c0419c5f0ea2 Huang Ying         2023-10-16  3394  			goto check_alloc_wmark;
57c0419c5f0ea2 Huang Ying         2023-10-16  3395  
57c0419c5f0ea2 Huang Ying         2023-10-16  3396  		mark = high_wmark_pages(zone);
57c0419c5f0ea2 Huang Ying         2023-10-16  3397  		if (zone_watermark_fast(zone, order, mark,
57c0419c5f0ea2 Huang Ying         2023-10-16  3398  					ac->highest_zoneidx, alloc_flags,
57c0419c5f0ea2 Huang Ying         2023-10-16  3399  					gfp_mask))
57c0419c5f0ea2 Huang Ying         2023-10-16  3400  			goto try_this_zone;
57c0419c5f0ea2 Huang Ying         2023-10-16  3401  		else
57c0419c5f0ea2 Huang Ying         2023-10-16  3402  			set_bit(ZONE_BELOW_HIGH, &zone->flags);
57c0419c5f0ea2 Huang Ying         2023-10-16  3403  
57c0419c5f0ea2 Huang Ying         2023-10-16  3404  check_alloc_wmark:
a921444382b49c Mel Gorman         2018-12-28  3405  		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
48ee5f3696f624 Mel Gorman         2016-05-19  3406  		if (!zone_watermark_fast(zone, order, mark,
f80b08fc44536a Charan Teja Reddy  2020-08-06  3407  				       ac->highest_zoneidx, alloc_flags,
f80b08fc44536a Charan Teja Reddy  2020-08-06  3408  				       gfp_mask)) {
e085dbc52fad8d Johannes Weiner    2013-09-11  3409  			int ret;
fa5e084e43eb14 Mel Gorman         2009-06-16  3410  
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06  3411  			if (has_unaccepted_memory()) {
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06  3412  				if (try_to_accept_memory(zone, order))
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06  3413  					goto try_this_zone;
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06  3414  			}
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06  3415  
c9e97a1997fbf3 Pavel Tatashin     2018-04-05  3416  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
c9e97a1997fbf3 Pavel Tatashin     2018-04-05  3417  			/*
c9e97a1997fbf3 Pavel Tatashin     2018-04-05  3418  			 * Watermark failed for this zone, but see if we can
c9e97a1997fbf3 Pavel Tatashin     2018-04-05  3419  			 * grow this zone if it contains deferred pages.
c9e97a1997fbf3 Pavel Tatashin     2018-04-05  3420  			 */
076cf7ea67010d Anshuman Khandual  2023-01-05  3421  			if (deferred_pages_enabled()) {
c9e97a1997fbf3 Pavel Tatashin     2018-04-05  3422  				if (_deferred_grow_zone(zone, order))
c9e97a1997fbf3 Pavel Tatashin     2018-04-05  3423  					goto try_this_zone;
c9e97a1997fbf3 Pavel Tatashin     2018-04-05  3424  			}
c9e97a1997fbf3 Pavel Tatashin     2018-04-05  3425  #endif
5dab29113ca563 Mel Gorman         2014-06-04  3426  			/* Checked here to keep the fast path fast */
5dab29113ca563 Mel Gorman         2014-06-04  3427  			BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
5dab29113ca563 Mel Gorman         2014-06-04  3428  			if (alloc_flags & ALLOC_NO_WATERMARKS)
5dab29113ca563 Mel Gorman         2014-06-04  3429  				goto try_this_zone;
5dab29113ca563 Mel Gorman         2014-06-04  3430  
202e35db5e719e Dave Hansen        2021-05-04  3431  			if (!node_reclaim_enabled() ||
c33d6c06f60f71 Mel Gorman         2016-05-19  3432  			    !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
cd38b115d5ad79 Mel Gorman         2011-07-25  3433  				continue;
cd38b115d5ad79 Mel Gorman         2011-07-25  3434  
a5f5f91da6ad64 Mel Gorman         2016-07-28  3435  			ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
fa5e084e43eb14 Mel Gorman         2009-06-16  3436  			switch (ret) {
a5f5f91da6ad64 Mel Gorman         2016-07-28  3437  			case NODE_RECLAIM_NOSCAN:
fa5e084e43eb14 Mel Gorman         2009-06-16  3438  				/* did not scan */
cd38b115d5ad79 Mel Gorman         2011-07-25  3439  				continue;
a5f5f91da6ad64 Mel Gorman         2016-07-28  3440  			case NODE_RECLAIM_FULL:
fa5e084e43eb14 Mel Gorman         2009-06-16  3441  				/* scanned but unreclaimable */
cd38b115d5ad79 Mel Gorman         2011-07-25  3442  				continue;
fa5e084e43eb14 Mel Gorman         2009-06-16  3443  			default:
fa5e084e43eb14 Mel Gorman         2009-06-16  3444  				/* did we reclaim enough */
fed2719e7a8612 Mel Gorman         2013-04-29  3445  				if (zone_watermark_ok(zone, order, mark,
97a225e69a1f88 Joonsoo Kim        2020-06-03  3446  					ac->highest_zoneidx, alloc_flags))
fed2719e7a8612 Mel Gorman         2013-04-29  3447  					goto try_this_zone;
fed2719e7a8612 Mel Gorman         2013-04-29  3448  
fed2719e7a8612 Mel Gorman         2013-04-29  3449  				continue;
7fb1d9fca5c6e3 Rohit Seth         2005-11-13  3450  			}
0798e5193cd70f Paul Jackson       2006-12-06  3451  		}
7fb1d9fca5c6e3 Rohit Seth         2005-11-13  3452  
fa5e084e43eb14 Mel Gorman         2009-06-16  3453  try_this_zone:
066b23935578d3 Mel Gorman         2017-02-24  3454  		page = rmqueue(ac->preferred_zoneref->zone, zone, order,
0aaa29a56e4fb0 Mel Gorman         2015-11-06  3455  				gfp_mask, alloc_flags, ac->migratetype);
753791910e23a9 Vlastimil Babka    2015-02-11  3456  		if (page) {
479f854a207ce2 Mel Gorman         2016-05-19  3457  			prep_new_page(page, order, gfp_mask, alloc_flags);
0aaa29a56e4fb0 Mel Gorman         2015-11-06  3458  
0aaa29a56e4fb0 Mel Gorman         2015-11-06  3459  			/*
0aaa29a56e4fb0 Mel Gorman         2015-11-06  3460  			 * If this is a high-order atomic allocation then check
0aaa29a56e4fb0 Mel Gorman         2015-11-06  3461  			 * if the pageblock should be reserved for the future
0aaa29a56e4fb0 Mel Gorman         2015-11-06  3462  			 */
eb2e2b425c6984 Mel Gorman         2023-01-13  3463  			if (unlikely(alloc_flags & ALLOC_HIGHATOMIC))
368d983b985572 ZhangPeng          2023-08-09 @3464  				reserve_highatomic_pageblock(page, zone);
0aaa29a56e4fb0 Mel Gorman         2015-11-06  3465  
753791910e23a9 Vlastimil Babka    2015-02-11  3466  			return page;
c9e97a1997fbf3 Pavel Tatashin     2018-04-05  3467  		} else {
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06  3468  			if (has_unaccepted_memory()) {
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06  3469  				if (try_to_accept_memory(zone, order))
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06  3470  					goto try_this_zone;
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06  3471  			}
dcdfdd40fa82b6 Kirill A. Shutemov 2023-06-06  3472  
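The watermark checks in the listing above reduce to roughly the following decision (an illustrative sketch only; the flag and predicate names are placeholders, not the kernel's actual helpers):

```python
def pick_zone_action(below_high_flag, meets_high_wmark, meets_alloc_wmark):
    """Sketch of the zone watermark flow in get_page_from_freelist():
    a zone not yet marked ZONE_BELOW_HIGH that clears the high watermark
    is used immediately; otherwise the normal allocation watermark for
    this request decides between allocating and reclaim/next-zone."""
    if not below_high_flag and meets_high_wmark:
        return "try_this_zone"
    # high watermark missed (or zone already flagged): fall through
    # to check_alloc_wmark
    if meets_alloc_wmark:
        return "try_this_zone"
    return "reclaim_or_next_zone"

assert pick_zone_action(False, True, False) == "try_this_zone"
assert pick_zone_action(True, False, True) == "try_this_zone"
assert pick_zone_action(True, False, False) == "reclaim_or_next_zone"
```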

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: page type is 3, passed migratetype is 1 (nr=512)
  2024-05-30  1:04       ` Johannes Weiner
  2024-05-30  1:51         ` Zi Yan
  2024-05-30  4:06         ` [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies kernel test robot
@ 2024-05-30 11:42         ` Johannes Weiner
  2024-05-30 14:34           ` Zi Yan
  2024-05-31 13:43           ` Vlastimil Babka
  2024-05-31  5:41         ` Christoph Hellwig
  3 siblings, 2 replies; 13+ messages in thread
From: Johannes Weiner @ 2024-05-30 11:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andy Shevchenko, Baolin Wang, linux-mm, Andrew Morton,
	Vlastimil Babka, Zi Yan, Mel Gorman

On Wed, May 29, 2024 at 09:04:25PM -0400, Johannes Weiner wrote:
> Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies

Argh, I dropped the reserve_highatomic_pageblock() caller update when
removing the printks right before sending out. My apologies. Here is
the fixed version:

---

From 6aa9498ee0d7161b0605251116d16b18cd448552 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 29 May 2024 18:18:12 -0400
Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies

Christoph reports a page allocator splat triggered by xfstests:

generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30
[] XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273
[] XFS (nvme0n1): Ending clean mount
[] XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[] XFS (nvme1n1): Ending clean mount
[] XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850
[] XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9
[] XFS (nvme1n1): Ending clean mount
[] ------------[ cut here ]------------
[] page type is 3, passed migratetype is 1 (nr=512)
[] WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0
[] Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug]
[] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437
[] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[] RIP: 0010:expand+0x1c5/0x1f0
[] Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3
[] RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082
[] RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000
[] RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff
[] RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001
[] R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009
[] R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000
[] FS:  00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000
[] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[] CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0
[] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[] PKRU: 55555554
[] Call Trace:
[]  <TASK>
[]  ? __warn+0x7b/0x120
[]  ? expand+0x1c5/0x1f0
[]  ? report_bug+0x191/0x1c0
[]  ? handle_bug+0x3c/0x80
[]  ? exc_invalid_op+0x17/0x70
[]  ? asm_exc_invalid_op+0x1a/0x20
[]  ? expand+0x1c5/0x1f0
[]  ? expand+0x1c5/0x1f0
[]  __rmqueue_pcplist+0x3a9/0x730
[]  get_page_from_freelist+0x7a0/0xf00
[]  __alloc_pages_noprof+0x153/0x2e0
[]  __folio_alloc_noprof+0x10/0xa0
[]  __filemap_get_folio+0x16b/0x370
[]  iomap_write_begin+0x496/0x680

While trying to service a movable allocation (page type 1), the page
allocator runs into a two-pageblock buddy on the movable freelist
whose second block is typed as highatomic (page type 3).

This inconsistency is caused by the highatomic reservation system
operating on single pageblocks, while MAX_ORDER can be bigger than
that - in this configuration, pageblock_order is 9 while
MAX_PAGE_ORDER is 10. The test case is observed to make several
adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
marks the surrounding block as highatomic. Upon freeing, the blocks
merge into an order-10 buddy. When the highatomic pool is drained
later on, this order-10 buddy gets moved back to the movable list, but
only the first pageblock is marked movable again. A subsequent
expand() of this buddy warns about the tail being of a different type.
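The interaction can be illustrated with a tiny model (purely illustrative, not kernel code; the function names are made up, only the orders match the report):

```python
# Toy model of pageblock typing vs. buddy size, mirroring the reported
# configuration: pageblock_order = 9, MAX_PAGE_ORDER = 10.
PAGEBLOCK_ORDER = 9
MAX_PAGE_ORDER = 10
PAGEBLOCK_NR_PAGES = 1 << PAGEBLOCK_ORDER  # 512

MOVABLE, HIGHATOMIC = 1, 3  # the page types from the warning

# Two adjacent pageblocks; an atomic burst marked both highatomic.
block_type = {0: HIGHATOMIC, 1: HIGHATOMIC}

# On free, the blocks merge into one order-10 buddy spanning both.
buddy_order = MAX_PAGE_ORDER
nr_blocks = (1 << buddy_order) // PAGEBLOCK_NR_PAGES  # 2

def drain_single_block(block_type):
    """Old behavior: draining retypes only the first pageblock."""
    block_type[0] = MOVABLE

def drain_whole_buddy(block_type, nr_blocks):
    """Fixed behavior: retype every pageblock the buddy spans."""
    for i in range(nr_blocks):
        block_type[i] = MOVABLE

drain_single_block(block_type)
# The tail block keeps the stale type -> expand() sees page type 3
# while the passed migratetype is 1.
assert block_type == {0: MOVABLE, 1: HIGHATOMIC}

drain_whole_buddy(block_type, nr_blocks)
assert all(t == MOVABLE for t in block_type.values())
```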

This is a long-standing bug that was surfaced by the recent block type
warnings added to the allocator. The consequences seem mostly benign,
just odd behavior: the highatomic tail blocks are not properly
drained; instead they end up on the movable list first, then go back
to the highatomic list after an alloc-free cycle.

To fix this, make the highatomic reservation code aware that
allocations/buddies can be larger than a pageblock.

While it's an old quirk, the recently added type consistency warnings
seem to be the most prominent consequence of it. Set the Fixes: tag
accordingly to highlight this backporting dependency.

Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 50 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5675ca..222299b5c0e6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1955,10 +1955,12 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 }
 
 /*
- * Reserve a pageblock for exclusive use of high-order atomic allocations if
- * there are no empty page blocks that contain a page with a suitable order
+ * Reserve the pageblock(s) surrounding an allocation request for
+ * exclusive use of high-order atomic allocations if there are no
+ * empty page blocks that contain a page with a suitable order
  */
-static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
+static void reserve_highatomic_pageblock(struct page *page, int order,
+					 struct zone *zone)
 {
 	int mt;
 	unsigned long max_managed, flags;
@@ -1984,10 +1986,17 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
 	/* Yoink! */
 	mt = get_pageblock_migratetype(page);
 	/* Only reserve normal pageblocks (i.e., they can merge with others) */
-	if (migratetype_is_mergeable(mt))
-		if (move_freepages_block(zone, page, mt,
-					 MIGRATE_HIGHATOMIC) != -1)
-			zone->nr_reserved_highatomic += pageblock_nr_pages;
+	if (!migratetype_is_mergeable(mt))
+		goto out_unlock;
+
+	if (order < pageblock_order) {
+		if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
+			goto out_unlock;
+		zone->nr_reserved_highatomic += pageblock_nr_pages;
+	} else {
+		change_pageblock_range(page, order, MIGRATE_HIGHATOMIC);
+		zone->nr_reserved_highatomic += 1 << order;
+	}
 
 out_unlock:
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -1999,7 +2008,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
  * intense memory pressure but failed atomic allocations should be easier
  * to recover from than an OOM.
  *
- * If @force is true, try to unreserve a pageblock even though highatomic
+ * If @force is true, try to unreserve pageblocks even though highatomic
  * pageblock is exhausted.
  */
 static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
@@ -2041,6 +2050,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 * adjust the count once.
 			 */
 			if (is_migrate_highatomic(mt)) {
+				unsigned long size;
 				/*
 				 * It should never happen but changes to
 				 * locking could inadvertently allow a per-cpu
@@ -2048,9 +2058,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 				 * while unreserving so be safe and watch for
 				 * underflows.
 				 */
-				zone->nr_reserved_highatomic -= min(
-						pageblock_nr_pages,
-						zone->nr_reserved_highatomic);
+				size = max(pageblock_nr_pages, 1UL << order);
+				size = min(size, zone->nr_reserved_highatomic);
+				zone->nr_reserved_highatomic -= size;
 			}
 
 			/*
@@ -2062,11 +2072,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 * of pageblocks that cannot be completely freed
 			 * may increase.
 			 */
-			ret = move_freepages_block(zone, page, mt,
-						   ac->migratetype);
+			if (order < pageblock_order)
+				ret = move_freepages_block(zone, page, mt,
+							   ac->migratetype);
+			else {
+				move_to_free_list(page, zone, order, mt,
+						  ac->migratetype);
+				change_pageblock_range(page, order,
+						       ac->migratetype);
+				ret = 1;
+			}
 			/*
-			 * Reserving this block already succeeded, so this should
-			 * not fail on zone boundaries.
+			 * Reserving the block(s) already succeeded,
+			 * so this should not fail on zone boundaries.
 			 */
 			WARN_ON_ONCE(ret == -1);
 			if (ret > 0) {
@@ -3406,7 +3424,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			 * if the pageblock should be reserved for the future
 			 */
 			if (unlikely(alloc_flags & ALLOC_HIGHATOMIC))
-				reserve_highatomic_pageblock(page, zone);
+				reserve_highatomic_pageblock(page, order, zone);
 
 			return page;
 		} else {
-- 
2.45.1
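The reservation-accounting guard in the unreserve path of the patch above can be sketched as follows (an illustrative model with a hardcoded 512-page pageblock, not the kernel code):

```python
PAGEBLOCK_NR_PAGES = 512  # pageblock_order = 9 in the reported config

def reserved_after_unreserve(nr_reserved, order):
    """Model of the patched accounting: a buddy smaller than a pageblock
    still releases a whole pageblock's worth of reservation, a larger
    buddy releases its full size, and the count never underflows."""
    size = max(PAGEBLOCK_NR_PAGES, 1 << order)
    return nr_reserved - min(size, nr_reserved)

assert reserved_after_unreserve(1024, 10) == 0    # order-10 buddy: all 1024
assert reserved_after_unreserve(1024, 3) == 512   # small buddy: one block
assert reserved_after_unreserve(100, 10) == 0     # clamped, no underflow
```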




* Re: page type is 3, passed migratetype is 1 (nr=512)
  2024-05-30 11:42         ` page type is 3, passed migratetype is 1 (nr=512) Johannes Weiner
@ 2024-05-30 14:34           ` Zi Yan
  2024-05-31 13:43           ` Vlastimil Babka
  1 sibling, 0 replies; 13+ messages in thread
From: Zi Yan @ 2024-05-30 14:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Christoph Hellwig, Andy Shevchenko, Baolin Wang, linux-mm,
	Andrew Morton, Vlastimil Babka, Mel Gorman


On 30 May 2024, at 7:42, Johannes Weiner wrote:

> On Wed, May 29, 2024 at 09:04:25PM -0400, Johannes Weiner wrote:
>> Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
>
> Argh, I dropped the reserve_highatomic_pageblock() caller update when
> removing the printks right before sending out. My apologies. Here is
> the fixed version:
>
> ---
>
> From 6aa9498ee0d7161b0605251116d16b18cd448552 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Wed, 29 May 2024 18:18:12 -0400
> Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
>
> Christoph reports a page allocator splat triggered by xfstests:
>
> [...]
>
> While trying to service a movable allocation (page type 1), the page
> allocator runs into a two-pageblock buddy on the movable freelist
> whose second block is typed as highatomic (page type 3).
>
> This inconsistency is caused by the highatomic reservation system
> operating on single pageblocks, while MAX_ORDER can be bigger than
> that - in this configuration, pageblock_order is 9 while
> MAX_PAGE_ORDER is 10. The test case is observed to make several
> adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
> marks the surrounding block as highatomic. Upon freeing, the blocks
> merge into an order-10 buddy. When the highatomic pool is drained
> later on, this order-10 buddy gets moved back to the movable list, but
> only the first pageblock is marked movable again. A subsequent
> expand() of this buddy warns about the tail being of a different type.
>
> This is a long-standing bug that's surfaced by the recent block type
> warnings added to the allocator. The consequences seem mostly benign,
> it just results in odd behavior: the highatomic tail blocks are not
> properly drained, instead they end up on the movable list first, then
> go back to the highatomic list after an alloc-free cycle.
>
> To fix this, make the highatomic reservation code aware that
> allocations/buddies can be larger than a pageblock.
>
> While it's an old quirk, the recently added type consistency warnings
> seem to be the most prominent consequence of it. Set the Fixes: tag
> accordingly to highlight this backporting dependency.
>
> Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
> Reported-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 50 +++++++++++++++++++++++++++++++++----------------
>  1 file changed, 34 insertions(+), 16 deletions(-)
>
The changes look good to me. Thank you for the explanation to my question.

Reviewed-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi



* Re: page type is 3, passed migratetype is 1 (nr=512)
  2024-05-30  1:04       ` Johannes Weiner
                           ` (2 preceding siblings ...)
  2024-05-30 11:42         ` page type is 3, passed migratetype is 1 (nr=512) Johannes Weiner
@ 2024-05-31  5:41         ` Christoph Hellwig
  3 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2024-05-31  5:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Christoph Hellwig, Andy Shevchenko, Baolin Wang, linux-mm,
	Andrew Morton, Vlastimil Babka, Zi Yan, Mel Gorman

On Wed, May 29, 2024 at 09:04:19PM -0400, Johannes Weiner wrote:
> The below fixes the issue for me and survives a few hours of
> xfstest/generic/176 for me.
> 
> Christoph, can you please confirm it addresses what you saw?

I had to pass order in one more place for it to compile, but with
that the issue has gone away over extensive testing:

Tested-by: Christoph Hellwig <hch@lst.de>




* Re: page type is 3, passed migratetype is 1 (nr=512)
  2024-05-30 11:42         ` page type is 3, passed migratetype is 1 (nr=512) Johannes Weiner
  2024-05-30 14:34           ` Zi Yan
@ 2024-05-31 13:43           ` Vlastimil Babka
  1 sibling, 0 replies; 13+ messages in thread
From: Vlastimil Babka @ 2024-05-31 13:43 UTC (permalink / raw)
  To: Johannes Weiner, Christoph Hellwig
  Cc: Andy Shevchenko, Baolin Wang, linux-mm, Andrew Morton, Zi Yan,
	Mel Gorman

On 5/30/24 1:42 PM, Johannes Weiner wrote:
> On Wed, May 29, 2024 at 09:04:25PM -0400, Johannes Weiner wrote:
>> Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
> 
> Argh, I dropped the reserve_highatomic_pageblock() caller update when
> removing the printks right before sending out. My apologies. Here is
> the fixed version:
> 
> ---
> 
> From 6aa9498ee0d7161b0605251116d16b18cd448552 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Wed, 29 May 2024 18:18:12 -0400
> Subject: [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies
> 
> Christoph reports a page allocator splat triggered by xfstests:
> 
> [...]
> 
> While trying to service a movable allocation (page type 1), the page
> allocator runs into a two-pageblock buddy on the movable freelist
> whose second block is typed as highatomic (page type 3).
> 
> This inconsistency is caused by the highatomic reservation system
> operating on single pageblocks, while MAX_ORDER can be bigger than
> that - in this configuration, pageblock_order is 9 while
> MAX_PAGE_ORDER is 10. The test case is observed to make several
> adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which
> marks the surrounding block as highatomic. Upon freeing, the blocks
> merge into an order-10 buddy. When the highatomic pool is drained
> later on, this order-10 buddy gets moved back to the movable list, but
> only the first pageblock is marked movable again. A subsequent
> expand() of this buddy warns about the tail being of a different type.
> 
> This is a long-standing bug that's surfaced by the recent block type
> warnings added to the allocator. The consequences seem mostly benign,
> it just results in odd behavior: the highatomic tail blocks are not
> properly drained, instead they end up on the movable list first, then
> go back to the highatomic list after an alloc-free cycle.
> 
> To fix this, make the highatomic reservation code aware that
> allocations/buddies can be larger than a pageblock.
> 
> While it's an old quirk, the recently added type consistency warnings
> seem to be the most prominent consequence of it. Set the Fixes: tag
> accordingly to highlight this backporting dependency.
> 
> Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting")
> Reported-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Looks good, thanks.

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>




end of thread, other threads:[~2024-05-31 13:43 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-27  8:58 page type is 3, passed migratetype is 1 (nr=512) Christoph Hellwig
2024-05-27 13:14 ` Christoph Hellwig
2024-05-28 16:47   ` Johannes Weiner
2024-05-29  5:43     ` Christoph Hellwig
2024-05-29 16:28     ` Johannes Weiner
2024-05-30  1:04       ` Johannes Weiner
2024-05-30  1:51         ` Zi Yan
2024-05-30  3:22           ` Johannes Weiner
2024-05-30  4:06         ` [PATCH] mm: page_alloc: fix highatomic typing in multi-block buddies kernel test robot
2024-05-30 11:42         ` page type is 3, passed migratetype is 1 (nr=512) Johannes Weiner
2024-05-30 14:34           ` Zi Yan
2024-05-31 13:43           ` Vlastimil Babka
2024-05-31  5:41         ` Christoph Hellwig
