Re: Kernel oops with 6.14 when enabling TLS

linux-mm.kvack.org archive mirror

* Re: Kernel oops with 6.14 when enabling TLS
       [not found] <08c29e4b-2f71-4b6d-8046-27e407214d8c@suse.com>
@ 2025-03-03  7:48 ` Hannes Reinecke
  2025-03-03 11:06   ` Hannes Reinecke
  2025-03-03 14:27   ` Matthew Wilcox
  0 siblings, 2 replies; 36+ messages in thread
From: Hannes Reinecke @ 2025-03-03  7:48 UTC (permalink / raw)
  To: Sagi Grimberg, Matthew Wilcox
  Cc: linux-nvme, linux-block, linux-mm

On 2/28/25 11:47, Hannes Reinecke wrote:
> Hi Sagi,
> 
> enabling TLS on latest linus tree reliably crashes my system:
> 
> [  487.018058] ------------[ cut here ]------------
> [  487.024046] WARNING: CPU: 9 PID: 6159 at mm/slub.c:4719 
> free_large_kmalloc+0x15/0xa0
> [  487.033549] Modules linked in: tls(E) nvme_tcp(E) af_packet(E) 
> iscsi_ibft(E) iscsi_boot_sysfs(E) amd_atl(E) intel_rapl_msr(E) 
> intel_rapl_common(E) amd64_edac(E) edac_mce_amd(E) nls_iso8859_1(E) 
> nls_cp437(E) dax_hmem(E) vfat(E) cxl_acpi(E) fat(E) kvm_amd(E) 
> ipmi_ssif(E) cxl_port(E) xfs(E) tg3(E) cxl_core(E) ipmi_si(E) i40e(E) 
> kvm(E) einj(E) wmi_bmof(E) acpi_cpufreq(E) ipmi_devintf(E) libphy(E) 
> k10temp(E) libie(E) i2c_piix4(E) i2c_smbus(E) ipmi_msghandler(E) 
> i2c_designware_platform(E) i2c_designware_core(E) button(E) 
> nvme_fabrics(E) nvme_keyring(E) fuse(E) efi_pstore(E) configfs(E) 
> dmi_sysfs(E) ip_tables(E) x_tables(E) ahci(E) libahci(E) 
> ghash_clmulni_intel(E) libata(E) sha512_ssse3(E) sd_mod(E) 
> sha256_ssse3(E) ast(E) sha1_ssse3(E) drm_client_lib(E) scsi_dh_emc(E) 
> i2c_algo_bit(E) aesni_intel(E) drm_shmem_helper(E) scsi_dh_rdac(E) 
> crypto_simd(E) cryptd(E) xhci_pci(E) drm_kms_helper(E) scsi_dh_alua(E) 
> nvme(E) sg(E) xhci_hcd(E) nvme_core(E) scsi_mod(E) drm(E) nvme_auth(E) 
> scsi_common(E) usbcore(E) sp5100_tco(E) ccp(E)
> [  487.033696]  wmi(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) 
> efivarfs(E)
> [  487.033707] CPU: 9 UID: 0 PID: 6159 Comm: nvme Kdump: loaded Tainted: 
> G        W   E      6.14.0-rc4-default+ #292 
> f1e35f01b401c038558e67f3c2d644747de50dbd
> [  487.033713] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE
> [  487.033715] Hardware name: Lenovo ThinkSystem SR655V3/SB27B09914, 
> BIOS KAE111E-2.10 04/11/2023
> [  487.033717] RIP: 0010:free_large_kmalloc+0x15/0xa0
> [  487.033722] Code: 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 
> 90 90 90 90 0f 1f 44 00 00 55 53 48 89 fb 48 83 ec 08 48 8b 07 a8 40 75 
> 4b <0f> 0b 80 3d 24 4e 8e 01 00 ba 00 f0 ff ff 74 5d 9c 58 0f 1f 40 00
> [  487.205280] RSP: 0018:ff4de44e432a7b70 EFLAGS: 00010246
> [  487.205284] RAX: 000fffffc0000000 RBX: ffd659b280210b80 RCX: 
> ff42118489e0cd80
> [  487.205286] RDX: 0000000000000000 RSI: ff4211848842e000 RDI: 
> ffd659b280210b80
> [  487.205288] RBP: ff4de44e432a7be0 R08: 0000000000000001 R09: 
> 0000000000000002
> [  487.205289] R10: ff4de44e432a7c00 R11: 0000000000000104 R12: 
> ffd659b280210b80
> [  487.205291] R13: ff4211848842e000 R14: ff421186d0696520 R15: 
> ff421186e19c4000
> [  487.205292] FS:  00007f66b8ffd800(0000) GS:ff4211874d980000(0000) 
> knlGS:0000000000000000
> [  487.205294] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  487.205296] CR2: 00007f66b9150d7e CR3: 0000000281a5e005 CR4: 
> 0000000000771ef0
> [  487.205298] PKRU: 55555554
> [  487.205306] Call Trace:
> [  487.205309]  <TASK>
> [  487.205314]  ? __warn+0x85/0x130
> [  487.296763]  ? free_large_kmalloc+0x15/0xa0
> [  487.296772]  ? report_bug+0xf8/0x1e0
> [  487.296779]  ? handle_bug+0x50/0xa0
> [  487.296783]  ? exc_invalid_op+0x13/0x60
> [  487.296786]  ? asm_exc_invalid_op+0x16/0x20
> [  487.296796]  ? free_large_kmalloc+0x15/0xa0
> [  487.296801]  kfree+0x234/0x320
> [  487.332065]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  487.332075]  ? nvmf_connect_admin_queue+0x105/0x1a0 [nvme_fabrics 
> 34d997d53c805aa2fae8e8baee6a736e8da38358]
> [  487.332081]  ? nvmf_connect_admin_queue+0xa1/0x1a0 [nvme_fabrics 
> 34d997d53c805aa2fae8e8baee6a736e8da38358]
> [  487.332084]  nvmf_connect_admin_queue+0x105/0x1a0 [nvme_fabrics 
> 34d997d53c805aa2fae8e8baee6a736e8da38358]
> [  487.332093]  nvme_tcp_start_queue+0x18f/0x310 [nvme_tcp 
> 68f6be106f52ac467179f8a0922f02aeb6fa1f1c]
> [  487.332102]  nvme_tcp_setup_ctrl+0xf8/0x700 [nvme_tcp 
> 68f6be106f52ac467179f8a0922f02aeb6fa1f1c]
> [  487.394479]  ? nvme_change_ctrl_state+0x99/0x1b0 [nvme_core 
> 22f0ce18ead628230226a9b87ebf48eb576bf299]
> [  487.394495]  nvme_tcp_create_ctrl+0x2e3/0x4d0 [nvme_tcp 
> 68f6be106f52ac467179f8a0922f02aeb6fa1f1c]
> [  487.394503]  nvmf_dev_write+0x323/0x3d0 [nvme_fabrics 
> 34d997d53c805aa2fae8e8baee6a736e8da38358]
> [  487.394508]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  487.394514]  vfs_write+0xd9/0x430
> [  487.394521]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  487.394523]  ? __handle_mm_fault+0x7da/0xca0
> [  487.394531]  ksys_write+0x68/0xe0
> [  487.394536]  do_syscall_64+0x74/0x160
> [  487.394543]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  487.394545]  ? __count_memcg_events+0x98/0x130
> [  487.394550]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  487.476947]  ? count_memcg_events.constprop.163+0x1a/0x30
> [  487.476956]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  487.476960]  ? handle_mm_fault+0xa1/0x290
> [  487.476966]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  487.476968]  ? do_user_addr_fault+0x56b/0x830
> [  487.476975]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  487.476977]  ? exc_page_fault+0x68/0x150
> [  487.476983]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  487.476989] RIP: 0033:0x7f66b91216f0
> [  487.476994] Code: 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 
> 90 90 90 90 90 90 90 90 90 80 3d 19 c3 0e 00 00 74 17 b8 01 00 00 00 0f 
> 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
> [  487.551619] RSP: 002b:00007ffef5393ed8 EFLAGS: 00000202 ORIG_RAX: 
> 0000000000000001
> [  487.551623] RAX: ffffffffffffffda RBX: 000055e61c2113b0 RCX: 
> 00007f66b91216f0
> [  487.551625] RDX: 00000000000000ed RSI: 000055e61c2113b0 RDI: 
> 0000000000000003
> [  487.551627] RBP: 0000000000000003 R08: 00000000000000ed R09: 
> 000055e61c2113b0
> [  487.551628] R10: 0000000000000000 R11: 0000000000000202 R12: 
> 00000000000000ed
> [  487.551630] R13: 00007f66b9380008 R14: 000055e61c20a980 R15: 
> 000055e61c20b100
> [  487.551637]  </TASK>
> [  487.551639] ---[ end trace 0000000000000000 ]---
> [  487.551642] object pointer: 0x00000000346cb6fc
> [  487.554112] nvme nvme1: creating 32 I/O queues.
> [  489.396262] nvme nvme1: mapped 32/0/0 default/read/poll queues.
> [  489.405197] Oops: general protection fault, probably for
> non-canonical address 0xdead000000000100: 0000 [#1] PREEMPT SMP NOPTI
> [  489.418790] CPU: 9 UID: 0 PID: 6159 Comm: nvme Kdump: loaded Tainted: 
> G        W   E      6.14.0-rc4-default+ #292 
> f1e35f01b401c038558e67f3c2d644747de50dbd
> [  489.435212] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE
> [  489.441381] Hardware name: Lenovo ThinkSystem SR655V3/SB27B09914, 
> BIOS KAE111E-2.10 04/11/2023
> [  489.451841] RIP: 0010:__rmqueue_pcplist+0xe1/0xc80
> [  489.458016] Code: 06 48 83 e8 08 48 89 44 24 70 48 8b 45 00 48 39 c5 
> 0f 84 72 01 00 00 48 8b 55 00 48 8b 32 48 8b 4a 08 48 8d 42 f8 48 89 4e 
> 08 <48> 89 31 48 b9 00 01 00 00 00 00 ad de 48 89 0a 48 8b 4c 24 20 48
> [  489.479905] RSP: 0018:ff4de44e432a7688 EFLAGS: 00010293
> [  489.486567] RAX: ffd659b280210a00 RBX: ff4211874d9bf400 RCX: 
> dead000000000100
> [  489.495370] RDX: ffd659b280210a08 RSI: ffd659b281ed7e08 RDI: 
> ff421184fffd60c0
> [  489.504174] RBP: ff4211874d9bf4b0 R08: ff4211874d9bf400 R09: 
> ff4211874d9bf4b0
> [  489.512976] R10: 0000000000002acd R11: 0000000000000002 R12: 
> 0000000000000003
> [  489.521779] R13: 0000000000000003 R14: 0000000000252800 R15: 
> ff421184fffd60c0
> [  489.530584] FS:  00007f66b8ffd800(0000) GS:ff4211874d980000(0000) 
> knlGS:0000000000000000
> [  489.540460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  489.547700] CR2: 00007f66b9150d7e CR3: 0000000281a5e005 CR4: 
> 0000000000771ef0
> [  489.556502] PKRU: 55555554
> [  489.560337] Call Trace:
> [  489.563876]  <TASK>
> [  489.567022]  ? __die_body+0x1a/0x60
> [  489.571728]  ? die_addr+0x38/0x60
> [  489.576241]  ? exc_general_protection+0x19e/0x430
> [  489.582325]  ? asm_exc_general_protection+0x22/0x30
> [  489.588603]  ? __rmqueue_pcplist+0xe1/0xc80
> [  489.594092]  ? __rmqueue_pcplist+0x51b/0xc80
> [  489.599687]  get_page_from_freelist+0xe10/0x1680
> [  489.605675]  __alloc_frozen_pages_noprof+0x171/0x340
> [  489.612048]  new_slab+0x90/0x4d0
> [  489.616466]  ___slab_alloc+0x6f3/0xb20
> [  489.621469]  ? sbitmap_init_node+0x77/0x1a0
> [  489.626961]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  489.633134]  ? _get_random_bytes.part.18+0x90/0x120
> [  489.639409]  ? __slab_alloc.isra.98+0x22/0x40
> [  489.645101]  __slab_alloc.isra.98+0x22/0x40
> [  489.650597]  __kmalloc_node_noprof+0x218/0x510
> [  489.656380]  ? sbitmap_init_node+0x77/0x1a0
> [  489.661874]  ? sbitmap_init_node+0x77/0x1a0
> [  489.667362]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  489.673534]  sbitmap_init_node+0x77/0x1a0
> [  489.678830]  sbitmap_queue_init_node+0x24/0x150
> [  489.684709]  blk_mq_init_tags+0x7e/0x110
> [  489.689915]  blk_mq_alloc_map_and_rqs+0x44/0x320
> [  489.695898]  __blk_mq_alloc_map_and_rqs+0x3b/0x60
> [  489.701973]  blk_mq_alloc_tag_set+0x1f1/0x380
> [  489.707662]  nvme_alloc_io_tag_set+0xc2/0x1a0 [nvme_core 
> 22f0ce18ead628230226a9b87ebf48eb576bf299]
> [  489.718534]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  489.724707]  ? nvme_tcp_alloc_queue+0x293/0x7b0 [nvme_tcp 
> 68f6be106f52ac467179f8a0922f02aeb6fa1f1c]
> [  489.735676]  ? __pfx_nvme_tcp_tls_done+0x10/0x10 [nvme_tcp 
> 68f6be106f52ac467179f8a0922f02aeb6fa1f1c]
> [  489.746737]  nvme_tcp_setup_ctrl+0x3ee/0x700 [nvme_tcp 
> 68f6be106f52ac467179f8a0922f02aeb6fa1f1c]
> [  489.757401]  nvme_tcp_create_ctrl+0x2e3/0x4d0 [nvme_tcp 
> 68f6be106f52ac467179f8a0922f02aeb6fa1f1c]
> [  489.768165]  nvmf_dev_write+0x323/0x3d0 [nvme_fabrics 
> 34d997d53c805aa2fae8e8baee6a736e8da38358]
> [  489.778732]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  489.784906]  vfs_write+0xd9/0x430
> [  489.789422]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  489.795591]  ? __handle_mm_fault+0x7da/0xca0
> [  489.801183]  ksys_write+0x68/0xe0
> [  489.805700]  do_syscall_64+0x74/0x160
> [  489.810616]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  489.816794]  ? __count_memcg_events+0x98/0x130
> [  489.822580]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  489.828750]  ? count_memcg_events.constprop.163+0x1a/0x30
> [  489.835606]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  489.841777]  ? handle_mm_fault+0xa1/0x290
> [  489.847070]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  489.853241]  ? do_user_addr_fault+0x56b/0x830
> [  489.858928]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  489.865101]  ? exc_page_fault+0x68/0x150
> [  489.870302]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  489.876769] RIP: 0033:0x7f66b91216f0
> [  489.881573] Code: 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 
> 90 90 90 90 90 90 90 90 90 80 3d 19 c3 0e 00 00 74 17 b8 01 00 00 00 0f 
> 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
> [  489.903462] RSP: 002b:00007ffef5393ed8 EFLAGS: 00000202 ORIG_RAX: 
> 0000000000000001
> [  489.912761] RAX: ffffffffffffffda RBX: 000055e61c2113b0 RCX: 
> 00007f66b91216f0
> [  489.921564] RDX: 00000000000000ed RSI: 000055e61c2113b0 RDI: 
> 0000000000000003
> [  489.930366] RBP: 0000000000000003 R08: 00000000000000ed R09: 
> 000055e61c2113b0
> [  489.939169] R10: 0000000000000000 R11: 0000000000000202 R12: 
> 00000000000000ed
> [  489.947970] R13: 00007f66b9380008 R14: 000055e61c20a980 R15: 
> 000055e61c20b100
> [  489.956783]  </TASK>
> [  489.960024] Modules linked in: tls(E) nvme_tcp(E) af_packet(E) 
> iscsi_ibft(E) iscsi_boot_sysfs(E) amd_atl(E) intel_rapl_msr(E) 
> intel_rapl_common(E) amd64_edac(E) edac_mce_amd(E) nls_iso8859_1(E) 
> nls_cp437(E) dax_hmem(E) vfat(E) cxl_acpi(E) fat(E) kvm_amd(E) 
> ipmi_ssif(E) cxl_port(E) xfs(E) tg3(E) cxl_core(E) ipmi_si(E) i40e(E) 
> kvm(E) einj(E) wmi_bmof(E) acpi_cpufreq(E) ipmi_devintf(E) libphy(E) 
> k10temp(E) libie(E) i2c_piix4(E) i2c_smbus(E) ipmi_msghandler(E) 
> i2c_designware_platform(E) i2c_designware_core(E) button(E) 
> nvme_fabrics(E) nvme_keyring(E) fuse(E) efi_pstore(E) configfs(E) 
> dmi_sysfs(E) ip_tables(E) x_tables(E) ahci(E) libahci(E) 
> ghash_clmulni_intel(E) libata(E) sha512_ssse3(E) sd_mod(E) 
> sha256_ssse3(E) ast(E) sha1_ssse3(E) drm_client_lib(E) scsi_dh_emc(E) 
> i2c_algo_bit(E) aesni_intel(E) drm_shmem_helper(E) scsi_dh_rdac(E) 
> crypto_simd(E) cryptd(E) xhci_pci(E) drm_kms_helper(E) scsi_dh_alua(E) 
> nvme(E) sg(E) xhci_hcd(E) nvme_core(E) scsi_mod(E) drm(E) nvme_auth(E) 
> scsi_common(E) usbcore(E) sp5100_tco(E) ccp(E)
> [  489.960207]  wmi(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) 
> efivarfs(E)
> 
> Haven't found a culprit for that one for now, started bisecting.
> Just wanted to report that as a heads-up, maybe you have some idea.
> 

bisect is pointing to
9aec2fb0fd5e ("slab: allocate frozen pages")
and, indeed, reverting this patch on top of linus current resolves
the issue.

Sorry Matthew.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.com                               +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-03  7:48 ` Kernel oops with 6.14 when enabling TLS Hannes Reinecke
@ 2025-03-03 11:06   ` Hannes Reinecke
  2025-03-03 12:57     ` Hannes Reinecke
  2025-03-03 13:57     ` Matthew Wilcox
  2025-03-03 14:27   ` Matthew Wilcox
  1 sibling, 2 replies; 36+ messages in thread
From: Hannes Reinecke @ 2025-03-03 11:06 UTC (permalink / raw)
  To: Sagi Grimberg, Matthew Wilcox; +Cc: linux-nvme, linux-block, linux-mm

On 3/3/25 08:48, Hannes Reinecke wrote:
> On 2/28/25 11:47, Hannes Reinecke wrote:
>> Hi Sagi,
>>
>> enabling TLS on latest linus tree reliably crashes my system:
>>
>> [  487.018058] ------------[ cut here ]------------
>> [  487.024046] WARNING: CPU: 9 PID: 6159 at mm/slub.c:4719 
>> free_large_kmalloc+0x15/0xa0
[ .. ]
>>
>> Haven't found a culprit for that one for now, started bisecting.
>> Just wanted to report that as a heads-up, maybe you have some idea.
>>
> 
> bisect is pointing to
> 9aec2fb0fd5e ("slab: allocate frozen pages")
> and, indeed, reverting this patch on top of linus current resolves
> the issue.
> 
> Sorry Matthew.
> 
It's getting even worse; after reverting above patch I'm getting a crash
here:
[  968.315152] Oops: general protection fault, probably for
non-canonical address 0xdead000000000120: 0000 [#1] PREEMPT SMP NOPTI
[  968.328747] CPU: 30 UID: 0 PID: 665 Comm: kcompactd5 Kdump: loaded
Tainted: G        W   E      6.14.0-rc4-default+ #306
9ca11b70f9498982db3664c8471cfe00b0a16485
[  968.345747] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE
[  968.351913] Hardware name: Lenovo ThinkSystem SR655V3/SB27B09914, 
BIOS KAE111E-2.10 04/11/2023
[  968.362371] RIP: 0010:isolate_movable_page+0x7c/0x130
[  968.368826] Code: 02 75 3c f0 48 0f ba 2b 00 72 34 48 89 df e8 8b e0
f6 ff 84 c0 74 20 48 8b 03 a9 00 00 01 00 75 16 48 8b 43 18 89 ee 48 89
df <48> 8b 40 fe ff d0 0f 1f 00 84 c0 75 61 48 89 df e8 ff d8 f2 ff f0
[  968.390698] RSP: 0018:ff582840034c7bd0 EFLAGS: 00010246
[  968.397354] RAX: dead000000000122 RBX: ffc1af3dcf400000 RCX: 
ffc1af3dcf400034
[  968.406145] RDX: dead000000000101 RSI: 000000000000000c RDI: 
ffc1af3dcf400000
[  968.414950] RBP: 000000000000000c R08: 0000000000000000 R09: 
000000000f400000
[  968.423755] R10: 0000000000000400 R11: ff4187a00d995780 R12: 
00000000003d0000
[  968.432562] R13: ff582840034c7d30 R14: 0000000000000001 R15: 
0000000000000001
[  968.441365] FS:  0000000000000000(0000) GS:ff41879ffaa00000(0000) 
knlGS:0000000000000000
[  968.451245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  968.458488] CR2: 000055b303833c20 CR3: 000000005a838002 CR4: 
0000000000771ef0
[  968.467295] PKRU: 55555554
[  968.471120] Call Trace:
[  968.474655]  <TASK>
[  968.477804]  ? __die_body+0x1a/0x60
[  968.482521]  ? die_addr+0x38/0x60
[  968.487030]  ? exc_general_protection+0x19e/0x430
[  968.493115]  ? asm_exc_general_protection+0x22/0x30
[  968.499395]  ? isolate_movable_page+0x7c/0x130
[  968.505180]  isolate_migratepages_block+0x39a/0x1090
[  968.511555]  ? srso_alias_return_thunk+0x5/0xfbef5
[  968.517728]  ? update_curr+0x19e/0x220
[  968.522725]  compact_zone+0x368/0x1090
[  968.527722]  ? srso_alias_return_thunk+0x5/0xfbef5
[  968.533896]  compact_node+0xa8/0x120
[  968.538720]  kcompactd+0x21e/0x2b0

which again points straight into the 'allocate and free frozen pages'
patchset. Something's buggered there, and I'm not sure if further
bisecting will be getting us anywhere.

Matt?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.com                               +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-03 11:06   ` Hannes Reinecke
@ 2025-03-03 12:57     ` Hannes Reinecke
  2025-03-03 13:57     ` Matthew Wilcox
  1 sibling, 0 replies; 36+ messages in thread
From: Hannes Reinecke @ 2025-03-03 12:57 UTC (permalink / raw)
  To: Sagi Grimberg, Matthew Wilcox; +Cc: linux-nvme, linux-block, linux-mm

On 3/3/25 12:06, Hannes Reinecke wrote:
> On 3/3/25 08:48, Hannes Reinecke wrote:
>> On 2/28/25 11:47, Hannes Reinecke wrote:
>>> Hi Sagi,
>>>
>>> enabling TLS on latest linus tree reliably crashes my system:
>>>
>>> [  487.018058] ------------[ cut here ]------------
>>> [  487.024046] WARNING: CPU: 9 PID: 6159 at mm/slub.c:4719 
>>> free_large_kmalloc+0x15/0xa0
> [ .. ]
>>>
>>> Haven't found a culprit for that one for now, started bisecting.
>>> Just wanted to report that as a heads-up, maybe you have some idea.
>>>
>>
>> bisect is pointing to
>> 9aec2fb0fd5e ("slab: allocate frozen pages")
>> and, indeed, reverting this patch on top of linus current resolves
>> the issue.
>>
>> Sorry Matthew.
>>
> It's getting even worse; after reverting above patch I'm getting a crash
> here:
> [  968.315152] Oops: general protection fault, probably for
> non-canonical address 0xdead000000000120: 0000 [#1] PREEMPT SMP NOPTI
> [  968.328747] CPU: 30 UID: 0 PID: 665 Comm: kcompactd5 Kdump: loaded
> Tainted: G        W   E      6.14.0-rc4-default+ #306
> 9ca11b70f9498982db3664c8471cfe00b0a16485
> [  968.345747] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE
> [  968.351913] Hardware name: Lenovo ThinkSystem SR655V3/SB27B09914, 
> BIOS KAE111E-2.10 04/11/2023
> [  968.362371] RIP: 0010:isolate_movable_page+0x7c/0x130
> [  968.368826] Code: 02 75 3c f0 48 0f ba 2b 00 72 34 48 89 df e8 8b e0
> f6 ff 84 c0 74 20 48 8b 03 a9 00 00 01 00 75 16 48 8b 43 18 89 ee 48 89
> df <48> 8b 40 fe ff d0 0f 1f 00 84 c0 75 61 48 89 df e8 ff d8 f2 ff f0
> [  968.390698] RSP: 0018:ff582840034c7bd0 EFLAGS: 00010246
> [  968.397354] RAX: dead000000000122 RBX: ffc1af3dcf400000 RCX: 
> ffc1af3dcf400034
> [  968.406145] RDX: dead000000000101 RSI: 000000000000000c RDI: 
> ffc1af3dcf400000
> [  968.414950] RBP: 000000000000000c R08: 0000000000000000 R09: 
> 000000000f400000
> [  968.423755] R10: 0000000000000400 R11: ff4187a00d995780 R12: 
> 00000000003d0000
> [  968.432562] R13: ff582840034c7d30 R14: 0000000000000001 R15: 
> 0000000000000001
> [  968.441365] FS:  0000000000000000(0000) GS:ff41879ffaa00000(0000) 
> knlGS:0000000000000000
> [  968.451245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  968.458488] CR2: 000055b303833c20 CR3: 000000005a838002 CR4: 
> 0000000000771ef0
> [  968.467295] PKRU: 55555554
> [  968.471120] Call Trace:
> [  968.474655]  <TASK>
> [  968.477804]  ? __die_body+0x1a/0x60
> [  968.482521]  ? die_addr+0x38/0x60
> [  968.487030]  ? exc_general_protection+0x19e/0x430
> [  968.493115]  ? asm_exc_general_protection+0x22/0x30
> [  968.499395]  ? isolate_movable_page+0x7c/0x130
> [  968.505180]  isolate_migratepages_block+0x39a/0x1090
> [  968.511555]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  968.517728]  ? update_curr+0x19e/0x220
> [  968.522725]  compact_zone+0x368/0x1090
> [  968.527722]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  968.533896]  compact_node+0xa8/0x120
> [  968.538720]  kcompactd+0x21e/0x2b0
> 
> which again points straight into the 'allocate and free frozen pages'
> patchset. Something's buggered there, and I'm not sure if further
> bisecting will be getting us anywhere.
> 
And now it's even happening without TLS enabled. So it seems that
reverting the patch is not sufficient.

Some guidance would be good.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.com                               +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-03 11:06   ` Hannes Reinecke
  2025-03-03 12:57     ` Hannes Reinecke
@ 2025-03-03 13:57     ` Matthew Wilcox
  2025-03-03 14:05       ` Hannes Reinecke
  1 sibling, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2025-03-03 13:57 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Sagi Grimberg, linux-nvme, linux-block, linux-mm

On Mon, Mar 03, 2025 at 12:06:18PM +0100, Hannes Reinecke wrote:
> On 3/3/25 08:48, Hannes Reinecke wrote:
> > On 2/28/25 11:47, Hannes Reinecke wrote:
> > > Hi Sagi,
> > > 
> > > enabling TLS on latest linus tree reliably crashes my system:
> > > 
> > > [  487.018058] ------------[ cut here ]------------
> > > [  487.024046] WARNING: CPU: 9 PID: 6159 at mm/slub.c:4719
> > > free_large_kmalloc+0x15/0xa0
> [ .. ]
> > > 
> > > Haven't found a culprit for that one for now, started bisecting.
> > > Just wanted to report that as a heads-up, maybe you have some idea.
> > > 
> > 
> > bisect is pointing to
> > 9aec2fb0fd5e ("slab: allocate frozen pages")
> > and, indeed, reverting this patch on top of linus current resolves
> > the issue.
> > 
> > Sorry Matthew.
> > 
> It's getting even worse; after reverting above patch I'm getting a crash
> here:

If you revert that, you also need to revert 8c6e2d122b71.

But let me dig into the original problem.  The fact that it's
kmalloc_large might be a clue.



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-03 13:57     ` Matthew Wilcox
@ 2025-03-03 14:05       ` Hannes Reinecke
  0 siblings, 0 replies; 36+ messages in thread
From: Hannes Reinecke @ 2025-03-03 14:05 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Sagi Grimberg, linux-nvme, linux-block, linux-mm

On 3/3/25 14:57, Matthew Wilcox wrote:
> On Mon, Mar 03, 2025 at 12:06:18PM +0100, Hannes Reinecke wrote:
>> On 3/3/25 08:48, Hannes Reinecke wrote:
>>> On 2/28/25 11:47, Hannes Reinecke wrote:
>>>> Hi Sagi,
>>>>
>>>> enabling TLS on latest linus tree reliably crashes my system:
>>>>
>>>> [  487.018058] ------------[ cut here ]------------
>>>> [  487.024046] WARNING: CPU: 9 PID: 6159 at mm/slub.c:4719
>>>> free_large_kmalloc+0x15/0xa0
>> [ .. ]
>>>>
>>>> Haven't found a culprit for that one for now, started bisecting.
>>>> Just wanted to report that as a heads-up, maybe you have some idea.
>>>>
>>>
>>> bisect is pointing to
>>> 9aec2fb0fd5e ("slab: allocate frozen pages")
>>> and, indeed, reverting this patch on top of linus current resolves
>>> the issue.
>>>
>>> Sorry Matthew.
>>>
>> It's getting even worse; after reverting above patch I'm getting a crash
>> here:
> 
> If you revert that, you also need to revert 8c6e2d122b71.
> 
> But let me dig into the original problem.  The fact that it's
> kmalloc_large might be a clue.

Let me know if you need more details.

Incidentally, there's a blktest unit for TLS now:

https://github.com/osandov/blktests/pull/158

which should allow you to recreate the issue locally.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.com                               +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-03  7:48 ` Kernel oops with 6.14 when enabling TLS Hannes Reinecke
  2025-03-03 11:06   ` Hannes Reinecke
@ 2025-03-03 14:27   ` Matthew Wilcox
  2025-03-03 14:42     ` Matthew Wilcox
  1 sibling, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2025-03-03 14:27 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Sagi Grimberg, linux-nvme, linux-block, linux-mm

On Mon, Mar 03, 2025 at 08:48:09AM +0100, Hannes Reinecke wrote:
> On 2/28/25 11:47, Hannes Reinecke wrote:
> > Hi Sagi,
> > 
> > enabling TLS on latest linus tree reliably crashes my system:
> > 
> > [  487.018058] ------------[ cut here ]------------
> > [  487.024046] WARNING: CPU: 9 PID: 6159 at mm/slub.c:4719
> > free_large_kmalloc+0x15/0xa0

That's:

        if (WARN_ON_ONCE(order == 0))
                pr_warn_once("object pointer: 0x%p\n", object);

And while the object pointer is obfuscated (hashed pointers), this
wouldn't be helpful in trying to track down the problem.  Perhaps
we could make this a VM_WARN_ON_ONCE_FOLIO() so we get the dump_page()?

I'm tempted to believe this is a double-free, but then I'm not sure why
it'd be triggered by this patch.

> > [  487.296801]  kfree+0x234/0x320
> > [  487.332084]  nvmf_connect_admin_queue+0x105/0x1a0 [nvme_fabrics
> > 34d997d53c805aa2fae8e8baee6a736e8da38358]
> > [  487.332093]  nvme_tcp_start_queue+0x18f/0x310 [nvme_tcp
> > 68f6be106f52ac467179f8a0922f02aeb6fa1f1c]
> > [  487.332102]  nvme_tcp_setup_ctrl+0xf8/0x700 [nvme_tcp
> > 68f6be106f52ac467179f8a0922f02aeb6fa1f1c]
> > [  487.394495]  nvme_tcp_create_ctrl+0x2e3/0x4d0 [nvme_tcp
> > 68f6be106f52ac467179f8a0922f02aeb6fa1f1c]
> > [  487.394503]  nvmf_dev_write+0x323/0x3d0 [nvme_fabrics
> > 34d997d53c805aa2fae8e8baee6a736e8da38358]
> > [  487.394514]  vfs_write+0xd9/0x430
> > [  487.551642] object pointer: 0x00000000346cb6fc

Oh, wait, that's not the crash!

We continue to free the folio.  Even though we hit the "can't happen"
case.  That's dangerous.

> > [  489.405197] Oops: general protection fault, probably for
> > non-canonical address 0xdead000000000100: 0000 [#1] PREEMPT SMP NOPTI

I think we all recognise that as list poison.  I bet this is a double-free.

Or it could be a wild-free.  I mean, look at kfree():

        folio = virt_to_folio(object);
        if (unlikely(!folio_test_slab(folio))) {
                free_large_kmalloc(folio, (void *)object);
                return;
        }

So if you call kfree() on a random pointer, chances are it's not part
of slab, and we jump into the free_large_kmalloc() path.

We have a _lot_ of page types available.  We should mark large kmallocs
as such.  I'll send a patch to do that.




* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-03 14:27   ` Matthew Wilcox
@ 2025-03-03 14:42     ` Matthew Wilcox
  2025-03-03 15:12       ` Vlastimil Babka
  2025-03-03 15:39       ` Hannes Reinecke
  0 siblings, 2 replies; 36+ messages in thread
From: Matthew Wilcox @ 2025-03-03 14:42 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Vlastimil Babka, Sagi Grimberg, linux-nvme, linux-block, linux-mm

On Mon, Mar 03, 2025 at 02:27:06PM +0000, Matthew Wilcox wrote:
> We have a _lot_ of page types available.  We should mark large kmallocs
> as such.  I'll send a patch to do that.

Can you try this?  It should fix the crash, at least.  Not sure why the
frozen patch triggered it.

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 36d283552f80..df9234e5f478 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -925,14 +925,15 @@ FOLIO_FLAG_FALSE(has_hwpoisoned)
 enum pagetype {
 	/* 0x00-0x7f are positive numbers, ie mapcount */
 	/* Reserve 0x80-0xef for mapcount overflow. */
-	PGTY_buddy	= 0xf0,
-	PGTY_offline	= 0xf1,
-	PGTY_table	= 0xf2,
-	PGTY_guard	= 0xf3,
-	PGTY_hugetlb	= 0xf4,
-	PGTY_slab	= 0xf5,
-	PGTY_zsmalloc	= 0xf6,
-	PGTY_unaccepted	= 0xf7,
+	PGTY_buddy		= 0xf0,
+	PGTY_offline		= 0xf1,
+	PGTY_table		= 0xf2,
+	PGTY_guard		= 0xf3,
+	PGTY_hugetlb		= 0xf4,
+	PGTY_slab		= 0xf5,
+	PGTY_zsmalloc		= 0xf6,
+	PGTY_unaccepted		= 0xf7,
+	PGTY_large_kmalloc	= 0xf8,
 
 	PGTY_mapcount_underflow = 0xff
 };
@@ -1075,6 +1076,7 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
  * Serialized with zone lock.
  */
 PAGE_TYPE_OPS(Unaccepted, unaccepted, unaccepted)
+FOLIO_TYPE_OPS(large_kmalloc, large_kmalloc)
 
 /**
  * PageHuge - Determine if the page belongs to hugetlbfs
diff --git a/mm/slub.c b/mm/slub.c
index 1f50129dcfb3..872e1bab3bd1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4241,6 +4241,7 @@ static void *___kmalloc_large_node(size_t size, gfp_t flags, int node)
 		ptr = folio_address(folio);
 		lruvec_stat_mod_folio(folio, NR_SLAB_UNRECLAIMABLE_B,
 				      PAGE_SIZE << order);
+		__folio_set_large_kmalloc(folio);
 	}
 
 	ptr = kasan_kmalloc_large(ptr, size, flags);
@@ -4716,6 +4717,11 @@ static void free_large_kmalloc(struct folio *folio, void *object)
 {
 	unsigned int order = folio_order(folio);
 
+	if (WARN_ON_ONCE(!folio_test_large_kmalloc(folio))) {
+		dump_page(&folio->page, "Not a kmalloc allocation");
+		return;
+	}
+
 	if (WARN_ON_ONCE(order == 0))
 		pr_warn_once("object pointer: 0x%p\n", object);
 
@@ -4725,6 +4731,7 @@ static void free_large_kmalloc(struct folio *folio, void *object)
 
 	lruvec_stat_mod_folio(folio, NR_SLAB_UNRECLAIMABLE_B,
 			      -(PAGE_SIZE << order));
+	__folio_clear_large_kmalloc(folio);
 	folio_put(folio);
 }
 



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-03 14:42     ` Matthew Wilcox
@ 2025-03-03 15:12       ` Vlastimil Babka
  2025-03-03 15:39       ` Hannes Reinecke
  1 sibling, 0 replies; 36+ messages in thread
From: Vlastimil Babka @ 2025-03-03 15:12 UTC (permalink / raw)
  To: Matthew Wilcox, Hannes Reinecke
  Cc: Sagi Grimberg, linux-nvme, linux-block, linux-mm

On 3/3/25 15:42, Matthew Wilcox wrote:
> On Mon, Mar 03, 2025 at 02:27:06PM +0000, Matthew Wilcox wrote:
>> We have a _lot_ of page types available.  We should mark large kmallocs
>> as such.  I'll send a patch to do that.
> 
> Can you try this?  It should fix the crash, at least.  Not sure why the
> frozen patch triggered it.

Having CONFIG_PAGE_OWNER and booting with page_owner=on could make the dump
more useful too: it tells you where the dumped page was last allocated and
freed.

In addition CONFIG_DEBUG_PAGEALLOC and booting with debug_pagealloc=on could
potentially catch a use-after-free sooner (if something accesses the page
via direct mapping after it's kfree()'d).

But if this is due to a tight race, it could also mask the bug etc.

> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 36d283552f80..df9234e5f478 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -925,14 +925,15 @@ FOLIO_FLAG_FALSE(has_hwpoisoned)
>  enum pagetype {
>  	/* 0x00-0x7f are positive numbers, ie mapcount */
>  	/* Reserve 0x80-0xef for mapcount overflow. */
> -	PGTY_buddy	= 0xf0,
> -	PGTY_offline	= 0xf1,
> -	PGTY_table	= 0xf2,
> -	PGTY_guard	= 0xf3,
> -	PGTY_hugetlb	= 0xf4,
> -	PGTY_slab	= 0xf5,
> -	PGTY_zsmalloc	= 0xf6,
> -	PGTY_unaccepted	= 0xf7,
> +	PGTY_buddy		= 0xf0,
> +	PGTY_offline		= 0xf1,
> +	PGTY_table		= 0xf2,
> +	PGTY_guard		= 0xf3,
> +	PGTY_hugetlb		= 0xf4,
> +	PGTY_slab		= 0xf5,
> +	PGTY_zsmalloc		= 0xf6,
> +	PGTY_unaccepted		= 0xf7,
> +	PGTY_large_kmalloc	= 0xf8,
>  
>  	PGTY_mapcount_underflow = 0xff
>  };
> @@ -1075,6 +1076,7 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
>   * Serialized with zone lock.
>   */
>  PAGE_TYPE_OPS(Unaccepted, unaccepted, unaccepted)
> +FOLIO_TYPE_OPS(large_kmalloc, large_kmalloc)
>  
>  /**
>   * PageHuge - Determine if the page belongs to hugetlbfs
> diff --git a/mm/slub.c b/mm/slub.c
> index 1f50129dcfb3..872e1bab3bd1 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4241,6 +4241,7 @@ static void *___kmalloc_large_node(size_t size, gfp_t flags, int node)
>  		ptr = folio_address(folio);
>  		lruvec_stat_mod_folio(folio, NR_SLAB_UNRECLAIMABLE_B,
>  				      PAGE_SIZE << order);
> +		__folio_set_large_kmalloc(folio);
>  	}
>  
>  	ptr = kasan_kmalloc_large(ptr, size, flags);
> @@ -4716,6 +4717,11 @@ static void free_large_kmalloc(struct folio *folio, void *object)
>  {
>  	unsigned int order = folio_order(folio);
>  
> +	if (WARN_ON_ONCE(!folio_test_large_kmalloc(folio))) {
> +		dump_page(&folio->page, "Not a kmalloc allocation");
> +		return;
> +	}
> +
>  	if (WARN_ON_ONCE(order == 0))
>  		pr_warn_once("object pointer: 0x%p\n", object);
>  
> @@ -4725,6 +4731,7 @@ static void free_large_kmalloc(struct folio *folio, void *object)
>  
>  	lruvec_stat_mod_folio(folio, NR_SLAB_UNRECLAIMABLE_B,
>  			      -(PAGE_SIZE << order));
> +	__folio_clear_large_kmalloc(folio);
>  	folio_put(folio);
>  }
>  




* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-03 14:42     ` Matthew Wilcox
  2025-03-03 15:12       ` Vlastimil Babka
@ 2025-03-03 15:39       ` Hannes Reinecke
  2025-03-03 15:48         ` Matthew Wilcox
  1 sibling, 1 reply; 36+ messages in thread
From: Hannes Reinecke @ 2025-03-03 15:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Vlastimil Babka, Sagi Grimberg, linux-nvme, linux-block, linux-mm

On 3/3/25 15:42, Matthew Wilcox wrote:
> On Mon, Mar 03, 2025 at 02:27:06PM +0000, Matthew Wilcox wrote:
>> We have a _lot_ of page types available.  We should mark large kmallocs
>> as such.  I'll send a patch to do that.
> 
> Can you try this?  It should fix the crash, at least.  Not sure why the
> frozen patch triggered it.
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 36d283552f80..df9234e5f478 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -925,14 +925,15 @@ FOLIO_FLAG_FALSE(has_hwpoisoned)
>   enum pagetype {
>   	/* 0x00-0x7f are positive numbers, ie mapcount */
>   	/* Reserve 0x80-0xef for mapcount overflow. */
> -	PGTY_buddy	= 0xf0,
> -	PGTY_offline	= 0xf1,
> -	PGTY_table	= 0xf2,
> -	PGTY_guard	= 0xf3,
> -	PGTY_hugetlb	= 0xf4,
> -	PGTY_slab	= 0xf5,
> -	PGTY_zsmalloc	= 0xf6,
> -	PGTY_unaccepted	= 0xf7,
> +	PGTY_buddy		= 0xf0,
> +	PGTY_offline		= 0xf1,
> +	PGTY_table		= 0xf2,
> +	PGTY_guard		= 0xf3,
> +	PGTY_hugetlb		= 0xf4,
> +	PGTY_slab		= 0xf5,
> +	PGTY_zsmalloc		= 0xf6,
> +	PGTY_unaccepted		= 0xf7,
> +	PGTY_large_kmalloc	= 0xf8,
>   
>   	PGTY_mapcount_underflow = 0xff
>   };
> @@ -1075,6 +1076,7 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
>    * Serialized with zone lock.
>    */
>   PAGE_TYPE_OPS(Unaccepted, unaccepted, unaccepted)
> +FOLIO_TYPE_OPS(large_kmalloc, large_kmalloc)
>   
>   /**
>    * PageHuge - Determine if the page belongs to hugetlbfs
> diff --git a/mm/slub.c b/mm/slub.c
> index 1f50129dcfb3..872e1bab3bd1 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4241,6 +4241,7 @@ static void *___kmalloc_large_node(size_t size, gfp_t flags, int node)
>   		ptr = folio_address(folio);
>   		lruvec_stat_mod_folio(folio, NR_SLAB_UNRECLAIMABLE_B,
>   				      PAGE_SIZE << order);
> +		__folio_set_large_kmalloc(folio);
>   	}
>   
>   	ptr = kasan_kmalloc_large(ptr, size, flags);
> @@ -4716,6 +4717,11 @@ static void free_large_kmalloc(struct folio *folio, void *object)
>   {
>   	unsigned int order = folio_order(folio);
>   
> +	if (WARN_ON_ONCE(!folio_test_large_kmalloc(folio))) {
> +		dump_page(&folio->page, "Not a kmalloc allocation");
> +		return;
> +	}
> +
>   	if (WARN_ON_ONCE(order == 0))
>   		pr_warn_once("object pointer: 0x%p\n", object);
>   
> @@ -4725,6 +4731,7 @@ static void free_large_kmalloc(struct folio *folio, void *object)
>   
>   	lruvec_stat_mod_folio(folio, NR_SLAB_UNRECLAIMABLE_B,
>   			      -(PAGE_SIZE << order));
> +	__folio_clear_large_kmalloc(folio);
>   	folio_put(folio);
>   }
>   

Still crashes:

[   63.561990] nvme nvme1: queue 0: failed to start TLS: -3
[   63.652070] ------------[ cut here ]------------
[   63.658068] WARNING: CPU: 6 PID: 5216 at mm/slub.c:4720 
free_large_kmalloc+0x89/0xa0
[   63.667579] Modules linked in: tls(E) nvme_tcp(E) af_packet(E) 
iscsi_ibft(E) iscsi_boot_sysfs(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) 
fat(E) ipmi_ssif(E) amd_atl(E) intel_rapl_msr(E) xfs(E) 
intel_rapl_common(E) amd64_edac(E) edac_mce_amd(E) dax_hmem(E) 
cxl_acpi(E) tg3(E) cxl_port(E) kvm_amd(E) ipmi_si(E) cxl_core(E) i40e(E) 
i2c_piix4(E) ipmi_devintf(E) einj(E) kvm(E) acpi_cpufreq(E) wmi_bmof(E) 
libphy(E) libie(E) k10temp(E) i2c_smbus(E) ipmi_msghandler(E) 
i2c_designware_platform(E) joydev(E) i2c_designware_core(E) button(E) 
nvme_fabrics(E) nvme_keyring(E) fuse(E) efi_pstore(E) configfs(E) 
dmi_sysfs(E) ip_tables(E) x_tables(E) hid_generic(E) usbhid(E) ahci(E) 
libahci(E) ghash_clmulni_intel(E) libata(E) sha512_ssse3(E) 
sha256_ssse3(E) sd_mod(E) ast(E) sha1_ssse3(E) scsi_dh_emc(E) 
drm_client_lib(E) scsi_dh_rdac(E) i2c_algo_bit(E) aesni_intel(E) 
xhci_pci(E) scsi_dh_alua(E) drm_shmem_helper(E) crypto_simd(E) 
drm_kms_helper(E) cryptd(E) sg(E) nvme(E) xhci_hcd(E) nvme_core(E) 
scsi_mod(E) drm(E) nvme_auth(E) scsi_common(E)
[   63.667703]  usbcore(E) ccp(E) sp5100_tco(E) wmi(E) btrfs(E) 
blake2b_generic(E) xor(E) raid6_pq(E) efivarfs(E)
[   63.667717] CPU: 6 UID: 0 PID: 5216 Comm: nvme Kdump: loaded Tainted: 
G        W   E      6.14.0-rc4-default+ #308 
190df031934d7fa516e6fdc38148e19d2fe48841
[   63.667724] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE
[   63.667726] Hardware name: Lenovo ThinkSystem SR655V3/SB27B09914, 
BIOS KAE111E-2.10 04/11/2023
[   63.667728] RIP: 0010:free_large_kmalloc+0x89/0xa0
[   63.667733] Code: df e8 5b ff ff ff 83 7b 30 ff 74 07 c7 43 30 ff ff 
ff ff f0 ff 4b 34 74 07 5b 5d e9 2c 08 b4 ff 48 89 df 5b 5d e9 57 05 f8 
ff <0f> 0b 48 89 df 48 c7 c6 9e bc 5b 82 5b 5d e9 a4 0b fb ff 0f 1f 40
[   63.842753] RSP: 0018:ff824cf0c3307c08 EFLAGS: 00010202
[   63.842757] RAX: 00000000000000ff RBX: fffb0b48c0178e80 RCX: 
ff45d99f856df580
[   63.842759] RDX: 0000000000000000 RSI: ff45d99f85e3a800 RDI: 
fffb0b48c0178e80
[   63.842760] RBP: 00000000fffff000 R08: 0000000000000001 R09: 
0000000000000101
[   63.842762] R10: ff824cf0c3307c90 R11: 0000000000000001 R12: 
fffb0b48c0178e80
[   63.842764] R13: ff45d99f85e3a800 R14: ff45d9a1d3622a30 R15: 
ff45d9a1e9b58000
[   63.842765] FS:  00007f5f53015800(0000) GS:ff45d9a24d800000(0000) 
knlGS:0000000000000000
[   63.842767] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   63.842769] CR2: 00007f5f52f50d7e CR3: 00000002538d0001 CR4: 
0000000000771ef0
[   63.842771] PKRU: 55555554
[   63.842773] Call Trace:
[   63.842776]  <TASK>
[   63.842781]  ? __warn+0x85/0x130
[   63.934362]  ? free_large_kmalloc+0x89/0xa0
[   63.934369]  ? report_bug+0xf8/0x1e0
[   63.934376]  ? handle_bug+0x50/0xa0
[   63.934381]  ? exc_invalid_op+0x13/0x60
[   63.934385]  ? asm_exc_invalid_op+0x16/0x20
[   63.934394]  ? free_large_kmalloc+0x89/0xa0
[   63.934398]  kfree+0x2a5/0x340
[   63.934403]  ? srso_alias_return_thunk+0x5/0xfbef5
[   63.934409]  ? nvmf_connect_admin_queue+0x105/0x1a0 [nvme_fabrics 
18bfa9223bf0bd1ec571f5f45774adcc919a867e]
[   63.987625]  ? nvmf_connect_admin_queue+0xa1/0x1a0 [nvme_fabrics 
18bfa9223bf0bd1ec571f5f45774adcc919a867e]
[   63.987632]  nvmf_connect_admin_queue+0x105/0x1a0 [nvme_fabrics 
18bfa9223bf0bd1ec571f5f45774adcc919a867e]
[   63.987641]  nvme_tcp_start_queue+0x192/0x310 [nvme_tcp 
a0629454ac5200d03b72a09e4d2b1e27dfa113e9]
[   63.987649]  nvme_tcp_setup_ctrl+0xf8/0x700 [nvme_tcp 
a0629454ac5200d03b72a09e4d2b1e27dfa113e9]
[   63.987655]  ? nvme_change_ctrl_state+0x99/0x1b0 [nvme_core 
4c8074fe8bffb31437bcdb2c4a45c7e74861c83b]
[   64.043323]  nvme_tcp_create_ctrl+0x2e3/0x4d0 [nvme_tcp 
a0629454ac5200d03b72a09e4d2b1e27dfa113e9]
[   64.043332]  nvmf_dev_write+0x323/0x3d0 [nvme_fabrics 
18bfa9223bf0bd1ec571f5f45774adcc919a867e]
[   64.043338]  ? srso_alias_return_thunk+0x5/0xfbef5
[   64.043344]  vfs_write+0xd9/0x430
[   64.043349]  ? syscall_exit_to_user_mode+0xc/0x200
[   64.043355]  ? srso_alias_return_thunk+0x5/0xfbef5
[   64.043358]  ? do_syscall_64+0x81/0x160
[   64.043363]  ? srso_alias_return_thunk+0x5/0xfbef5
[   64.043367]  ksys_write+0x68/0xe0
[   64.043371]  do_syscall_64+0x74/0x160
[   64.108416]  ? srso_alias_return_thunk+0x5/0xfbef5
[   64.108422]  ? exc_page_fault+0x68/0x150
[   64.108428]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   64.108433] RIP: 0033:0x7f5f52f216f0
[   64.108438] Code: 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 
90 90 90 90 90 90 90 90 90 80 3d 19 c3 0e 00 00 74 17 b8 01 00 00 00 0f 
05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
[   64.108440] RSP: 002b:00007ffe45dc8bb8 EFLAGS: 00000202 ORIG_RAX: 
0000000000000001
[   64.108443] RAX: ffffffffffffffda RBX: 00005608dbad8980 RCX: 
00007f5f52f216f0
[   64.108445] RDX: 00000000000000ed RSI: 00005608dbad8980 RDI: 
0000000000000003
[   64.108447] RBP: 0000000000000003 R08: 00000000000000ed R09: 
00005608dbad8980
[   64.108448] R10: 0000000000000000 R11: 0000000000000202 R12: 
00000000000000ed
[   64.108450] R13: 00007f5f5307b008 R14: 00005608dbad1f50 R15: 
00005608dbad26d0
[   64.108456]  </TASK>
[   64.108458] ---[ end trace 0000000000000000 ]---
[   64.108461] page: refcount:0 mapcount:0 mapping:0000000000000000 
index:0x2 pfn:0x5e3a
[   64.108465] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
[   64.108469] raw: 000fffffc0000000 0000000000000000 fffb0b48c0178e90 
0000000000000000
[   64.108472] raw: 0000000000000002 0000000000000000 00000000ffffffff 
0000000000000000
[   64.108473] page dumped because: Not a kmalloc allocation
[   64.112317] nvme nvme1: creating 32 I/O queues.
[   66.074182] nvme nvme1: mapped 32/0/0 default/read/poll queues.
[   66.084156] page: refcount:0 mapcount:0 mapping:0000000000000000 
index:0x0 pfn:0x5de5
[   66.093770] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
[   66.101810] raw: 000fffffc0000000 0000000000000000 dead000000000122 
0000000000000000
[   66.111311] raw: 0000000000000000 0000000000000000 00000000ffffffff 
0000000000000000
[   66.111314] page dumped because: Not a kmalloc allocation
[   66.112001] page: refcount:0 mapcount:0 mapping:0000000000000000 
index:0xdc pfn:0x5de3
[   66.137452] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
[   66.137460] raw: 000fffffc0000000 ff45d9a24d93f420 ff45d9a24d93f420 
0000000000000000
[   66.137464] raw: 00000000000000dc 0000000000000000 00000000ffffffff 
0000000000000000
[   66.137466] page dumped because: Not a kmalloc allocation
[   66.138095] page: refcount:0 mapcount:0 mapping:0000000000000000 
index:0x0 pfn:0x5de5
[   66.180944] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
[   66.180950] raw: 000fffffc0000000 ff45d9a24da3f420 ff45d9a24da3f420 
0000000000000000
[   66.180953] raw: 0000000000000000 0000000000000000 00000000ffffffff 
0000000000000000
[   66.180954] page dumped because: Not a kmalloc allocation
[   66.181672] BUG: unable to handle page fault for address: 
ff40e4ea8fa50250
[   66.223318] #PF: supervisor read access in kernel mode
[   66.223320] #PF: error_code(0x0000) - not-present page
[   66.223322] PGD 0
[   66.223325] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[   66.223330] CPU: 73 UID: 0 PID: 5216 Comm: nvme Kdump: loaded 
Tainted: G        W   E      6.14.0-rc4-default+ #308 
190df031934d7fa516e6fdc38148e19d2fe48841
[   66.223337] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE
[   66.223338] Hardware name: Lenovo ThinkSystem SR655V3/SB27B09914, 
BIOS KAE111E-2.10 04/11/2023
[   66.223340] RIP: 0010:kfree+0xea/0x340
[   66.223348] Code: b9 92 01 48 89 da 48 29 c2 48 81 fa ff ff 1f 00 77 
09 48 85 c0 0f 85 95 01 00 00 48 c7 c7 18 ba 01 00 49 8b 45 00 65 48 03 
07 <48> 8b 50 08 4c 39 60 10 0f 85 bb 01 00 00 41 8b 4d 28 48 8b 00 48
[   66.223350] RSP: 0018:ff824cf0c3307bf0 EFLAGS: 00010287
[   66.223353] RAX: ff40e4ea8fa50248 RBX: ff45d99f85de0400 RCX: 
0000000000000001
[   66.223355] RDX: ff45d99f85de0400 RSI: ffffffffc1b3028a RDI: 
000000000001ba18
[   66.223357] RBP: ff824cf0c3307c40 R08: 0000000000000001 R09: 
0000000000000008
[   66.223359] R10: ff824cf0c3307c58 R11: 0000000000000001 R12: 
fffb0b48c0177800
[   66.347600] R13: fffb0b48c024ce88 R14: 0000000000000004 R15: 
ffffffffc1b3028a
[   66.347604] FS:  00007f5f53015800(0000) GS:ff45d9a1cdb80000(0000) 
knlGS:0000000000000000
[   66.347607] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   66.347609] CR2: ff40e4ea8fa50250 CR3: 00000002538d0001 CR4: 
0000000000771ef0
[   66.347611] PKRU: 55555554
[   66.347613] Call Trace:
[   66.347616]  <TASK>
[   66.347621]  ? __die_body+0x1a/0x60
[   66.347629]  ? page_fault_oops+0x132/0x4b0
[   66.347634]  ? search_module_extables+0x15/0x60
[   66.347640]  ? srso_alias_return_thunk+0x5/0xfbef5
[   66.347644]  ? search_bpf_extables+0x65/0x70
[   66.347649]  ? srso_alias_return_thunk+0x5/0xfbef5
[   66.347654]  ? exc_page_fault+0xb0/0x150
[   66.347660]  ? asm_exc_page_fault+0x22/0x30
[   66.347667]  ? nvmf_connect_io_queue+0xfa/0x1c0 [nvme_fabrics 
18bfa9223bf0bd1ec571f5f45774adcc919a867e]
[   66.347676]  ? nvmf_connect_io_queue+0xfa/0x1c0 [nvme_fabrics 
18bfa9223bf0bd1ec571f5f45774adcc919a867e]
[   66.460070]  ? kfree+0xea/0x340
[   66.460079]  ? srso_alias_return_thunk+0x5/0xfbef5
[   66.460086]  ? nvmf_connect_io_queue+0xfa/0x1c0 [nvme_fabrics 
18bfa9223bf0bd1ec571f5f45774adcc919a867e]
[   66.460091]  nvmf_connect_io_queue+0xfa/0x1c0 [nvme_fabrics 
18bfa9223bf0bd1ec571f5f45774adcc919a867e]
[   66.460102]  nvme_tcp_start_queue+0x166/0x310 [nvme_tcp 
a0629454ac5200d03b72a09e4d2b1e27dfa113e9]
[   66.460111]  nvme_tcp_start_io_queues+0x32/0x80 [nvme_tcp 
a0629454ac5200d03b72a09e4d2b1e27dfa113e9]
[   66.460117]  nvme_tcp_setup_ctrl+0x421/0x700 [nvme_tcp 
a0629454ac5200d03b72a09e4d2b1e27dfa113e9]
[   66.525407]  nvme_tcp_create_ctrl+0x2e3/0x4d0 [nvme_tcp 
a0629454ac5200d03b72a09e4d2b1e27dfa113e9]
[   66.525416]  nvmf_dev_write+0x323/0x3d0 [nvme_fabrics 
18bfa9223bf0bd1ec571f5f45774adcc919a867e]
[   66.525422]  ? srso_alias_return_thunk+0x5/0xfbef5
[   66.525429]  vfs_write+0xd9/0x430
[   66.525434]  ? syscall_exit_to_user_mode+0xc/0x200
[   66.525441]  ? srso_alias_return_thunk+0x5/0xfbef5
[   66.525443]  ? do_syscall_64+0x81/0x160
[   66.525449]  ? srso_alias_return_thunk+0x5/0xfbef5
[   66.525453]  ksys_write+0x68/0xe0
[   66.585510]  do_syscall_64+0x74/0x160
[   66.585516]  ? srso_alias_return_thunk+0x5/0xfbef5
[   66.585518]  ? exc_page_fault+0x68/0x150
[   66.585523]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   66.585528] RIP: 0033:0x7f5f52f216f0
[   66.585532] Code: 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 
90 90 90 90 90 90 90 90 90 80 3d 19 c3 0e 00 00 74 17 b8 01 00 00 00 0f 
05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
[   66.585534] RSP: 002b:00007ffe45dc8bb8 EFLAGS: 00000202 ORIG_RAX: 
0000000000000001
[   66.585538] RAX: ffffffffffffffda RBX: 00005608dbad8980 RCX: 
00007f5f52f216f0
[   66.585540] RDX: 00000000000000ed RSI: 00005608dbad8980 RDI: 
0000000000000003
[   66.585541] RBP: 0000000000000003 R08: 00000000000000ed R09: 
00005608dbad8980
[   66.585543] R10: 0000000000000000 R11: 0000000000000202 R12: 
00000000000000ed
[   66.585545] R13: 00007f5f5307b008 R14: 00005608dbad1f50 R15: 
00005608dbad26d0
[   66.585552]  </TASK>
[   66.691445] Modules linked in: tls(E) nvme_tcp(E) af_packet(E) 
iscsi_ibft(E) iscsi_boot_sysfs(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) 
fat(E) ipmi_ssif(E) amd_atl(E) intel_rapl_msr(E) xfs(E) 
intel_rapl_common(E) amd64_edac(E) edac_mce_amd(E) dax_hmem(E) 
cxl_acpi(E) tg3(E) cxl_port(E) kvm_amd(E) ipmi_si(E) cxl_core(E) i40e(E) 
i2c_piix4(E) ipmi_devintf(E) einj(E) kvm(E) acpi_cpufreq(E) wmi_bmof(E) 
libphy(E) libie(E) k10temp(E) i2c_smbus(E) ipmi_msghandler(E) 
i2c_designware_platform(E) joydev(E) i2c_designware_core(E) button(E) 
nvme_fabrics(E) nvme_keyring(E) fuse(E) efi_pstore(E) configfs(E) 
dmi_sysfs(E) ip_tables(E) x_tables(E) hid_generic(E) usbhid(E) ahci(E) 
libahci(E) ghash_clmulni_intel(E) libata(E) sha512_ssse3(E) 
sha256_ssse3(E) sd_mod(E) ast(E) sha1_ssse3(E) scsi_dh_emc(E) 
drm_client_lib(E) scsi_dh_rdac(E) i2c_algo_bit(E) aesni_intel(E) 
xhci_pci(E) scsi_dh_alua(E) drm_shmem_helper(E) crypto_simd(E) 
drm_kms_helper(E) cryptd(E) sg(E) nvme(E) xhci_hcd(E) nvme_core(E) 
scsi_mod(E) drm(E) nvme_auth(E) scsi_common(E)
[   66.691553]  usbcore(E) ccp(E) sp5100_tco(E) wmi(E) btrfs(E) 
blake2b_generic(E) xor(E) raid6_pq(E) efivarfs(E)
[   66.793438] CR2: ff40e4ea8fa50250

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.com                               +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-03 15:39       ` Hannes Reinecke
@ 2025-03-03 15:48         ` Matthew Wilcox
  2025-03-03 16:15           ` Vlastimil Babka
  0 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2025-03-03 15:48 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Vlastimil Babka, Sagi Grimberg, linux-nvme, linux-block, linux-mm

On Mon, Mar 03, 2025 at 04:39:47PM +0100, Hannes Reinecke wrote:
> On 3/3/25 15:42, Matthew Wilcox wrote:
> > On Mon, Mar 03, 2025 at 02:27:06PM +0000, Matthew Wilcox wrote:
> > > We have a _lot_ of page types available.  We should mark large kmallocs
> > > as such.  I'll send a patch to do that.
> > 
> > Can you try this?  It should fix the crash, at least.  Not sure why the
> > frozen patch triggered it.
> 
> Still crashes:

It warns, but doesn't crash!  This is an improvement.

> [   63.658068] WARNING: CPU: 6 PID: 5216 at mm/slub.c:4720
> free_large_kmalloc+0x89/0xa0
> [   63.667728] RIP: 0010:free_large_kmalloc+0x89/0xa0
> [   63.842773] Call Trace:
> [   63.934398]  kfree+0x2a5/0x340
> [   63.987632]  nvmf_connect_admin_queue+0x105/0x1a0 [nvme_fabrics
> 18bfa9223bf0bd1ec571f5f45774adcc919a867e]
> [   63.987641]  nvme_tcp_start_queue+0x192/0x310 [nvme_tcp
> a0629454ac5200d03b72a09e4d2b1e27dfa113e9]
> [   63.987649]  nvme_tcp_setup_ctrl+0xf8/0x700 [nvme_tcp
> a0629454ac5200d03b72a09e4d2b1e27dfa113e9]
> [   64.043323]  nvme_tcp_create_ctrl+0x2e3/0x4d0 [nvme_tcp
> a0629454ac5200d03b72a09e4d2b1e27dfa113e9]
> [   64.043332]  nvmf_dev_write+0x323/0x3d0 [nvme_fabrics
> 18bfa9223bf0bd1ec571f5f45774adcc919a867e]
> [   64.043344]  vfs_write+0xd9/0x430
> [   64.108458] ---[ end trace 0000000000000000 ]---
> [   64.108461] page: refcount:0 mapcount:0 mapping:0000000000000000
> index:0x2 pfn:0x5e3a
> [   64.108465] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> [   64.108469] raw: 000fffffc0000000 0000000000000000 fffb0b48c0178e90
> 0000000000000000
> [   64.108472] raw: 0000000000000002 0000000000000000 00000000ffffffff
> 0000000000000000
> [   64.108473] page dumped because: Not a kmalloc allocation

Right.  So you called kfree() on something that isn't currently
kmalloced memory.  Either it used to be kmalloced memory and we freed
the slab that it used to be in, or it's a wild pointer.  Whichever
it is, that's a bug in the caller, not in slab.

Why it bisected to that commit, I can't say.  Maybe it changed the
timing, or maybe it was just luck (whether the allocation which is now
being freed is the last allocation in the slab or not).

> [   66.084156] page: refcount:0 mapcount:0 mapping:0000000000000000
> index:0x0 pfn:0x5de5
> [   66.093770] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> [   66.101810] raw: 000fffffc0000000 0000000000000000 dead000000000122
> 0000000000000000
> [   66.111311] raw: 0000000000000000 0000000000000000 00000000ffffffff
> 0000000000000000
> [   66.111314] page dumped because: Not a kmalloc allocation
> [   66.112001] page: refcount:0 mapcount:0 mapping:0000000000000000
> index:0xdc pfn:0x5de3
> [   66.137452] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> [   66.137460] raw: 000fffffc0000000 ff45d9a24d93f420 ff45d9a24d93f420
> 0000000000000000
> [   66.137464] raw: 00000000000000dc 0000000000000000 00000000ffffffff
> 0000000000000000

It happened again ;-)

> [   66.137466] page dumped because: Not a kmalloc allocation
> [   66.138095] page: refcount:0 mapcount:0 mapping:0000000000000000
> index:0x0 pfn:0x5de5
> [   66.180944] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> [   66.180950] raw: 000fffffc0000000 ff45d9a24da3f420 ff45d9a24da3f420
> 0000000000000000
> [   66.180953] raw: 0000000000000000 0000000000000000 00000000ffffffff
> 0000000000000000
> [   66.180954] page dumped because: Not a kmalloc allocation

And again ...

> [   66.181672] BUG: unable to handle page fault for address:
> ff40e4ea8fa50250

Oh, now it crashed.  But we have so much evidence of a bug in the caller
at this point that I don't think we can blame slab for falling over.
If you're double-freeing something that's _not_ in a freed slab, this
is the kind of thing we might expect?

You need to turn on the debugging options Vlastimil mentioned and try to
figure out what nvme is doing wrong.




* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-03 15:48         ` Matthew Wilcox
@ 2025-03-03 16:15           ` Vlastimil Babka
  2025-03-03 22:02             ` Vlastimil Babka
  0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2025-03-03 16:15 UTC (permalink / raw)
  To: Matthew Wilcox, Hannes Reinecke
  Cc: Sagi Grimberg, linux-nvme, linux-block, linux-mm

On 3/3/25 16:48, Matthew Wilcox wrote:
> You need to turn on the debugging options Vlastimil mentioned and try to
> figure out what nvme is doing wrong.

Agree, looks like some error path going wrong?
Since there seems to be actual non-large kmalloc usage involved, another
debug parameter that could help: CONFIG_SLUB_DEBUG=y, and boot with
"slab_debug=FZPU,kmalloc-*"



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-03 16:15           ` Vlastimil Babka
@ 2025-03-03 22:02             ` Vlastimil Babka
  2025-03-04  7:58               ` Hannes Reinecke
  0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2025-03-03 22:02 UTC (permalink / raw)
  To: Matthew Wilcox, Hannes Reinecke
  Cc: Sagi Grimberg, linux-nvme, linux-block, linux-mm

On 3/3/25 17:15, Vlastimil Babka wrote:
> On 3/3/25 16:48, Matthew Wilcox wrote:
>> You need to turn on the debugging options Vlastimil mentioned and try to
>> figure out what nvme is doing wrong.
> 
> Agree, looks like some error path going wrong?
> Since there seems to be actual non-large kmalloc usage involved, another
> debug parameter that could help: CONFIG_SLUB_DEBUG=y, and boot with
> "slab_debug=FZPU,kmalloc-*"

Also make sure you have CONFIG_DEBUG_VM please.



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-03 22:02             ` Vlastimil Babka
@ 2025-03-04  7:58               ` Hannes Reinecke
  2025-03-04  8:18                 ` Vlastimil Babka
  0 siblings, 1 reply; 36+ messages in thread
From: Hannes Reinecke @ 2025-03-04  7:58 UTC (permalink / raw)
  To: Vlastimil Babka, Matthew Wilcox, Hannes Reinecke
  Cc: Sagi Grimberg, linux-nvme, linux-block, linux-mm

On 3/3/25 23:02, Vlastimil Babka wrote:
> On 3/3/25 17:15, Vlastimil Babka wrote:
>> On 3/3/25 16:48, Matthew Wilcox wrote:
>>> You need to turn on the debugging options Vlastimil mentioned and try to
>>> figure out what nvme is doing wrong.
>>
>> Agree, looks like some error path going wrong?
>> Since there seems to be actual non-large kmalloc usage involved, another
>> debug parameter that could help: CONFIG_SLUB_DEBUG=y, and boot with
>> "slab_debug=FZPU,kmalloc-*"
> 
> Also make sure you have CONFIG_DEBUG_VM please.
> 
Here you go:

[  134.506802] page: refcount:0 mapcount:0 mapping:0000000000000000 
index:0x0 pfn:0x101ef8
[  134.509253] head: order:3 mapcount:0 entire_mapcount:0 
nr_pages_mapped:0 pincount:0
[  134.511594] flags: 
0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[  134.513556] page_type: f5(slab)
[  134.513563] raw: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810 
ffff8881000402f0
[  134.513568] raw: 0000000000000000 00000000000a000a 00000000f5000000 
0000000000000000
[  134.513572] head: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810 
ffff8881000402f0
[  134.513575] head: 0000000000000000 00000000000a000a 00000000f5000000 
0000000000000000
[  134.513579] head: 0017ffffc0000003 ffffea000407be01 ffffffffffffffff 
0000000000000000
[  134.513583] head: 0000000000000008 0000000000000000 00000000ffffffff 
0000000000000000
[  134.513585] page dumped because: VM_BUG_ON_FOLIO(((unsigned int) 
folio_ref_count(folio) + 127u <= 127u))
[  134.513615] ------------[ cut here ]------------
[  134.529822] kernel BUG at ./include/linux/mm.h:1455!
[  134.529835] Oops: invalid opcode: 0000 [#1] PREEMPT SMP 
DEBUG_PAGEALLOC NOPTI
[  134.529843] CPU: 0 UID: 0 PID: 274 Comm: kworker/0:1H Kdump: loaded 
Tainted: G            E      6.14.0-rc4-default+ #309 
03b131f1ef70944969b40df9d90a283ed638556f
[  134.536577] Tainted: [E]=UNSIGNED_MODULE
[  134.536580] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
0.0.0 02/06/2015
[  134.536583] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
[  134.536595] RIP: 0010:__iov_iter_get_pages_alloc+0x676/0x710
[  134.542810] Code: e8 4c 39 e0 49 0f 47 c4 48 01 45 08 48 29 45 18 e9 
90 fa ff ff 48 83 ef 01 e9 7f fe ff ff 48 c7 c6 40 57 4f 82 e8 6a e2 ce 
ff <0f> 0b e8 43 b8 b1 ff eb c5 f7 c1 ff 0f 00 00 48 89 cf 0f 85 4f ff
[  134.542816] RSP: 0018:ffffc900004579d8 EFLAGS: 00010282
[  134.542821] RAX: 000000000000005c RBX: ffffc90000457a90 RCX: 
0000000000000027
[  134.542825] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 
ffff88817f423748
[  134.542828] RBP: ffffc90000457d60 R08: 0000000000000000 R09: 
0000000000000001
[  134.554485] R10: ffffc900004579c0 R11: ffffc90000457720 R12: 
0000000000000000
[  134.554488] R13: ffffea000407be40 R14: ffffc90000457a70 R15: 
ffffc90000457d60
[  134.554495] FS:  0000000000000000(0000) GS:ffff88817f400000(0000) 
knlGS:0000000000000000
[  134.554499] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  134.554502] CR2: 0000556b0675b600 CR3: 0000000106bd8000 CR4: 
0000000000350ef0
[  134.554509] Call Trace:
[  134.554512]  <TASK>
[  134.554516]  ? __die_body+0x1a/0x60
[  134.554525]  ? die+0x38/0x60
[  134.554531]  ? do_trap+0x10f/0x120
[  134.554538]  ? __iov_iter_get_pages_alloc+0x676/0x710
[  134.568839]  ? do_error_trap+0x64/0xa0
[  134.568847]  ? __iov_iter_get_pages_alloc+0x676/0x710
[  134.568855]  ? exc_invalid_op+0x53/0x60
[  134.572489]  ? __iov_iter_get_pages_alloc+0x676/0x710
[  134.572496]  ? asm_exc_invalid_op+0x16/0x20
[  134.572512]  ? __iov_iter_get_pages_alloc+0x676/0x710
[  134.576726]  ? __iov_iter_get_pages_alloc+0x676/0x710
[  134.576733]  ? srso_return_thunk+0x5/0x5f
[  134.576740]  ? ___slab_alloc+0x924/0xb60
[  134.580253]  ? mempool_alloc_noprof+0x41/0x190
[  134.580262]  ? tls_get_rec+0x3d/0x1b0 [tls 
47f199c97f69357468c91efdbba24395e9dbfa77]
[  134.580282]  iov_iter_get_pages2+0x19/0x30
[  134.580289]  sk_msg_zerocopy_from_iter+0x85/0x1d0
[  134.580301]  ? srso_return_thunk+0x5/0x5f
[  134.586842]  ? srso_return_thunk+0x5/0x5f
[  134.586847]  ? __kmalloc_noprof+0x187/0x500
[  134.586854]  ? srso_return_thunk+0x5/0x5f
[  134.586859]  ? __sk_mem_raise_allocated+0x2ba/0x4a0
[  134.591697]  ? srso_return_thunk+0x5/0x5f
[  134.591703]  ? sk_page_frag_refill+0x19/0xb0
[  134.591708]  ? srso_return_thunk+0x5/0x5f
[  134.591712]  ? sk_msg_alloc+0x5a/0x2b0
[  134.591722]  tls_sw_sendmsg+0x6bf/0x9b0 [tls 
47f199c97f69357468c91efdbba24395e9dbfa77]
[  134.598284]  __sock_sendmsg+0x98/0xc0
[  134.598293]  sock_sendmsg+0x5c/0xa0
[  134.600490]  ? srso_return_thunk+0x5/0x5f
[  134.600495]  ? __sock_sendmsg+0x98/0xc0
[  134.600500]  ? srso_return_thunk+0x5/0x5f
[  134.600504]  ? sock_sendmsg+0x5c/0xa0
[  134.600515]  nvme_tcp_try_send_data+0x13f/0x410 [nvme_tcp 
71d3ffab2b48b41b11556946fd79065f8f8b0f42]
[  134.607125]  ? __dequeue_entity+0x401/0x470
[  134.607142]  nvme_tcp_try_send+0x299/0x330 [nvme_tcp 
71d3ffab2b48b41b11556946fd79065f8f8b0f42]
[  134.607153]  nvme_tcp_io_work+0x37/0xb0 [nvme_tcp 
71d3ffab2b48b41b11556946fd79065f8f8b0f42]
[  134.607162]  process_scheduled_works+0x97/0x400
[  134.613657]  ? __pfx_worker_thread+0x10/0x10
[  134.613663]  worker_thread+0x105/0x240
[  134.613669]  ? __pfx_worker_thread+0x10/0x10
[  134.613675]  kthread+0xec/0x200
[  134.618136]  ? __pfx_kthread+0x10/0x10
[  134.618144]  ret_from_fork+0x30/0x50
[  134.618151]  ? __pfx_kthread+0x10/0x10
[  134.618157]  ret_from_fork_asm+0x1a/0x30
[  134.622519]  </TASK>
[  134.622522] Modules linked in: tls(E) nvme_tcp(E) af_packet(E) 
iscsi_ibft(E) iscsi_boot_sysfs(E) xfs(E) nls_iso8859_1(E) nls_cp437(E) 
vfat(E) fat(E) iTCO_wdt(E) intel_rapl_msr(E) intel_pmc_bxt(E) 
intel_rapl_common(E) iTCO_vendor_support(E) bnxt_en(E) i2c_i801(E) 
i2c_mux(E) lpc_ich(E) i2c_smbus(E) joydev(E) mfd_core(E) 
virtio_balloon(E) button(E) nvme_fabrics(E) nvme_keyring(E) nvme_core(E) 
fuse(E) nvme_auth(E) efi_pstore(E) configfs(E) dmi_sysfs(E) ip_tables(E) 
x_tables(E) hid_generic(E) usbhid(E) qxl(E) ahci(E) drm_client_lib(E) 
libahci(E) drm_exec(E) xhci_pci(E) drm_ttm_helper(E) virtio_scsi(E) 
libata(E) ttm(E) xhci_hcd(E) sd_mod(E) scsi_dh_emc(E) drm_kms_helper(E) 
scsi_dh_rdac(E) ghash_clmulni_intel(E) scsi_dh_alua(E) sg(E) 
sha512_ssse3(E) sha256_ssse3(E) drm(E) usbcore(E) scsi_mod(E) 
sha1_ssse3(E) scsi_common(E) serio_raw(E) btrfs(E) blake2b_generic(E) 
xor(E) raid6_pq(E) efivarfs(E) qemu_fw_cfg(E) virtio_rng(E) 
aesni_intel(E) crypto_simd(E) cryptd(E)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-04  7:58               ` Hannes Reinecke
@ 2025-03-04  8:18                 ` Vlastimil Babka
  2025-03-04 10:20                   ` Hannes Reinecke
  0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2025-03-04  8:18 UTC (permalink / raw)
  To: Hannes Reinecke, Matthew Wilcox, Hannes Reinecke
  Cc: Sagi Grimberg, linux-nvme, linux-block, linux-mm, Harry Yoo

On 3/4/25 08:58, Hannes Reinecke wrote:
> On 3/3/25 23:02, Vlastimil Babka wrote:
>> On 3/3/25 17:15, Vlastimil Babka wrote:
>>> On 3/3/25 16:48, Matthew Wilcox wrote:
>>>> You need to turn on the debugging options Vlastimil mentioned and try to
>>>> figure out what nvme is doing wrong.
>>>
>>> Agree, looks like some error path going wrong?
>>> Since there seems to be actual non-large kmalloc usage involved, another
>>> debug parameter that could help: CONFIG_SLUB_DEBUG=y, and boot with
>>> "slab_debug=FZPU,kmalloc-*"
>> 
>> Also make sure you have CONFIG_DEBUG_VM please.
>> 
> Here you go:
> 
> [  134.506802] page: refcount:0 mapcount:0 mapping:0000000000000000 
> index:0x0 pfn:0x101ef8
> [  134.509253] head: order:3 mapcount:0 entire_mapcount:0 
> nr_pages_mapped:0 pincount:0
> [  134.511594] flags: 
> 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
> [  134.513556] page_type: f5(slab)
> [  134.513563] raw: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810 
> ffff8881000402f0
> [  134.513568] raw: 0000000000000000 00000000000a000a 00000000f5000000 
> 0000000000000000
> [  134.513572] head: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810 
> ffff8881000402f0
> [  134.513575] head: 0000000000000000 00000000000a000a 00000000f5000000 
> 0000000000000000
> [  134.513579] head: 0017ffffc0000003 ffffea000407be01 ffffffffffffffff 
> 0000000000000000
> [  134.513583] head: 0000000000000008 0000000000000000 00000000ffffffff 
> 0000000000000000
> [  134.513585] page dumped because: VM_BUG_ON_FOLIO(((unsigned int) 
> folio_ref_count(folio) + 127u <= 127u))
> [  134.513615] ------------[ cut here ]------------
> [  134.529822] kernel BUG at ./include/linux/mm.h:1455!

Yeah, just as I suspected, folio_get() says the refcount is 0.

> [  134.529835] Oops: invalid opcode: 0000 [#1] PREEMPT SMP 
> DEBUG_PAGEALLOC NOPTI
> [  134.529843] CPU: 0 UID: 0 PID: 274 Comm: kworker/0:1H Kdump: loaded 
> Tainted: G            E      6.14.0-rc4-default+ #309 
> 03b131f1ef70944969b40df9d90a283ed638556f
> [  134.536577] Tainted: [E]=UNSIGNED_MODULE
> [  134.536580] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 0.0.0 02/06/2015
> [  134.536583] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
> [  134.536595] RIP: 0010:__iov_iter_get_pages_alloc+0x676/0x710
> [  134.542810] Code: e8 4c 39 e0 49 0f 47 c4 48 01 45 08 48 29 45 18 e9 
> 90 fa ff ff 48 83 ef 01 e9 7f fe ff ff 48 c7 c6 40 57 4f 82 e8 6a e2 ce 
> ff <0f> 0b e8 43 b8 b1 ff eb c5 f7 c1 ff 0f 00 00 48 89 cf 0f 85 4f ff
> [  134.542816] RSP: 0018:ffffc900004579d8 EFLAGS: 00010282
> [  134.542821] RAX: 000000000000005c RBX: ffffc90000457a90 RCX: 
> 0000000000000027
> [  134.542825] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 
> ffff88817f423748
> [  134.542828] RBP: ffffc90000457d60 R08: 0000000000000000 R09: 
> 0000000000000001
> [  134.554485] R10: ffffc900004579c0 R11: ffffc90000457720 R12: 
> 0000000000000000
> [  134.554488] R13: ffffea000407be40 R14: ffffc90000457a70 R15: 
> ffffc90000457d60
> [  134.554495] FS:  0000000000000000(0000) GS:ffff88817f400000(0000) 
> knlGS:0000000000000000
> [  134.554499] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  134.554502] CR2: 0000556b0675b600 CR3: 0000000106bd8000 CR4: 
> 0000000000350ef0
> [  134.554509] Call Trace:
> [  134.554512]  <TASK>
> [  134.554516]  ? __die_body+0x1a/0x60
> [  134.554525]  ? die+0x38/0x60
> [  134.554531]  ? do_trap+0x10f/0x120
> [  134.554538]  ? __iov_iter_get_pages_alloc+0x676/0x710
> [  134.568839]  ? do_error_trap+0x64/0xa0
> [  134.568847]  ? __iov_iter_get_pages_alloc+0x676/0x710
> [  134.568855]  ? exc_invalid_op+0x53/0x60
> [  134.572489]  ? __iov_iter_get_pages_alloc+0x676/0x710
> [  134.572496]  ? asm_exc_invalid_op+0x16/0x20
> [  134.572512]  ? __iov_iter_get_pages_alloc+0x676/0x710
> [  134.576726]  ? __iov_iter_get_pages_alloc+0x676/0x710
> [  134.576733]  ? srso_return_thunk+0x5/0x5f
> [  134.576740]  ? ___slab_alloc+0x924/0xb60
> [  134.580253]  ? mempool_alloc_noprof+0x41/0x190
> [  134.580262]  ? tls_get_rec+0x3d/0x1b0 [tls 
> 47f199c97f69357468c91efdbba24395e9dbfa77]
> [  134.580282]  iov_iter_get_pages2+0x19/0x30

Presumably that's __iov_iter_get_pages_alloc() doing get_page() either in
the " if (iov_iter_is_bvec(i)) " branch or via iter_folioq_get_pages()?

Which doesn't work for a sub-size kmalloc() from a slab folio, which after
the frozen refcount conversion no longer supports get_page().

The question is if this is a mistake specific for this path that's easy to
fix or there are more paths that do this. At the very least the pinning of
page through a kmalloc() allocation from it is useless - the object itself
has to be kfree()'d and that would never happen through a put_page()
reaching zero.
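
For reference, the check named in the dump can be modelled in plain
userspace C. This is only a sketch: the helper name is made up for
illustration and it works on a bare int instead of a folio. It shows that
the expression is true for a refcount of 0 and for the small negative
values -127..-1, so a frozen (zero) refcount trips it on the first
get_page().

```c
#include <stdbool.h>

/*
 * Userspace model of the expression printed in the oops:
 *   (unsigned int) folio_ref_count(folio) + 127u <= 127u
 * Hypothetical helper name; not the kernel's own function.
 */
static bool ref_zero_or_close_to_overflow(int refcount)
{
	/* Unsigned wraparound maps -127..0 onto 0..127; everything else is larger. */
	return (unsigned int)refcount + 127u <= 127u;
}
```

With this model a refcount of 0 or -127 reports true, while 1 and -128
report false.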

> [  134.580289]  sk_msg_zerocopy_from_iter+0x85/0x1d0
> [  134.580301]  ? srso_return_thunk+0x5/0x5f
> [  134.586842]  ? srso_return_thunk+0x5/0x5f
> [  134.586847]  ? __kmalloc_noprof+0x187/0x500
> [  134.586854]  ? srso_return_thunk+0x5/0x5f
> [  134.586859]  ? __sk_mem_raise_allocated+0x2ba/0x4a0
> [  134.591697]  ? srso_return_thunk+0x5/0x5f
> [  134.591703]  ? sk_page_frag_refill+0x19/0xb0
> [  134.591708]  ? srso_return_thunk+0x5/0x5f
> [  134.591712]  ? sk_msg_alloc+0x5a/0x2b0
> [  134.591722]  tls_sw_sendmsg+0x6bf/0x9b0 [tls 
> 47f199c97f69357468c91efdbba24395e9dbfa77]
> [  134.598284]  __sock_sendmsg+0x98/0xc0
> [  134.598293]  sock_sendmsg+0x5c/0xa0
> [  134.600490]  ? srso_return_thunk+0x5/0x5f
> [  134.600495]  ? __sock_sendmsg+0x98/0xc0
> [  134.600500]  ? srso_return_thunk+0x5/0x5f
> [  134.600504]  ? sock_sendmsg+0x5c/0xa0
> [  134.600515]  nvme_tcp_try_send_data+0x13f/0x410 [nvme_tcp 
> 71d3ffab2b48b41b11556946fd79065f8f8b0f42]
> [  134.607125]  ? __dequeue_entity+0x401/0x470
> [  134.607142]  nvme_tcp_try_send+0x299/0x330 [nvme_tcp 
> 71d3ffab2b48b41b11556946fd79065f8f8b0f42]
> [  134.607153]  nvme_tcp_io_work+0x37/0xb0 [nvme_tcp 
> 71d3ffab2b48b41b11556946fd79065f8f8b0f42]
> [  134.607162]  process_scheduled_works+0x97/0x400
> [  134.613657]  ? __pfx_worker_thread+0x10/0x10
> [  134.613663]  worker_thread+0x105/0x240
> [  134.613669]  ? __pfx_worker_thread+0x10/0x10
> [  134.613675]  kthread+0xec/0x200
> [  134.618136]  ? __pfx_kthread+0x10/0x10
> [  134.618144]  ret_from_fork+0x30/0x50
> [  134.618151]  ? __pfx_kthread+0x10/0x10
> [  134.618157]  ret_from_fork_asm+0x1a/0x30
> [  134.622519]  </TASK>
> [  134.622522] Modules linked in: tls(E) nvme_tcp(E) af_packet(E) 
> iscsi_ibft(E) iscsi_boot_sysfs(E) xfs(E) nls_iso8859_1(E) nls_cp437(E) 
> vfat(E) fat(E) iTCO_wdt(E) intel_rapl_msr(E) intel_pmc_bxt(E) 
> intel_rapl_common(E) iTCO_vendor_support(E) bnxt_en(E) i2c_i801(E) 
> i2c_mux(E) lpc_ich(E) i2c_smbus(E) joydev(E) mfd_core(E) 
> virtio_balloon(E) button(E) nvme_fabrics(E) nvme_keyring(E) nvme_core(E) 
> fuse(E) nvme_auth(E) efi_pstore(E) configfs(E) dmi_sysfs(E) ip_tables(E) 
> x_tables(E) hid_generic(E) usbhid(E) qxl(E) ahci(E) drm_client_lib(E) 
> libahci(E) drm_exec(E) xhci_pci(E) drm_ttm_helper(E) virtio_scsi(E) 
> libata(E) ttm(E) xhci_hcd(E) sd_mod(E) scsi_dh_emc(E) drm_kms_helper(E) 
> scsi_dh_rdac(E) ghash_clmulni_intel(E) scsi_dh_alua(E) sg(E) 
> sha512_ssse3(E) sha256_ssse3(E) drm(E) usbcore(E) scsi_mod(E) 
> sha1_ssse3(E) scsi_common(E) serio_raw(E) btrfs(E) blake2b_generic(E) 
> xor(E) raid6_pq(E) efivarfs(E) qemu_fw_cfg(E) virtio_rng(E) 
> aesni_intel(E) crypto_simd(E) cryptd(E)
> 
> Cheers,
> 
> Hannes




* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-04  8:18                 ` Vlastimil Babka
@ 2025-03-04 10:20                   ` Hannes Reinecke
  2025-03-04 10:26                     ` Vlastimil Babka
  0 siblings, 1 reply; 36+ messages in thread
From: Hannes Reinecke @ 2025-03-04 10:20 UTC (permalink / raw)
  To: Vlastimil Babka, Hannes Reinecke, Matthew Wilcox
  Cc: Sagi Grimberg, linux-nvme, linux-block, linux-mm, Harry Yoo

On 3/4/25 09:18, Vlastimil Babka wrote:
> On 3/4/25 08:58, Hannes Reinecke wrote:
>> On 3/3/25 23:02, Vlastimil Babka wrote:
>>> On 3/3/25 17:15, Vlastimil Babka wrote:
>>>> On 3/3/25 16:48, Matthew Wilcox wrote:
>>>>> You need to turn on the debugging options Vlastimil mentioned and try to
>>>>> figure out what nvme is doing wrong.
>>>>
>>>> Agree, looks like some error path going wrong?
>>>> Since there seems to be actual non-large kmalloc usage involved, another
>>>> debug parameter that could help: CONFIG_SLUB_DEBUG=y, and boot with
>>>> "slab_debug=FZPU,kmalloc-*"
>>>
>>> Also make sure you have CONFIG_DEBUG_VM please.
>>>
>> Here you go:
>>
>> [  134.506802] page: refcount:0 mapcount:0 mapping:0000000000000000
>> index:0x0 pfn:0x101ef8
>> [  134.509253] head: order:3 mapcount:0 entire_mapcount:0
>> nr_pages_mapped:0 pincount:0
>> [  134.511594] flags:
>> 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
>> [  134.513556] page_type: f5(slab)
>> [  134.513563] raw: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810
>> ffff8881000402f0
>> [  134.513568] raw: 0000000000000000 00000000000a000a 00000000f5000000
>> 0000000000000000
>> [  134.513572] head: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810
>> ffff8881000402f0
>> [  134.513575] head: 0000000000000000 00000000000a000a 00000000f5000000
>> 0000000000000000
>> [  134.513579] head: 0017ffffc0000003 ffffea000407be01 ffffffffffffffff
>> 0000000000000000
>> [  134.513583] head: 0000000000000008 0000000000000000 00000000ffffffff
>> 0000000000000000
>> [  134.513585] page dumped because: VM_BUG_ON_FOLIO(((unsigned int)
>> folio_ref_count(folio) + 127u <= 127u))
>> [  134.513615] ------------[ cut here ]------------
>> [  134.529822] kernel BUG at ./include/linux/mm.h:1455!
> 
> Yeah, just as I suspected, folio_get() says the refcount is 0.
> 
>> [  134.529835] Oops: invalid opcode: 0000 [#1] PREEMPT SMP
>> DEBUG_PAGEALLOC NOPTI
>> [  134.529843] CPU: 0 UID: 0 PID: 274 Comm: kworker/0:1H Kdump: loaded
>> Tainted: G            E      6.14.0-rc4-default+ #309
>> 03b131f1ef70944969b40df9d90a283ed638556f
>> [  134.536577] Tainted: [E]=UNSIGNED_MODULE
>> [  134.536580] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>> 0.0.0 02/06/2015
>> [  134.536583] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
>> [  134.536595] RIP: 0010:__iov_iter_get_pages_alloc+0x676/0x710
>> [  134.542810] Code: e8 4c 39 e0 49 0f 47 c4 48 01 45 08 48 29 45 18 e9
>> 90 fa ff ff 48 83 ef 01 e9 7f fe ff ff 48 c7 c6 40 57 4f 82 e8 6a e2 ce
>> ff <0f> 0b e8 43 b8 b1 ff eb c5 f7 c1 ff 0f 00 00 48 89 cf 0f 85 4f ff
>> [  134.542816] RSP: 0018:ffffc900004579d8 EFLAGS: 00010282
>> [  134.542821] RAX: 000000000000005c RBX: ffffc90000457a90 RCX:
>> 0000000000000027
>> [  134.542825] RDX: 0000000000000000 RSI: 0000000000000002 RDI:
>> ffff88817f423748
>> [  134.542828] RBP: ffffc90000457d60 R08: 0000000000000000 R09:
>> 0000000000000001
>> [  134.554485] R10: ffffc900004579c0 R11: ffffc90000457720 R12:
>> 0000000000000000
>> [  134.554488] R13: ffffea000407be40 R14: ffffc90000457a70 R15:
>> ffffc90000457d60
>> [  134.554495] FS:  0000000000000000(0000) GS:ffff88817f400000(0000)
>> knlGS:0000000000000000
>> [  134.554499] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  134.554502] CR2: 0000556b0675b600 CR3: 0000000106bd8000 CR4:
>> 0000000000350ef0
>> [  134.554509] Call Trace:
>> [  134.554512]  <TASK>
>> [  134.554516]  ? __die_body+0x1a/0x60
>> [  134.554525]  ? die+0x38/0x60
>> [  134.554531]  ? do_trap+0x10f/0x120
>> [  134.554538]  ? __iov_iter_get_pages_alloc+0x676/0x710
>> [  134.568839]  ? do_error_trap+0x64/0xa0
>> [  134.568847]  ? __iov_iter_get_pages_alloc+0x676/0x710
>> [  134.568855]  ? exc_invalid_op+0x53/0x60
>> [  134.572489]  ? __iov_iter_get_pages_alloc+0x676/0x710
>> [  134.572496]  ? asm_exc_invalid_op+0x16/0x20
>> [  134.572512]  ? __iov_iter_get_pages_alloc+0x676/0x710
>> [  134.576726]  ? __iov_iter_get_pages_alloc+0x676/0x710
>> [  134.576733]  ? srso_return_thunk+0x5/0x5f
>> [  134.576740]  ? ___slab_alloc+0x924/0xb60
>> [  134.580253]  ? mempool_alloc_noprof+0x41/0x190
>> [  134.580262]  ? tls_get_rec+0x3d/0x1b0 [tls
>> 47f199c97f69357468c91efdbba24395e9dbfa77]
>> [  134.580282]  iov_iter_get_pages2+0x19/0x30
> 
> Presumably that's __iov_iter_get_pages_alloc() doing get_page() either in
> the " if (iov_iter_is_bvec(i)) " branch or via iter_folioq_get_pages()?
> 
Looks like it.

> Which doesn't work for a sub-size kmalloc() from a slab folio, which after
> the frozen refcount conversion no longer supports get_page().
> 
> The question is if this is a mistake specific for this path that's easy to
> fix or there are more paths that do this. At the very least the pinning of
> page through a kmalloc() allocation from it is useless - the object itself
> has to be kfree()'d and that would never happen through a put_page()
> reaching zero.
> 
Looks like a specific mistake.
tls_sw is the only user of sk_msg_zerocopy_from_iter()
(which is calling into __iov_iter_get_pages_alloc()).

And, more to the point, tls_sw messes up iov pacing coming in from
the upper layers.
So even if the upper layers send individual iovs (where each iov might
contain different allocation types), tls_sw is packing them together 
into full records. So it might end up with iovs having _different_ 
allocations.
Which would explain why we only see it with TLS, but not with normal
connections.

Or so my reasoning goes. Not sure if that's correct.

So I'd be happy with an 'easy' fix for now. Obviously :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.com                               +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-04 10:20                   ` Hannes Reinecke
@ 2025-03-04 10:26                     ` Vlastimil Babka
  2025-03-04 15:11                       ` Hannes Reinecke
                                         ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Vlastimil Babka @ 2025-03-04 10:26 UTC (permalink / raw)
  To: Hannes Reinecke, Hannes Reinecke, Matthew Wilcox, Boris Pismenny,
	John Fastabend, Jakub Kicinski
  Cc: Sagi Grimberg, linux-nvme, linux-block, linux-mm, Harry Yoo, netdev

+Cc NETWORKING [TLS] maintainers and netdev for input, thanks.

The full error is here:
https://lore.kernel.org/all/fcfa11c6-2738-4a2e-baa8-09fa8f79cbf3@suse.de/

On 3/4/25 11:20, Hannes Reinecke wrote:
> On 3/4/25 09:18, Vlastimil Babka wrote:
>> On 3/4/25 08:58, Hannes Reinecke wrote:
>>> On 3/3/25 23:02, Vlastimil Babka wrote:
>>>> On 3/3/25 17:15, Vlastimil Babka wrote:
>>>>> On 3/3/25 16:48, Matthew Wilcox wrote:
>>>>>> You need to turn on the debugging options Vlastimil mentioned and try to
>>>>>> figure out what nvme is doing wrong.
>>>>>
>>>>> Agree, looks like some error path going wrong?
>>>>> Since there seems to be actual non-large kmalloc usage involved, another
>>>>> debug parameter that could help: CONFIG_SLUB_DEBUG=y, and boot with
>>>>> "slab_debug=FZPU,kmalloc-*"
>>>>
>>>> Also make sure you have CONFIG_DEBUG_VM please.
>>>>
>>> Here you go:
>>>
>>> [  134.506802] page: refcount:0 mapcount:0 mapping:0000000000000000
>>> index:0x0 pfn:0x101ef8
>>> [  134.509253] head: order:3 mapcount:0 entire_mapcount:0
>>> nr_pages_mapped:0 pincount:0
>>> [  134.511594] flags:
>>> 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
>>> [  134.513556] page_type: f5(slab)
>>> [  134.513563] raw: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810
>>> ffff8881000402f0
>>> [  134.513568] raw: 0000000000000000 00000000000a000a 00000000f5000000
>>> 0000000000000000
>>> [  134.513572] head: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810
>>> ffff8881000402f0
>>> [  134.513575] head: 0000000000000000 00000000000a000a 00000000f5000000
>>> 0000000000000000
>>> [  134.513579] head: 0017ffffc0000003 ffffea000407be01 ffffffffffffffff
>>> 0000000000000000
>>> [  134.513583] head: 0000000000000008 0000000000000000 00000000ffffffff
>>> 0000000000000000
>>> [  134.513585] page dumped because: VM_BUG_ON_FOLIO(((unsigned int)
>>> folio_ref_count(folio) + 127u <= 127u))
>>> [  134.513615] ------------[ cut here ]------------
>>> [  134.529822] kernel BUG at ./include/linux/mm.h:1455!
>> 
>> Yeah, just as I suspected, folio_get() says the refcount is 0.
>> 
>>> [  134.529835] Oops: invalid opcode: 0000 [#1] PREEMPT SMP
>>> DEBUG_PAGEALLOC NOPTI
>>> [  134.529843] CPU: 0 UID: 0 PID: 274 Comm: kworker/0:1H Kdump: loaded
>>> Tainted: G            E      6.14.0-rc4-default+ #309
>>> 03b131f1ef70944969b40df9d90a283ed638556f
>>> [  134.536577] Tainted: [E]=UNSIGNED_MODULE
>>> [  134.536580] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>>> 0.0.0 02/06/2015
>>> [  134.536583] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
>>> [  134.536595] RIP: 0010:__iov_iter_get_pages_alloc+0x676/0x710
>>> [  134.542810] Code: e8 4c 39 e0 49 0f 47 c4 48 01 45 08 48 29 45 18 e9
>>> 90 fa ff ff 48 83 ef 01 e9 7f fe ff ff 48 c7 c6 40 57 4f 82 e8 6a e2 ce
>>> ff <0f> 0b e8 43 b8 b1 ff eb c5 f7 c1 ff 0f 00 00 48 89 cf 0f 85 4f ff
>>> [  134.542816] RSP: 0018:ffffc900004579d8 EFLAGS: 00010282
>>> [  134.542821] RAX: 000000000000005c RBX: ffffc90000457a90 RCX:
>>> 0000000000000027
>>> [  134.542825] RDX: 0000000000000000 RSI: 0000000000000002 RDI:
>>> ffff88817f423748
>>> [  134.542828] RBP: ffffc90000457d60 R08: 0000000000000000 R09:
>>> 0000000000000001
>>> [  134.554485] R10: ffffc900004579c0 R11: ffffc90000457720 R12:
>>> 0000000000000000
>>> [  134.554488] R13: ffffea000407be40 R14: ffffc90000457a70 R15:
>>> ffffc90000457d60
>>> [  134.554495] FS:  0000000000000000(0000) GS:ffff88817f400000(0000)
>>> knlGS:0000000000000000
>>> [  134.554499] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  134.554502] CR2: 0000556b0675b600 CR3: 0000000106bd8000 CR4:
>>> 0000000000350ef0
>>> [  134.554509] Call Trace:
>>> [  134.554512]  <TASK>
>>> [  134.554516]  ? __die_body+0x1a/0x60
>>> [  134.554525]  ? die+0x38/0x60
>>> [  134.554531]  ? do_trap+0x10f/0x120
>>> [  134.554538]  ? __iov_iter_get_pages_alloc+0x676/0x710
>>> [  134.568839]  ? do_error_trap+0x64/0xa0
>>> [  134.568847]  ? __iov_iter_get_pages_alloc+0x676/0x710
>>> [  134.568855]  ? exc_invalid_op+0x53/0x60
>>> [  134.572489]  ? __iov_iter_get_pages_alloc+0x676/0x710
>>> [  134.572496]  ? asm_exc_invalid_op+0x16/0x20
>>> [  134.572512]  ? __iov_iter_get_pages_alloc+0x676/0x710
>>> [  134.576726]  ? __iov_iter_get_pages_alloc+0x676/0x710
>>> [  134.576733]  ? srso_return_thunk+0x5/0x5f
>>> [  134.576740]  ? ___slab_alloc+0x924/0xb60
>>> [  134.580253]  ? mempool_alloc_noprof+0x41/0x190
>>> [  134.580262]  ? tls_get_rec+0x3d/0x1b0 [tls
>>> 47f199c97f69357468c91efdbba24395e9dbfa77]
>>> [  134.580282]  iov_iter_get_pages2+0x19/0x30
>> 
>> Presumably that's __iov_iter_get_pages_alloc() doing get_page() either in
>> the " if (iov_iter_is_bvec(i)) " branch or via iter_folioq_get_pages()?
>> 
> Looks like it.
> 
>> Which doesn't work for a sub-size kmalloc() from a slab folio, which after
>> the frozen refcount conversion no longer supports get_page().
>> 
>> The question is if this is a mistake specific for this path that's easy to
>> fix or there are more paths that do this. At the very least the pinning of
>> page through a kmalloc() allocation from it is useless - the object itself
>> has to be kfree()'d and that would never happen through a put_page()
>> reaching zero.
>> 
> Looks like a specific mistake.
> tls_sw is the only user of sk_msg_zerocopy_from_iter()
> (which is calling into __iov_iter_get_pages_alloc()).
> 
> And, more to the point, tls_sw messes up iov pacing coming in from
> the upper layers.
> So even if the upper layers send individual iovs (where each iov might
> contain different allocation types), tls_sw is packing them together 
> into full records. So it might end up with iovs having _different_ 
> allocations.
> Which would explain why we only see it with TLS, but not with normal
> connections.
> 
> Or so my reasoning goes. Not sure if that's correct.
> 
> So I'd be happy with an 'easy' fix for now. Obviously :-)
> 
> Cheers,
> 
> Hannes




* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-04 10:26                     ` Vlastimil Babka
@ 2025-03-04 15:11                       ` Hannes Reinecke
  2025-03-04 15:29                       ` Vlastimil Babka
  2025-03-04 16:14                       ` Matthew Wilcox
  2 siblings, 0 replies; 36+ messages in thread
From: Hannes Reinecke @ 2025-03-04 15:11 UTC (permalink / raw)
  To: Vlastimil Babka, Hannes Reinecke, Matthew Wilcox, Boris Pismenny,
	John Fastabend, Jakub Kicinski
  Cc: Sagi Grimberg, linux-nvme, linux-block, linux-mm, Harry Yoo, netdev

On 3/4/25 11:26, Vlastimil Babka wrote:
> On 3/4/25 11:20, Hannes Reinecke wrote:

[ .. ]
>> So I'd be happy with an 'easy' fix for now. Obviously :-)
>>

With this patch:

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 65f550cb5081..b035a9928cdd 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1190,8 +1190,14 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
                 if (!n)
                         return -ENOMEM;
                 p = *pages;
-               for (int k = 0; k < n; k++)
-                       get_page(p[k] = page + k);
+               for (int k = 0; k < n; k++) {
+                       if (!get_page_unless_zero(p[k] = page + k)) {
+                               pr_warn("%s: frozen page %d of %d\n",
+                                       __func__, k, n);
+                               return -ENOMEM;
+                       }
+               }
+
                 maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
                 i->count -= maxsize;
                 i->iov_offset += maxsize;


the system doesn't crash anymore:
[   51.520949] __iov_iter_get_pages_alloc: frozen page 0 of 1
[   51.536393] nvme nvme0: creating 4 I/O queues.
[   51.968897] nvme nvme0: mapped 4/0/0 default/read/poll queues.
[   51.972207] __iov_iter_get_pages_alloc: frozen page 0 of 1
[   51.974528] __iov_iter_get_pages_alloc: frozen page 0 of 1
[   51.976928] __iov_iter_get_pages_alloc: frozen page 0 of 1
[   51.978980] __iov_iter_get_pages_alloc: frozen page 0 of 1
[   51.981236] nvme nvme0: new ctrl: NQN "nqn.blktests-subsystem-1", addr 10.161.9.19:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:027a49dc-b554-40e5-b0f9-0a9ea03ec30c

and the allocation in question is coming from
drivers/nvme/host/fabrics.c:nvmf_connect_data_prep(), which
coincidentally _is_ a kmalloc()ed buffer.

But TLS doesn't work, either:

[   58.886754] nvme nvme0: I/O tag 1 (3001) type 4 opcode 0x18 (Keep Alive) QID 0 timeout
[   58.889112] nvme nvme0: starting error recovery
[   58.892176] nvme nvme0: failed nvme_keep_alive_end_io error=10
[   58.892282] nvme nvme0: reading non-mdts-limits failed: -4
[   58.902490] nvme nvme0: Reconnecting in 10 seconds...

(probably not surprising seeing that an error is returned ..)

So yeah, looks like TLS has issues with kmalloced data.
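
To make the "take a reference unless the count is zero" idea concrete,
here is a rough userspace model. It is not kernel code: every toy_* name
is invented for illustration and a C11 atomic stands in for the page
refcount. Unlike the debug patch above, the sketch also drops the
references already taken before it bails out; in the logs above n was
always 1, so that case would not have been hit there.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: only the refcount matters here. */
struct toy_page {
	atomic_int refcount;
};

/* Take a reference unless the count is zero (i.e. the page is frozen). */
static bool toy_get_unless_zero(struct toy_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		/* On failure 'old' is reloaded and the zero check is repeated. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void toy_put(struct toy_page *p)
{
	atomic_fetch_sub(&p->refcount, 1);
}

/*
 * Pin n pages. Returns n on success, or -1 if a frozen page is found,
 * in which case the references taken so far are released again.
 */
static int toy_pin_range(struct toy_page *pages, size_t n)
{
	for (size_t k = 0; k < n; k++) {
		if (!toy_get_unless_zero(&pages[k])) {
			while (k--)
				toy_put(&pages[k]);
			return -1;
		}
	}
	return (int)n;
}
```

Pinning a range whose last page has a refcount of zero returns -1 and
leaves the earlier pages at their original counts.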

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-04 10:26                     ` Vlastimil Babka
  2025-03-04 15:11                       ` Hannes Reinecke
@ 2025-03-04 15:29                       ` Vlastimil Babka
  2025-03-04 16:20                         ` Hannes Reinecke
  2025-03-04 16:14                       ` Matthew Wilcox
  2 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2025-03-04 15:29 UTC (permalink / raw)
  To: Hannes Reinecke, Hannes Reinecke, Matthew Wilcox, Boris Pismenny,
	John Fastabend, Jakub Kicinski
  Cc: Sagi Grimberg, linux-nvme, linux-block, linux-mm, Harry Yoo, netdev

On 3/4/25 11:26, Vlastimil Babka wrote:
> +Cc NETWORKING [TLS] maintainers and netdev for input, thanks.
> 
> The full error is here:
> https://lore.kernel.org/all/fcfa11c6-2738-4a2e-baa8-09fa8f79cbf3@suse.de/
> 
> On 3/4/25 11:20, Hannes Reinecke wrote:
>> On 3/4/25 09:18, Vlastimil Babka wrote:
>>> On 3/4/25 08:58, Hannes Reinecke wrote:
>>>> On 3/3/25 23:02, Vlastimil Babka wrote:
>>>>> On 3/3/25 17:15, Vlastimil Babka wrote:
>>>>>> On 3/3/25 16:48, Matthew Wilcox wrote:
>>>>>>> You need to turn on the debugging options Vlastimil mentioned and try to
>>>>>>> figure out what nvme is doing wrong.
>>>>>>
>>>>>> Agree, looks like some error path going wrong?
>>>>>> Since there seems to be actual non-large kmalloc usage involved, another
>>>>>> debug parameter that could help: CONFIG_SLUB_DEBUG=y, and boot with
>>>>>> "slab_debug=FZPU,kmalloc-*"
>>>>>
>>>>> Also make sure you have CONFIG_DEBUG_VM please.
>>>>>
>>>> Here you go:
>>>>
>>>> [  134.506802] page: refcount:0 mapcount:0 mapping:0000000000000000
>>>> index:0x0 pfn:0x101ef8
>>>> [  134.509253] head: order:3 mapcount:0 entire_mapcount:0
>>>> nr_pages_mapped:0 pincount:0
>>>> [  134.511594] flags:
>>>> 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
>>>> [  134.513556] page_type: f5(slab)
>>>> [  134.513563] raw: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810
>>>> ffff8881000402f0
>>>> [  134.513568] raw: 0000000000000000 00000000000a000a 00000000f5000000
>>>> 0000000000000000
>>>> [  134.513572] head: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810
>>>> ffff8881000402f0
>>>> [  134.513575] head: 0000000000000000 00000000000a000a 00000000f5000000
>>>> 0000000000000000
>>>> [  134.513579] head: 0017ffffc0000003 ffffea000407be01 ffffffffffffffff
>>>> 0000000000000000
>>>> [  134.513583] head: 0000000000000008 0000000000000000 00000000ffffffff
>>>> 0000000000000000
>>>> [  134.513585] page dumped because: VM_BUG_ON_FOLIO(((unsigned int)
>>>> folio_ref_count(folio) + 127u <= 127u))
>>>> [  134.513615] ------------[ cut here ]------------
>>>> [  134.529822] kernel BUG at ./include/linux/mm.h:1455!
>>> 
>>> Yeah, just as I suspected, folio_get() says the refcount is 0.
>>> 
>>>> [  134.529835] Oops: invalid opcode: 0000 [#1] PREEMPT SMP
>>>> DEBUG_PAGEALLOC NOPTI
>>>> [  134.529843] CPU: 0 UID: 0 PID: 274 Comm: kworker/0:1H Kdump: loaded
>>>> Tainted: G            E      6.14.0-rc4-default+ #309
>>>> 03b131f1ef70944969b40df9d90a283ed638556f
>>>> [  134.536577] Tainted: [E]=UNSIGNED_MODULE
>>>> [  134.536580] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>>>> 0.0.0 02/06/2015
>>>> [  134.536583] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
>>>> [  134.536595] RIP: 0010:__iov_iter_get_pages_alloc+0x676/0x710
>>>> [  134.542810] Code: e8 4c 39 e0 49 0f 47 c4 48 01 45 08 48 29 45 18 e9
>>>> 90 fa ff ff 48 83 ef 01 e9 7f fe ff ff 48 c7 c6 40 57 4f 82 e8 6a e2 ce
>>>> ff <0f> 0b e8 43 b8 b1 ff eb c5 f7 c1 ff 0f 00 00 48 89 cf 0f 85 4f ff
>>>> [  134.542816] RSP: 0018:ffffc900004579d8 EFLAGS: 00010282
>>>> [  134.542821] RAX: 000000000000005c RBX: ffffc90000457a90 RCX:
>>>> 0000000000000027
>>>> [  134.542825] RDX: 0000000000000000 RSI: 0000000000000002 RDI:
>>>> ffff88817f423748
>>>> [  134.542828] RBP: ffffc90000457d60 R08: 0000000000000000 R09:
>>>> 0000000000000001
>>>> [  134.554485] R10: ffffc900004579c0 R11: ffffc90000457720 R12:
>>>> 0000000000000000
>>>> [  134.554488] R13: ffffea000407be40 R14: ffffc90000457a70 R15:
>>>> ffffc90000457d60
>>>> [  134.554495] FS:  0000000000000000(0000) GS:ffff88817f400000(0000)
>>>> knlGS:0000000000000000
>>>> [  134.554499] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [  134.554502] CR2: 0000556b0675b600 CR3: 0000000106bd8000 CR4:
>>>> 0000000000350ef0
>>>> [  134.554509] Call Trace:
>>>> [  134.554512]  <TASK>
>>>> [  134.554516]  ? __die_body+0x1a/0x60
>>>> [  134.554525]  ? die+0x38/0x60
>>>> [  134.554531]  ? do_trap+0x10f/0x120
>>>> [  134.554538]  ? __iov_iter_get_pages_alloc+0x676/0x710
>>>> [  134.568839]  ? do_error_trap+0x64/0xa0
>>>> [  134.568847]  ? __iov_iter_get_pages_alloc+0x676/0x710
>>>> [  134.568855]  ? exc_invalid_op+0x53/0x60
>>>> [  134.572489]  ? __iov_iter_get_pages_alloc+0x676/0x710
>>>> [  134.572496]  ? asm_exc_invalid_op+0x16/0x20
>>>> [  134.572512]  ? __iov_iter_get_pages_alloc+0x676/0x710
>>>> [  134.576726]  ? __iov_iter_get_pages_alloc+0x676/0x710
>>>> [  134.576733]  ? srso_return_thunk+0x5/0x5f
>>>> [  134.576740]  ? ___slab_alloc+0x924/0xb60
>>>> [  134.580253]  ? mempool_alloc_noprof+0x41/0x190
>>>> [  134.580262]  ? tls_get_rec+0x3d/0x1b0 [tls
>>>> 47f199c97f69357468c91efdbba24395e9dbfa77]
>>>> [  134.580282]  iov_iter_get_pages2+0x19/0x30
>>> 
>>> Presumably that's __iov_iter_get_pages_alloc() doing get_page() either in
>>> the " if (iov_iter_is_bvec(i)) " branch or via iter_folioq_get_pages()?
>>> 
>> Looks like it.
>> 
>>> Which doesn't work for a sub-size kmalloc() from a slab folio, which after
>>> the frozen refcount conversion no longer supports get_page().
>>> 
>>> The question is if this is a mistake specific for this path that's easy to
>>> fix or there are more paths that do this. At the very least the pinning of
>>> page through a kmalloc() allocation from it is useless - the object itself
>>> has to be kfree()'d and that would never happen through a put_page()
>>> reaching zero.
>>> 
>> Looks like a specific mistake.
>> tls_sw is the only user of sk_msg_zerocopy_from_iter()
>> (which is calling into __iov_iter_get_pages_alloc()).

That's from tls_sw_sendmsg_locked(), right? But that's under:

if (!is_kvec && (full_record || eor) && !async_capable) {

Shouldn't is_kvec be true if we're dealing with a kernel buffer (kmalloc()) there?

>> And, more to the point, tls_sw messes up iov pacing coming in from
>> the upper layers.
>> So even if the upper layers send individual iovs (where each iov might
>> contain different allocation types), tls_sw is packing them together 
>> into full records. So it might end up with iovs having _different_ 
>> allocations.
>> Which would explain why we only see it with TLS, but not with normal
>> connections.
>> 
>> Or so my reasoning goes. Not sure if that's correct.
>> 
>> So I'd be happy with an 'easy' fix for now. Obviously :-)
>> 
>> Cheers,
>> 
>> Hannes
> 




* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-04 10:26                     ` Vlastimil Babka
  2025-03-04 15:11                       ` Hannes Reinecke
  2025-03-04 15:29                       ` Vlastimil Babka
@ 2025-03-04 16:14                       ` Matthew Wilcox
  2025-03-04 16:32                         ` Hannes Reinecke
  2 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2025-03-04 16:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hannes Reinecke, Boris Pismenny, John Fastabend, Jakub Kicinski,
	Sagi Grimberg, linux-nvme, linux-block, linux-mm, Harry Yoo,
	netdev

On Tue, Mar 04, 2025 at 11:26:07AM +0100, Vlastimil Babka wrote:
> +Cc NETWORKING [TLS] maintainers and netdev for input, thanks.
> 
> The full error is here:
> https://lore.kernel.org/all/fcfa11c6-2738-4a2e-baa8-09fa8f79cbf3@suse.de/
> 
> On 3/4/25 11:20, Hannes Reinecke wrote:
> > On 3/4/25 09:18, Vlastimil Babka wrote:
> >> On 3/4/25 08:58, Hannes Reinecke wrote:
> >>> On 3/3/25 23:02, Vlastimil Babka wrote:
> >>>> Also make sure you have CONFIG_DEBUG_VM please.
> >>>>
> >>> Here you go:
> >>>
> >>> [  134.506802] page: refcount:0 mapcount:0 mapping:0000000000000000
> >>> index:0x0 pfn:0x101ef8
> >>> [  134.509253] head: order:3 mapcount:0 entire_mapcount:0
> >>> nr_pages_mapped:0 pincount:0
> >>> [  134.511594] flags:
> >>> 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
> >>> [  134.513556] page_type: f5(slab)
> >>> [  134.513563] raw: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810
> >>> ffff8881000402f0
> >>> [  134.513568] raw: 0000000000000000 00000000000a000a 00000000f5000000
> >>> 0000000000000000
> >>> [  134.513572] head: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810
> >>> ffff8881000402f0
> >>> [  134.513575] head: 0000000000000000 00000000000a000a 00000000f5000000
> >>> 0000000000000000
> >>> [  134.513579] head: 0017ffffc0000003 ffffea000407be01 ffffffffffffffff
> >>> 0000000000000000
> >>> [  134.513583] head: 0000000000000008 0000000000000000 00000000ffffffff
> >>> 0000000000000000
> >>> [  134.513585] page dumped because: VM_BUG_ON_FOLIO(((unsigned int)
> >>> folio_ref_count(folio) + 127u <= 127u))
> >>> [  134.513615] ------------[ cut here ]------------
> >>> [  134.529822] kernel BUG at ./include/linux/mm.h:1455!
> >> 
> >> Yeah, just as I suspected, folio_get() says the refcount is 0.

... and it has a page_type of f5 (slab)

> >>> [  134.554509] Call Trace:
> >>> [  134.580282]  iov_iter_get_pages2+0x19/0x30
> >> 
> >> Presumably that's __iov_iter_get_pages_alloc() doing get_page() either in
> >> the " if (iov_iter_is_bvec(i)) " branch or via iter_folioq_get_pages()?

It's the bvec path:

                iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);

> >> Which doesn't work for a sub-size kmalloc() from a slab folio, which after
> >> the frozen refcount conversion no longer supports get_page().
> >> 
> >> The question is if this is a mistake specific for this path that's easy to
> >> fix or there are more paths that do this. At the very least the pinning of
> >> page through a kmalloc() allocation from it is useless - the object itself
> >> has to be kfree()'d and that would never happen through a put_page()
> >> reaching zero.
> >> 
> > Looks like a specific mistake.
> > tls_sw is the only user of sk_msg_zerocopy_from_iter()
> > (which is calling into __iov_iter_get_pages_alloc()).
> > 
> > And, more to the point, tls_sw messes up iov pacing coming in from
> > the upper layers.
> > So even if the upper layers send individual iovs (where each iov might
> > contain different allocation types), tls_sw is packing them together 
> > into full records. So it might end up with iovs having _different_ 
> > allocations.
> > Which would explain why we only see it with TLS, but not with normal
> > connections.

I thought we'd done all the work needed to get rid of these pointless
refcount bumps.  Turns out that's only on the block side (eg commit
e4cc64657bec).  So what does networking need in order to understand
that some iovecs do not need to mess with the refcount?
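
For readers decoding the dump above: the condition in the VM_BUG_ON_FOLIO()
line is true for a refcount of 0 and for the 127 values just below it, which
is why a frozen (refcount 0) slab folio trips it. A minimal user-space sketch
of that arithmetic follows; model_folio_get() is a made-up name for
illustration, not the kernel's folio_get().

```c
#include <assert.h>
#include <stdbool.h>

/* Same expression as in the quoted VM_BUG_ON_FOLIO() line. */
static bool ref_zero_or_close_to_overflow(int refcount)
{
	/* Unsigned wraparound maps the range -127..0 onto 0..127. */
	return (unsigned int)refcount + 127u <= 127u;
}

/* Hypothetical model: refuse to bump a refcount that the check rejects. */
static int model_folio_get(int refcount)
{
	assert(!ref_zero_or_close_to_overflow(refcount));
	return refcount + 1;
}
```

In this model a refcount of 1 is bumped to 2, while calling
model_folio_get(0) aborts, which corresponds to the BUG reported above.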



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-04 15:29                       ` Vlastimil Babka
@ 2025-03-04 16:20                         ` Hannes Reinecke
  0 siblings, 0 replies; 36+ messages in thread
From: Hannes Reinecke @ 2025-03-04 16:20 UTC (permalink / raw)
  To: Vlastimil Babka, Hannes Reinecke, Matthew Wilcox, Boris Pismenny,
	John Fastabend, Jakub Kicinski
  Cc: Sagi Grimberg, linux-nvme, linux-block, linux-mm, Harry Yoo, netdev

On 3/4/25 16:29, Vlastimil Babka wrote:
> On 3/4/25 11:26, Vlastimil Babka wrote:
>> +Cc NETWORKING [TLS] maintainers and netdev for input, thanks.
>>
>> The full error is here:
>> https://lore.kernel.org/all/fcfa11c6-2738-4a2e-baa8-09fa8f79cbf3@suse.de/
>>
>> On 3/4/25 11:20, Hannes Reinecke wrote:
>>> On 3/4/25 09:18, Vlastimil Babka wrote:
>>>> On 3/4/25 08:58, Hannes Reinecke wrote:
>>>>> On 3/3/25 23:02, Vlastimil Babka wrote:
>>>>>> On 3/3/25 17:15, Vlastimil Babka wrote:
>>>>>>> On 3/3/25 16:48, Matthew Wilcox wrote:
>>>>>>>> You need to turn on the debugging options Vlastimil mentioned and try to
>>>>>>>> figure out what nvme is doing wrong.
>>>>>>>
>>>>>>> Agree, looks like some error path going wrong?
>>>>>>> Since there seems to be actual non-large kmalloc usage involved, another
>>>>>>> debug parameter that could help: CONFIG_SLUB_DEBUG=y, and boot with
>>>>>>> "slab_debug=FZPU,kmalloc-*"
>>>>>>
>>>>>> Also make sure you have CONFIG_DEBUG_VM please.
>>>>>>
>>>>> Here you go:
>>>>>
>>>>> [  134.506802] page: refcount:0 mapcount:0 mapping:0000000000000000
>>>>> index:0x0 pfn:0x101ef8
>>>>> [  134.509253] head: order:3 mapcount:0 entire_mapcount:0
>>>>> nr_pages_mapped:0 pincount:0
>>>>> [  134.511594] flags:
>>>>> 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
>>>>> [  134.513556] page_type: f5(slab)
>>>>> [  134.513563] raw: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810
>>>>> ffff8881000402f0
>>>>> [  134.513568] raw: 0000000000000000 00000000000a000a 00000000f5000000
>>>>> 0000000000000000
>>>>> [  134.513572] head: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810
>>>>> ffff8881000402f0
>>>>> [  134.513575] head: 0000000000000000 00000000000a000a 00000000f5000000
>>>>> 0000000000000000
>>>>> [  134.513579] head: 0017ffffc0000003 ffffea000407be01 ffffffffffffffff
>>>>> 0000000000000000
>>>>> [  134.513583] head: 0000000000000008 0000000000000000 00000000ffffffff
>>>>> 0000000000000000
>>>>> [  134.513585] page dumped because: VM_BUG_ON_FOLIO(((unsigned int)
>>>>> folio_ref_count(folio) + 127u <= 127u))
>>>>> [  134.513615] ------------[ cut here ]------------
>>>>> [  134.529822] kernel BUG at ./include/linux/mm.h:1455!
>>>>
>>>> Yeah, just as I suspected, folio_get() says the refcount is 0.
>>>>
>>>>> [  134.529835] Oops: invalid opcode: 0000 [#1] PREEMPT SMP
>>>>> DEBUG_PAGEALLOC NOPTI
>>>>> [  134.529843] CPU: 0 UID: 0 PID: 274 Comm: kworker/0:1H Kdump: loaded
>>>>> Tainted: G            E      6.14.0-rc4-default+ #309
>>>>> 03b131f1ef70944969b40df9d90a283ed638556f
>>>>> [  134.536577] Tainted: [E]=UNSIGNED_MODULE
>>>>> [  134.536580] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>>>>> 0.0.0 02/06/2015
>>>>> [  134.536583] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
>>>>> [  134.536595] RIP: 0010:__iov_iter_get_pages_alloc+0x676/0x710
>>>>> [  134.542810] Code: e8 4c 39 e0 49 0f 47 c4 48 01 45 08 48 29 45 18 e9
>>>>> 90 fa ff ff 48 83 ef 01 e9 7f fe ff ff 48 c7 c6 40 57 4f 82 e8 6a e2 ce
>>>>> ff <0f> 0b e8 43 b8 b1 ff eb c5 f7 c1 ff 0f 00 00 48 89 cf 0f 85 4f ff
>>>>> [  134.542816] RSP: 0018:ffffc900004579d8 EFLAGS: 00010282
>>>>> [  134.542821] RAX: 000000000000005c RBX: ffffc90000457a90 RCX:
>>>>> 0000000000000027
>>>>> [  134.542825] RDX: 0000000000000000 RSI: 0000000000000002 RDI:
>>>>> ffff88817f423748
>>>>> [  134.542828] RBP: ffffc90000457d60 R08: 0000000000000000 R09:
>>>>> 0000000000000001
>>>>> [  134.554485] R10: ffffc900004579c0 R11: ffffc90000457720 R12:
>>>>> 0000000000000000
>>>>> [  134.554488] R13: ffffea000407be40 R14: ffffc90000457a70 R15:
>>>>> ffffc90000457d60
>>>>> [  134.554495] FS:  0000000000000000(0000) GS:ffff88817f400000(0000)
>>>>> knlGS:0000000000000000
>>>>> [  134.554499] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [  134.554502] CR2: 0000556b0675b600 CR3: 0000000106bd8000 CR4:
>>>>> 0000000000350ef0
>>>>> [  134.554509] Call Trace:
>>>>> [  134.554512]  <TASK>
>>>>> [  134.554516]  ? __die_body+0x1a/0x60
>>>>> [  134.554525]  ? die+0x38/0x60
>>>>> [  134.554531]  ? do_trap+0x10f/0x120
>>>>> [  134.554538]  ? __iov_iter_get_pages_alloc+0x676/0x710
>>>>> [  134.568839]  ? do_error_trap+0x64/0xa0
>>>>> [  134.568847]  ? __iov_iter_get_pages_alloc+0x676/0x710
>>>>> [  134.568855]  ? exc_invalid_op+0x53/0x60
>>>>> [  134.572489]  ? __iov_iter_get_pages_alloc+0x676/0x710
>>>>> [  134.572496]  ? asm_exc_invalid_op+0x16/0x20
>>>>> [  134.572512]  ? __iov_iter_get_pages_alloc+0x676/0x710
>>>>> [  134.576726]  ? __iov_iter_get_pages_alloc+0x676/0x710
>>>>> [  134.576733]  ? srso_return_thunk+0x5/0x5f
>>>>> [  134.576740]  ? ___slab_alloc+0x924/0xb60
>>>>> [  134.580253]  ? mempool_alloc_noprof+0x41/0x190
>>>>> [  134.580262]  ? tls_get_rec+0x3d/0x1b0 [tls
>>>>> 47f199c97f69357468c91efdbba24395e9dbfa77]
>>>>> [  134.580282]  iov_iter_get_pages2+0x19/0x30
>>>>
>>>> Presumably that's __iov_iter_get_pages_alloc() doing get_page() either in
>>>> the " if (iov_iter_is_bvec(i)) " branch or via iter_folioq_get_pages()?
>>>>
>>> Looks like it.
>>>
>>>> Which doesn't work for a sub-size kmalloc() from a slab folio, which after
>>>> the frozen refcount conversion no longer supports get_page().
>>>>
>>>> The question is if this is a mistake specific for this path that's easy to
>>>> fix or there are more paths that do this. At the very least the pinning of
>>>> page through a kmalloc() allocation from it is useless - the object itself
>>>> has to be kfree()'d and that would never happen through a put_page()
>>>> reaching zero.
>>>>
>>> Looks like a specific mistake.
>>> tls_sw is the only user of sk_msg_zerocopy_from_iter()
>>> (which is calling into __iov_iter_get_pages_alloc()).
> 
> That's from tls_sw_sendmsg_locked(), right? But that's under:
> 
> if (!is_kvec && (full_record || eor) && !async_capable) {
> 
> Shouldn't is_kvec be true if we're dealing a kernel buffer (kmalloc()) there?
> 
Yes, and no.

We're initializing the iter in nvme_tcp_try_send_data():

		bvec_set_page(&bvec, page, len, offset);
		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);

and 'page' is coming from the bio bvec. So even though bv_page may refer
to a kmalloc()ed page, the iterator is still a bvec, not a kvec, so
is_kvec stays false.
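
A small user-space sketch of why that check does not help here, assuming
hypothetical model_* types (the real struct iov_iter carries no 'backing'
field; it exists in the model only to make the mismatch visible): the
condition looks at the iterator type, not at where the memory came from.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the kernel's struct iov_iter is richer. */
enum model_iter_type { MODEL_ITER_KVEC, MODEL_ITER_BVEC };
enum model_backing { MODEL_PAGE_ALLOCATOR, MODEL_SLAB };

struct model_iter {
	enum model_iter_type type;  /* what the iterator claims to be */
	enum model_backing backing; /* where the memory really came from */
};

/* Mirrors the condition quoted above from tls_sw_sendmsg_locked(). */
static bool model_takes_zerocopy_path(const struct model_iter *it,
				      bool full_record, bool eor,
				      bool async_capable)
{
	bool is_kvec;

	assert(it);
	is_kvec = it->type == MODEL_ITER_KVEC;

	/* Note that it->backing is never consulted. */
	return !is_kvec && (full_record || eor) && !async_capable;
}
```

A bvec iterator over slab memory passes the check exactly like one over
ordinary pages, so the zerocopy path is entered either way.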

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-04 16:14                       ` Matthew Wilcox
@ 2025-03-04 16:32                         ` Hannes Reinecke
  2025-03-04 16:53                           ` Matthew Wilcox
  0 siblings, 1 reply; 36+ messages in thread
From: Hannes Reinecke @ 2025-03-04 16:32 UTC (permalink / raw)
  To: Matthew Wilcox, Vlastimil Babka
  Cc: Boris Pismenny, John Fastabend, Jakub Kicinski, Sagi Grimberg,
	linux-nvme, linux-block, linux-mm, Harry Yoo, netdev

On 3/4/25 17:14, Matthew Wilcox wrote:
> On Tue, Mar 04, 2025 at 11:26:07AM +0100, Vlastimil Babka wrote:
>> +Cc NETWORKING [TLS] maintainers and netdev for input, thanks.
>>
>> The full error is here:
>> https://lore.kernel.org/all/fcfa11c6-2738-4a2e-baa8-09fa8f79cbf3@suse.de/
>>
>> On 3/4/25 11:20, Hannes Reinecke wrote:
>>> On 3/4/25 09:18, Vlastimil Babka wrote:
>>>> On 3/4/25 08:58, Hannes Reinecke wrote:
>>>>> On 3/3/25 23:02, Vlastimil Babka wrote:
>>>>>> Also make sure you have CONFIG_DEBUG_VM please.
>>>>>>
>>>>> Here you go:
>>>>>
>>>>> [  134.506802] page: refcount:0 mapcount:0 mapping:0000000000000000
>>>>> index:0x0 pfn:0x101ef8
>>>>> [  134.509253] head: order:3 mapcount:0 entire_mapcount:0
>>>>> nr_pages_mapped:0 pincount:0
>>>>> [  134.511594] flags:
>>>>> 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
>>>>> [  134.513556] page_type: f5(slab)
>>>>> [  134.513563] raw: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810
>>>>> ffff8881000402f0
>>>>> [  134.513568] raw: 0000000000000000 00000000000a000a 00000000f5000000
>>>>> 0000000000000000
>>>>> [  134.513572] head: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810
>>>>> ffff8881000402f0
>>>>> [  134.513575] head: 0000000000000000 00000000000a000a 00000000f5000000
>>>>> 0000000000000000
>>>>> [  134.513579] head: 0017ffffc0000003 ffffea000407be01 ffffffffffffffff
>>>>> 0000000000000000
>>>>> [  134.513583] head: 0000000000000008 0000000000000000 00000000ffffffff
>>>>> 0000000000000000
>>>>> [  134.513585] page dumped because: VM_BUG_ON_FOLIO(((unsigned int)
>>>>> folio_ref_count(folio) + 127u <= 127u))
>>>>> [  134.513615] ------------[ cut here ]------------
>>>>> [  134.529822] kernel BUG at ./include/linux/mm.h:1455!
>>>>
>>>> Yeah, just as I suspected, folio_get() says the refcount is 0.
> 
> ... and it has a page_type of f5 (slab)
> 
>>>>> [  134.554509] Call Trace:
>>>>> [  134.580282]  iov_iter_get_pages2+0x19/0x30
>>>>
>>>> Presumably that's __iov_iter_get_pages_alloc() doing get_page() either in
>>>> the " if (iov_iter_is_bvec(i)) " branch or via iter_folioq_get_pages()?
> 
> It's the bvec path:
> 
>                  iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
> 
>>>> Which doesn't work for a sub-size kmalloc() from a slab folio, which after
>>>> the frozen refcount conversion no longer supports get_page().
>>>>
>>>> The question is if this is a mistake specific for this path that's easy to
>>>> fix or there are more paths that do this. At the very least the pinning of
>>>> page through a kmalloc() allocation from it is useless - the object itself
>>>> has to be kfree()'d and that would never happen through a put_page()
>>>> reaching zero.
>>>>
>>> Looks like a specific mistake.
>>> tls_sw is the only user of sk_msg_zerocopy_from_iter()
>>> (which is calling into __iov_iter_get_pages_alloc()).
>>>
>>> And, more to the point, tls_sw messes up iov pacing coming in from
>>> the upper layers.
>>> So even if the upper layers send individual iovs (where each iov might
>>> contain different allocation types), tls_sw is packing them together
>>> into full records. So it might end up with iovs having _different_
>>> allocations.
>>> Which would explain why we only see it with TLS, but not with normal
>>> connections.
> 
> I thought we'd done all the work needed to get rid of these pointless
> refcount bumps.  Turns out that's only on the block side (eg commit
> e4cc64657bec).  So what does networking need in order to understand
> that some iovecs do not need to mess with the refcount?

The network stack needs to get hold of the page while transmission is 
ongoing, as there is potentially rather deep queueing involved,
requiring several calls to sendmsg() and friends before the page is 
finally transmitted. And maybe some post-processing (checksums,
digests, you name it), too, all of which require the page to be there.

It's all so jumbled up ... personally, I would _love_ to do away with
__iov_iter_get_pages_alloc(). Allocating a page array? Seriously?

And the problem with that is that it always takes a page(!) reference,
completely oblivious to whether you even _can_ take a page 
reference (eg for tail pages); we've hit this problem several times now
(check for sendpage_ok() ...).

But that's not the real issue; the real issue is that the page reference is
taken down in the very bowels of __iov_iter_get_pages_alloc(), but needs
to be undone by the _caller_. Who might (or might not) have an idea
that he needs to drop the reference here.
That's why there is no straightforward conversion; you need to audit
each and every caller and try to find out where the page reference (if 
any) is dropped.
Bah.

Can't we (at the very least) leave it to the caller of
__iov_iter_get_pages_alloc() to get a page reference (he has access to the
page array, after all ...)? That would make the interface slightly
better, and it'll be far more obvious to the caller what needs
to be done.
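
A rough user-space sketch of the split suggested above, assuming
hypothetical model_* names (none of these exist in the kernel): the
collection step only fills the page array, and the caller takes and later
drops the references, remembering per page whether it took one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_page {
	int refcount;
	bool is_slab; /* slab pages must not have their refcount touched */
};

/* Collect page pointers only; no reference is taken here. */
static size_t model_collect_pages(struct model_page *first, size_t n,
				  struct model_page **out)
{
	for (size_t k = 0; k < n; k++)
		out[k] = first + k;
	return n;
}

/* The caller decides which pages get a reference and records it. */
static size_t model_caller_get(struct model_page **pages, size_t n,
			       bool *took_ref)
{
	size_t taken = 0;

	for (size_t k = 0; k < n; k++) {
		took_ref[k] = !pages[k]->is_slab;
		if (took_ref[k]) {
			pages[k]->refcount++;
			taken++;
		}
	}
	return taken;
}

/* Drop exactly the references that were taken, nothing more. */
static void model_caller_put(struct model_page **pages, size_t n,
			     const bool *took_ref)
{
	for (size_t k = 0; k < n; k++) {
		if (!took_ref[k])
			continue;
		assert(pages[k]->refcount > 0);
		pages[k]->refcount--;
	}
}
```

The point of the model is that get and put sit in the same place, so the
pairing is visible to whoever reads the caller.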

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.com                               +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-04 16:32                         ` Hannes Reinecke
@ 2025-03-04 16:53                           ` Matthew Wilcox
  2025-03-04 18:05                             ` Matthew Wilcox
  0 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2025-03-04 16:53 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Vlastimil Babka, Boris Pismenny, John Fastabend, Jakub Kicinski,
	Sagi Grimberg, linux-nvme, linux-block, linux-mm, Harry Yoo,
	netdev

On Tue, Mar 04, 2025 at 05:32:32PM +0100, Hannes Reinecke wrote:
> On 3/4/25 17:14, Matthew Wilcox wrote:
> > I thought we'd done all the work needed to get rid of these pointless
> > refcount bumps.  Turns out that's only on the block side (eg commit
> > e4cc64657bec).  So what does networking need in order to understand
> > that some iovecs do not need to mess with the refcount?
> 
> The network stack needs to get hold of the page while transmission is
> ongoing, as there is potentially rather deep queueing involved,
> requiring several calls to sendmsg() and friends before the page is finally
> transmitted. And maybe some post-processing (checksums,
> digests, you name it), too, all of which require the page to be there.
> 
> It's all so jumbled up ... personally, I would _love_ to do away with
> __iov_iter_get_pages_alloc(). Allocating a page array? Seriously?
> 
> And the problem with that is that it's always takes a page(!) reference,
> completely oblivious to the fact whether you even _can_ take a page
> reference (eg for tail pages); we've hit this problem several times now
> (check for sendpage_ok() ...).

Calling get_page() / put_page() on a tail page is fine -- that just
redirects to the head page.  But calling it on a slab never made any
sense; at best it gets you the equivalent of TYPESAFE_BY_RCU -- that is,
the object can be freed and reallocated, but the underlying slab will
not be reallocated to some other purpose.
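
To illustrate the tail-page point, here is a small user-space model
(hypothetical model_* names, not the kernel's struct page): every page
carries a pointer to its head page, and get/put always operate on the
head's refcount.

```c
#include <assert.h>

struct model_cpage {
	struct model_cpage *head; /* points to itself for a head page */
	int refcount;             /* only meaningful on the head page */
};

static struct model_cpage *model_compound_head(struct model_cpage *page)
{
	return page->head;
}

/* Taking a reference through a tail page bumps the head's count. */
static void model_get_page(struct model_cpage *page)
{
	struct model_cpage *head = model_compound_head(page);

	assert(head->refcount > 0);
	head->refcount++;
}

/* Dropping it goes through the same redirection. */
static void model_put_page(struct model_cpage *page)
{
	struct model_cpage *head = model_compound_head(page);

	assert(head->refcount > 0);
	head->refcount--;
}
```

With a frozen slab folio the head's count is 0, so the assertion in
model_get_page() fires in the model, which is the analogue of the BUG
discussed in this thread.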

> But that's not the real issue; real issue is that the page reference is
> taken down in the very bowels of __iov_iter_get_pages_alloc(), but needs
> to be undone by the _caller_. Who might (or might not) have an idea
> that he needs to drop the reference here.
> That's why there is no straightforward conversion; you need to audit
> each and every caller and try to find out where the page reference (if any)
> is dropped.
> Bah.
> 
> Can't we (at the very least) leave it to the caller of
> __iov_iter_get_pages() to get a page reference (he has access to the page
> array, after all ...)? That would make the interface slightly
> better, and it'll be far more obvious to the caller what needs
> to be done.

Right, that's what happened in the block layer.  We mark the bio with
BIO_PAGE_PINNED if the pincount needs to be dropped.  During the
transition, we also had BIO_PAGE_REFFED, which indicated that the page
refcount needed to be dropped.  Perhaps there's something similar that
networking could be doing.
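
As a sketch of what "something similar" might look like, assuming a made-up
request structure and flag name (MODEL_REQ_PAGE_REFFED is not an existing
kernel flag): whoever attaches the page records whether a reference was
taken, and the release path consults only that flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag, loosely modelled on the BIO_PAGE_REFFED idea. */
enum { MODEL_REQ_PAGE_REFFED = 1 << 0 };

struct model_req_page {
	int refcount;
	bool is_slab;
};

struct model_req {
	unsigned int flags;
	struct model_req_page *page;
};

/* Attach a page; take a reference only if the page supports it. */
static void model_req_attach(struct model_req *req,
			     struct model_req_page *page)
{
	req->page = page;
	req->flags = 0;
	if (!page->is_slab) {
		page->refcount++;
		req->flags |= MODEL_REQ_PAGE_REFFED;
	}
}

/* Release looks at the flag, never at the page type. */
static void model_req_release(struct model_req *req)
{
	if (req->flags & MODEL_REQ_PAGE_REFFED) {
		assert(req->page->refcount > 0);
		req->page->refcount--;
		req->flags &= ~(unsigned int)MODEL_REQ_PAGE_REFFED;
	}
	req->page = NULL;
}
```

Keeping the decision in one flag means the release side cannot disagree
with the attach side about whether a reference exists.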



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-04 16:53                           ` Matthew Wilcox
@ 2025-03-04 18:05                             ` Matthew Wilcox
  2025-03-04 18:31                               ` Vlastimil Babka
  2025-03-04 19:39                               ` Hannes Reinecke
  0 siblings, 2 replies; 36+ messages in thread
From: Matthew Wilcox @ 2025-03-04 18:05 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Vlastimil Babka, Boris Pismenny, John Fastabend, Jakub Kicinski,
	Sagi Grimberg, linux-nvme, linux-block, linux-mm, Harry Yoo,
	netdev

On Tue, Mar 04, 2025 at 04:53:09PM +0000, Matthew Wilcox wrote:
> Right, that's what happened in the block layer.  We mark the bio with
> BIO_PAGE_PINNED if the pincount needs to be dropped.  As a transitional
> period, we had BIO_PAGE_REFFED which indicated that the page refcount
> needed to be dropped.  Perhaps there's something similar that network
> could be doing.

Until that time ... how does this look as a quick hack to avoid
reverting the slab change?

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d6fed25243c3..ca08a923ac6d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1520,7 +1520,10 @@ static inline void folio_get(struct folio *folio)
 
 static inline void get_page(struct page *page)
 {
-	folio_get(page_folio(page));
+	struct folio *folio = page_folio(page);
+	if (WARN_ON_ONCE(folio_test_slab(folio)))
+		return;
+	folio_get(folio);
 }
 
 static inline __must_check bool try_get_page(struct page *page)
@@ -1614,6 +1617,8 @@ static inline void put_page(struct page *page)
 {
 	struct folio *folio = page_folio(page);
 
+	if (folio_test_slab(folio))
+		return;
 	folio_put(folio);
 }
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 65f550cb5081..8c7fdb7d8c8f 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1190,8 +1190,12 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		if (!n)
 			return -ENOMEM;
 		p = *pages;
-		for (int k = 0; k < n; k++)
-			get_page(p[k] = page + k);
+		for (int k = 0; k < n; k++) {
+			struct folio *folio = page_folio(page + k);
+			p[k] = page + k;
+			if (!folio_test_slab(folio))
+				folio_get(folio);
+		}
 		maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
 		i->count -= maxsize;
 		i->iov_offset += maxsize;



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-04 18:05                             ` Matthew Wilcox
@ 2025-03-04 18:31                               ` Vlastimil Babka
  2025-03-04 19:39                               ` Hannes Reinecke
  1 sibling, 0 replies; 36+ messages in thread
From: Vlastimil Babka @ 2025-03-04 18:31 UTC (permalink / raw)
  To: Matthew Wilcox, Hannes Reinecke
  Cc: Boris Pismenny, John Fastabend, Jakub Kicinski, Sagi Grimberg,
	linux-nvme, linux-block, linux-mm, Harry Yoo, netdev

On 3/4/25 19:05, Matthew Wilcox wrote:
> On Tue, Mar 04, 2025 at 04:53:09PM +0000, Matthew Wilcox wrote:
>> Right, that's what happened in the block layer.  We mark the bio with
>> BIO_PAGE_PINNED if the pincount needs to be dropped.  As a transitional
>> period, we had BIO_PAGE_REFFED which indicated that the page refcount
>> needed to be dropped.  Perhaps there's something similar that network
>> could be doing.
> 
> Until that time ... how does this look as a quick hack to avoid
> reverting the slab change?
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d6fed25243c3..ca08a923ac6d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1520,7 +1520,10 @@ static inline void folio_get(struct folio *folio)
>  
>  static inline void get_page(struct page *page)
>  {
> -	folio_get(page_folio(page));
> +	struct folio *folio = page_folio(page);
> +	if (WARN_ON_ONCE(folio_test_slab(folio)))
> +		return;
> +	folio_get(folio);

Maybe we could do just VM_WARN_ON_ONCE if we fix the known offender fully.

>  }
>  
>  static inline __must_check bool try_get_page(struct page *page)
> @@ -1614,6 +1617,8 @@ static inline void put_page(struct page *page)
>  {
>  	struct folio *folio = page_folio(page);
>  
> +	if (folio_test_slab(folio))
> +		return;

And that means knowing where the put_page() side is and skipping it for slab
pages like for the get_page() side below. Hannes could you apply this patch
but also with WARN_ON_ONCE here in put_page()? And tell us where the
put_page() splats. Thanks!

>  	folio_put(folio);
>  }
>  
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 65f550cb5081..8c7fdb7d8c8f 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1190,8 +1190,12 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		if (!n)
>  			return -ENOMEM;
>  		p = *pages;
> -		for (int k = 0; k < n; k++)
> -			get_page(p[k] = page + k);
> +		for (int k = 0; k < n; k++) {
> +			struct folio *folio = page_folio(page);
> +			p[k] = page + k;
> +			if (!folio_test_slab(folio))
> +				folio_get(folio);
> +		}
>  		maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
>  		i->count -= maxsize;
>  		i->iov_offset += maxsize;




* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-04 18:05                             ` Matthew Wilcox
  2025-03-04 18:31                               ` Vlastimil Babka
@ 2025-03-04 19:39                               ` Hannes Reinecke
  2025-03-04 19:44                                 ` Vlastimil Babka
  1 sibling, 1 reply; 36+ messages in thread
From: Hannes Reinecke @ 2025-03-04 19:39 UTC (permalink / raw)
  To: Matthew Wilcox, Hannes Reinecke
  Cc: Vlastimil Babka, Boris Pismenny, John Fastabend, Jakub Kicinski,
	Sagi Grimberg, linux-nvme, linux-block, linux-mm, Harry Yoo,
	netdev

On 3/4/25 19:05, Matthew Wilcox wrote:
> On Tue, Mar 04, 2025 at 04:53:09PM +0000, Matthew Wilcox wrote:
>> Right, that's what happened in the block layer.  We mark the bio with
>> BIO_PAGE_PINNED if the pincount needs to be dropped.  As a transitional
>> period, we had BIO_PAGE_REFFED which indicated that the page refcount
>> needed to be dropped.  Perhaps there's something similar that network
>> could be doing.
> 
> Until that time ... how does this look as a quick hack to avoid
> reverting the slab change?
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d6fed25243c3..ca08a923ac6d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1520,7 +1520,10 @@ static inline void folio_get(struct folio *folio)
>   
>   static inline void get_page(struct page *page)
>   {
> -	folio_get(page_folio(page));
> +	struct folio *folio = page_folio(page);
> +	if (WARN_ON_ONCE(folio_test_slab(folio)))
> +		return;
> +	folio_get(folio);
>   }
>   
>   static inline __must_check bool try_get_page(struct page *page)
> @@ -1614,6 +1617,8 @@ static inline void put_page(struct page *page)
>   {
>   	struct folio *folio = page_folio(page);
>   
> +	if (folio_test_slab(folio))
> +		return;
>   	folio_put(folio);
>   }
>   
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 65f550cb5081..8c7fdb7d8c8f 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1190,8 +1190,12 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>   		if (!n)
>   			return -ENOMEM;
>   		p = *pages;
> -		for (int k = 0; k < n; k++)
> -			get_page(p[k] = page + k);
> +		for (int k = 0; k < n; k++) {
> +			struct folio *folio = page_folio(page);
> +			p[k] = page + k;
> +			if (!folio_test_slab(folio))
> +				folio_get(folio);
> +		}
>   		maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
>   		i->count -= maxsize;
>   		i->iov_offset += maxsize;
> 

Good news and bad news ...
Good news: TLS works again!
Bad news: no errors.

Question to the wise: this is not the only place in iov_iter.c where we 
do a 'get_page()'. Do we leave them and wait for others to report 
regressions, knowing full well that the current code _has_ issues?
Or shouldn't we rather clean them up?

I guess the real fix would be to fiddle with the 'bio_add_page()' logic;
we are always adding a 'page' reference to the bio, completely ignoring
whether this page is a slab page or a normal one.

Discussion at LSF, maybe?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-04 19:39                               ` Hannes Reinecke
@ 2025-03-04 19:44                                 ` Vlastimil Babka
  2025-03-05  7:14                                   ` Hannes Reinecke
  2025-03-05  8:20                                   ` Hannes Reinecke
  0 siblings, 2 replies; 36+ messages in thread
From: Vlastimil Babka @ 2025-03-04 19:44 UTC (permalink / raw)
  To: Hannes Reinecke, Matthew Wilcox, Hannes Reinecke
  Cc: Boris Pismenny, John Fastabend, Jakub Kicinski, Sagi Grimberg,
	linux-nvme, linux-block, linux-mm, Harry Yoo, netdev

On 3/4/25 20:39, Hannes Reinecke wrote:
> On 3/4/25 19:05, Matthew Wilcox wrote:
>> On Tue, Mar 04, 2025 at 04:53:09PM +0000, Matthew Wilcox wrote:
>>> Right, that's what happened in the block layer.  We mark the bio with
>>> BIO_PAGE_PINNED if the pincount needs to be dropped.  As a transitional
>>> period, we had BIO_PAGE_REFFED which indicated that the page refcount
>>> needed to be dropped.  Perhaps there's something similar that network
>>> could be doing.
>> 
>> Until that time ... how does this look as a quick hack to avoid
>> reverting the slab change?
>> 
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index d6fed25243c3..ca08a923ac6d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1520,7 +1520,10 @@ static inline void folio_get(struct folio *folio)
>>   
>>   static inline void get_page(struct page *page)
>>   {
>> -	folio_get(page_folio(page));
>> +	struct folio *folio = page_folio(page);
>> +	if (WARN_ON_ONCE(folio_test_slab(folio)))
>> +		return;
>> +	folio_get(folio);
>>   }
>>   
>>   static inline __must_check bool try_get_page(struct page *page)
>> @@ -1614,6 +1617,8 @@ static inline void put_page(struct page *page)
>>   {
>>   	struct folio *folio = page_folio(page);
>>   
>> +	if (folio_test_slab(folio))
>> +		return;
>>   	folio_put(folio);
>>   }
>>   
>> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
>> index 65f550cb5081..8c7fdb7d8c8f 100644
>> --- a/lib/iov_iter.c
>> +++ b/lib/iov_iter.c
>> @@ -1190,8 +1190,12 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>>   		if (!n)
>>   			return -ENOMEM;
>>   		p = *pages;
>> -		for (int k = 0; k < n; k++)
>> -			get_page(p[k] = page + k);
>> +		for (int k = 0; k < n; k++) {
>> +			struct folio *folio = page_folio(page);
>> +			p[k] = page + k;
>> +			if (!folio_test_slab(folio))
>> +				folio_get(folio);
>> +		}
>>   		maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
>>   		i->count -= maxsize;
>>   		i->iov_offset += maxsize;
>> 
> 
> Good news and bad news ...
> Good news: TLS works again!
> Bad news: no errors.

Wait, did you add a WARN_ON_ONCE() to the put_page() as I suggested? If yes
and there was no error, it would have to be leaking the page. Or the path
uses folio_put() and we'd need to put the warning there.

> Question to the wise: this is not the only place in iov_iter.c where we 
> do a 'get_page()'. Do we leave them and wait for others to report 
> regressions, knowing fully well that the current code _has_ issues?
> Or shouldn't we rather clean them up?
> 
> I guess the real fix would be to fiddle with the 'bio_add_page()' logic;
> we are always adding a 'page' reference to the bio, completely ignoring
> whether this page is a slab page or a normal one.
> 
> Discussion at LSF, maybe?
> 
> Cheers,
> 
> Hannes




* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-04 19:44                                 ` Vlastimil Babka
@ 2025-03-05  7:14                                   ` Hannes Reinecke
  2025-03-05  8:20                                   ` Hannes Reinecke
  1 sibling, 0 replies; 36+ messages in thread
From: Hannes Reinecke @ 2025-03-05  7:14 UTC (permalink / raw)
  To: Vlastimil Babka, Matthew Wilcox, Hannes Reinecke
  Cc: Boris Pismenny, John Fastabend, Jakub Kicinski, Sagi Grimberg,
	linux-nvme, linux-block, linux-mm, Harry Yoo, netdev

On 3/4/25 20:44, Vlastimil Babka wrote:
> On 3/4/25 20:39, Hannes Reinecke wrote:
>> On 3/4/25 19:05, Matthew Wilcox wrote:
>>> On Tue, Mar 04, 2025 at 04:53:09PM +0000, Matthew Wilcox wrote:
>>>> Right, that's what happened in the block layer.  We mark the bio with
>>>> BIO_PAGE_PINNED if the pincount needs to be dropped.  As a transitional
>>>> period, we had BIO_PAGE_REFFED which indicated that the page refcount
>>>> needed to be dropped.  Perhaps there's something similar that network
>>>> could be doing.
>>>
>>> Until that time ... how does this look as a quick hack to avoid
>>> reverting the slab change?
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index d6fed25243c3..ca08a923ac6d 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1520,7 +1520,10 @@ static inline void folio_get(struct folio *folio)
>>>    
>>>    static inline void get_page(struct page *page)
>>>    {
>>> -	folio_get(page_folio(page));
>>> +	struct folio *folio = page_folio(page);
>>> +	if (WARN_ON_ONCE(folio_test_slab(folio)))
>>> +		return;
>>> +	folio_get(folio);
>>>    }
>>>    
>>>    static inline __must_check bool try_get_page(struct page *page)
>>> @@ -1614,6 +1617,8 @@ static inline void put_page(struct page *page)
>>>    {
>>>    	struct folio *folio = page_folio(page);
>>>    
>>> +	if (folio_test_slab(folio))
>>> +		return;
>>>    	folio_put(folio);
>>>    }
>>>    
>>> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
>>> index 65f550cb5081..8c7fdb7d8c8f 100644
>>> --- a/lib/iov_iter.c
>>> +++ b/lib/iov_iter.c
>>> @@ -1190,8 +1190,12 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>>>    		if (!n)
>>>    			return -ENOMEM;
>>>    		p = *pages;
>>> -		for (int k = 0; k < n; k++)
>>> -			get_page(p[k] = page + k);
>>> +		for (int k = 0; k < n; k++) {
>>> +			struct folio *folio = page_folio(page + k);
>>> +			p[k] = page + k;
>>> +			if (!folio_test_slab(folio))
>>> +				folio_get(folio);
>>> +		}
>>>    		maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
>>>    		i->count -= maxsize;
>>>    		i->iov_offset += maxsize;
>>>
>>
>> Good news and bad news ...
>> Good news: TLS works again!
>> Bad news: no errors.
> 
> Wait, did you add a WARN_ON_ONCE() to the put_page() as I suggested? If yes
> and there was no error, it would have to be leaking the page. Or the path
> uses folio_put() and we'd need to put the warning there.
> 
Oh, no, I didn't. Just added the WARN_ON to get_page().
Let me try ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-04 19:44                                 ` Vlastimil Babka
  2025-03-05  7:14                                   ` Hannes Reinecke
@ 2025-03-05  8:20                                   ` Hannes Reinecke
  2025-03-05  8:58                                     ` Vlastimil Babka
  1 sibling, 1 reply; 36+ messages in thread
From: Hannes Reinecke @ 2025-03-05  8:20 UTC (permalink / raw)
  To: Vlastimil Babka, Hannes Reinecke, Matthew Wilcox
  Cc: Boris Pismenny, John Fastabend, Jakub Kicinski, Sagi Grimberg,
	linux-nvme, linux-block, linux-mm, Harry Yoo, netdev

On 3/4/25 20:44, Vlastimil Babka wrote:
> On 3/4/25 20:39, Hannes Reinecke wrote:
[ .. ]
>>
>> Good news and bad news ...
>> Good news: TLS works again!
>> Bad news: no errors.
> 
> Wait, did you add a WARN_ON_ONCE() to the put_page() as I suggested? If yes
> and there was no error, it would have to be leaking the page. Or the path
> uses folio_put() and we'd need to put the warning there.
> 
That triggers:
[   42.364339] page dumped because: VM_WARN_ON_FOLIO(folio_test_slab(folio))
[   42.364379] ------------[ cut here ]------------
[   42.375500] WARNING: CPU: 0 PID: 236 at ./include/linux/mm.h:1564 sk_msg_free_elem+0x157/0x180
[   42.375642] Modules linked in: tls(E) nvme_tcp(E) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) xfs(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) iTCO_wdt(E) intel_pmc_bxt(E) intel_rapl_msr(E) iTCO_vendor_support(E) intel_rapl_common(E) i2c_i801(E) bnxt_en(E) i2c_mux(E) lpc_ich(E) mfd_core(E) i2c_smbus(E) virtio_balloon(E) joydev(E) button(E) nvme_fabrics(E) nvme_keyring(E) nvme_core(E) fuse(E) nvme_auth(E) efi_pstore(E) configfs(E) dmi_sysfs(E) ip_tables(E) x_tables(E) hid_generic(E) usbhid(E) ahci(E) libahci(E) libata(E) virtio_scsi(E) sd_mod(E) scsi_dh_emc(E) scsi_dh_rdac(E) scsi_dh_alua(E) qxl(E) sg(E) ghash_clmulni_intel(E) xhci_pci(E) drm_client_lib(E) drm_exec(E) drm_ttm_helper(E) sha512_ssse3(E) xhci_hcd(E) ttm(E) sha256_ssse3(E) drm_kms_helper(E) scsi_mod(E) sha1_ssse3(E) usbcore(E) scsi_common(E) drm(E) serio_raw(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) efivarfs(E) qemu_fw_cfg(E) virtio_rng(E) aesni_intel(E) crypto_simd(E) cryptd(E)
[   42.393292] CPU: 0 UID: 0 PID: 236 Comm: kworker/0:1H Kdump: loaded Tainted: G            E      6.14.0-rc4-default+ #316 cadaa81909a6170d00e1f47f3fc0db03c6a03650
[   42.393303] Tainted: [E]=UNSIGNED_MODULE
[   42.393305] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[   42.393310] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
[   42.393323] RIP: 0010:sk_msg_free_elem+0x157/0x180
[   42.393331] Code: ff 48 c7 c6 d0 42 4f 82 48 89 ef e8 b3 63 8a ff 0f 0b 48 8d 6a ff e9 6c ff ff ff 48 c7 c6 a0 42 4f 82 48 89 ef e8 99 63 8a ff <0f> 0b e9 c7 fe ff ff 2b 87 78 01 00 00 8b 97 c0 00 00 00 29 d0 ba
[   42.393336] RSP: 0018:ffffc9000040b798 EFLAGS: 00010282
[   42.393341] RAX: 000000000000003d RBX: ffff888110ab0858 RCX: 0000000000000027
[   42.393344] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff88817f423748
[   42.393347] RBP: ffffea0004295e00 R08: 0000000000000000 R09: 0000000000000001
[   42.393350] R10: ffffc9000040b780 R11: ffffc9000040b4e0 R12: 0000000000000400
[   42.393353] R13: ffff888110ab0818 R14: 0000000000000002 R15: ffff88810fa669d8
[   42.393361] FS:  0000000000000000(0000) GS:ffff88817f400000(0000) knlGS:0000000000000000
[   42.393365] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   42.393369] CR2: 00007f56a6ea6da4 CR3: 000000011bfc0000 CR4: 0000000000350ef0
[   42.416071] Call Trace:
[   42.416078]  <TASK>
[   42.416084]  ? __warn+0x85/0x130
[   42.416095]  ? sk_msg_free_elem+0x157/0x180
[   42.418893]  ? report_bug+0xf8/0x1e0
[   42.418904]  ? handle_bug+0x50/0xa0
[   42.418910]  ? exc_invalid_op+0x13/0x60
[   42.418916]  ? asm_exc_invalid_op+0x16/0x20
[   42.418935]  ? sk_msg_free_elem+0x157/0x180
[   42.423206]  ? sk_msg_free_elem+0x157/0x180
[   42.423215]  __sk_msg_free+0x4f/0x100
[   42.423224]  tls_tx_records+0x118/0x190 [tls 80cce2d02933ba636eb5845a829121ac309b44ed]
[   42.426506]  bpf_exec_tx_verdict+0x249/0x5e0 [tls 80cce2d02933ba636eb5845a829121ac309b44ed]
[   42.426519]  ? srso_return_thunk+0x5/0x5f
[   42.426526]  ? __pfx_stack_trace_consume_entry+0x10/0x10
[   42.426572]  tls_sw_sendmsg+0x72f/0x9f0 [tls 80cce2d02933ba636eb5845a829121ac309b44ed]
[   42.432016]  __sock_sendmsg+0x98/0xc0
[   42.432025]  sock_sendmsg+0x5c/0xa0
[   42.432030]  ? srso_return_thunk+0x5/0x5f
[   42.432034]  ? __sock_sendmsg+0x98/0xc0
[   42.432040]  ? srso_return_thunk+0x5/0x5f
[   42.436134]  ? sock_sendmsg+0x5c/0xa0
[   42.436146]  nvme_tcp_try_send_data+0x13f/0x410 [nvme_tcp 9f4f1c84141d3edfcd3e478eb7c2fb638b4a92b3]
[   42.436159]  ? srso_return_thunk+0x5/0x5f
[   42.439452]  ? sched_balance_newidle+0x2f6/0x400
[   42.439468]  nvme_tcp_try_send+0x299/0x330 [nvme_tcp 9f4f1c84141d3edfcd3e478eb7c2fb638b4a92b3]
[   42.439479]  nvme_tcp_io_work+0x37/0xb0 [nvme_tcp 9f4f1c84141d3edfcd3e478eb7c2fb638b4a92b3]
[   42.443603]  process_scheduled_works+0x97/0x400
[   42.443614]  ? __pfx_worker_thread+0x10/0x10
[   42.443619]  worker_thread+0x105/0x240
[   42.443625]  ? __pfx_worker_thread+0x10/0x10
[   42.443630]  kthread+0xec/0x200
[   42.443639]  ? __pfx_kthread+0x10/0x10
[   42.443646]  ret_from_fork+0x30/0x50
[   42.443652]  ? __pfx_kthread+0x10/0x10
[   42.443658]  ret_from_fork_asm+0x1a/0x30
[   42.451127]  </TASK>
[   42.451131] ---[ end trace 0000000000000000 ]---

Not surprisingly, though, as the original code did a get_page(), so
there had to be a corresponding put_page() somewhere.
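
The pairing problem can be modelled in user space. What follows is only a
minimal sketch, not kernel code: struct model_folio and the model_*()
helpers are hypothetical names made up for illustration, and the refcount
is a plain int rather than an atomic. It shows that once the "get" side
skips slab memory, an unguarded "put" drops a reference that was never
taken, while a guarded "put" leaves the count alone.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio: a refcount plus a "slab" marker. */
struct model_folio {
	int refcount;
	bool is_slab;
};

/* Models the patched get_page(): slab memory is never refcounted. */
static void model_get(struct model_folio *f)
{
	if (f->is_slab)
		return;
	f->refcount++;
}

/* Models an unpatched put path: always drops a reference. */
static void model_put_unguarded(struct model_folio *f)
{
	f->refcount--;
}

/* Models a put path with the same slab check as the get side. */
static void model_put_guarded(struct model_folio *f)
{
	if (f->is_slab)
		return;
	f->refcount--;
}
```

With a slab-backed model_folio starting at refcount 1, model_get() followed
by model_put_unguarded() ends at 0, whereas model_get() followed by
model_put_guarded() stays at 1.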

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.com                               +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-05  8:20                                   ` Hannes Reinecke
@ 2025-03-05  8:58                                     ` Vlastimil Babka
  2025-03-05 11:43                                       ` Hannes Reinecke
  0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2025-03-05  8:58 UTC (permalink / raw)
  To: Hannes Reinecke, Hannes Reinecke, Matthew Wilcox
  Cc: Boris Pismenny, John Fastabend, Jakub Kicinski, Sagi Grimberg,
	linux-nvme, linux-block, linux-mm, Harry Yoo, netdev

On 3/5/25 09:20, Hannes Reinecke wrote:
> On 3/4/25 20:44, Vlastimil Babka wrote:
>> On 3/4/25 20:39, Hannes Reinecke wrote:
> [ .. ]
>>>
>>> Good news and bad news ...
>>> Good news: TLS works again!
>>> Bad news: no errors.
>> 
>> Wait, did you add a WARN_ON_ONCE() to the put_page() as I suggested? If yes
>> and there was no error, it would have to be leaking the page. Or the path
>> uses folio_put() and we'd need to put the warning there.
>> 
> That triggers:
...
> Not surprisingly, though, as the original code did a get_page(), so
> there had to be a corresponding put_page() somewhere.

Is it this one? If there's no more warning afterwards, that should be it.

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 61f3f3d4e528..b37d99cec069 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -182,9 +182,14 @@ static int sk_msg_free_elem(struct sock *sk, struct sk_msg *msg, u32 i,
 
        /* When the skb owns the memory we free it from consume_skb path. */
        if (!msg->skb) {
+               struct folio *folio;
+
                if (charge)
                        sk_mem_uncharge(sk, len);
-               put_page(sg_page(sge));
+
+               folio = page_folio(sg_page(sge));
+               if (!folio_test_slab(folio))
+                       folio_put(folio);
        }
        memset(sge, 0, sizeof(*sge));
        return len;




* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-05  8:58                                     ` Vlastimil Babka
@ 2025-03-05 11:43                                       ` Hannes Reinecke
  2025-03-05 18:11                                         ` Networking people smell funny and make poor life choices Matthew Wilcox
  2025-03-06  9:15                                         ` Kernel oops with 6.14 when enabling TLS Vlastimil Babka
  0 siblings, 2 replies; 36+ messages in thread
From: Hannes Reinecke @ 2025-03-05 11:43 UTC (permalink / raw)
  To: Vlastimil Babka, Hannes Reinecke, Matthew Wilcox
  Cc: Boris Pismenny, John Fastabend, Jakub Kicinski, Sagi Grimberg,
	linux-nvme, linux-block, linux-mm, Harry Yoo, netdev

On 3/5/25 09:58, Vlastimil Babka wrote:
> On 3/5/25 09:20, Hannes Reinecke wrote:
>> On 3/4/25 20:44, Vlastimil Babka wrote:
>>> On 3/4/25 20:39, Hannes Reinecke wrote:
>> [ .. ]
>>>>
>>>> Good news and bad news ...
>>>> Good news: TLS works again!
>>>> Bad news: no errors.
>>>
>>> Wait, did you add a WARN_ON_ONCE() to the put_page() as I suggested? If yes
>>> and there was no error, it would have to be leaking the page. Or the path
>>> uses folio_put() and we'd need to put the warning there.
>>>
>> That triggers:
> ...
>> Not surprisingly, though, as the original code did a get_page(), so
>> there had to be a corresponding put_page() somewhere.
> 
> Is it this one? If there's no more warning afterwards, that should be it.
> 
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 61f3f3d4e528..b37d99cec069 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -182,9 +182,14 @@ static int sk_msg_free_elem(struct sock *sk, struct sk_msg *msg, u32 i,
>   
>          /* When the skb owns the memory we free it from consume_skb path. */
>          if (!msg->skb) {
> +               struct folio *folio;
> +
>                  if (charge)
>                          sk_mem_uncharge(sk, len);
> -               put_page(sg_page(sge));
> +
> +               folio = page_folio(sg_page(sge));
> +               if (!folio_test_slab(folio))
> +                       folio_put(folio);
>          }
>          memset(sge, 0, sizeof(*sge));
>          return len;
> 
> 
Oh, sure. But what annoys me: why do we have to care?

When doing I/O _all_ data is stuffed into bvecs via
bio_add_page(), and after that information about the
origin is lost; any iteration on the bio will be a bvec
iteration.
Previously we could just do a bvec iteration, get a reference
for each page, and start processing.
Now suddenly the caller has to check if it's a slab page and not
take a reference for that. Not only that, he also has to remember
to _not_ drop the reference when he's done.
And, of course, trace get_page() and the corresponding put_page()
calls through all the layers.
Really?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Networking people smell funny and make poor life choices
  2025-03-05 11:43                                       ` Hannes Reinecke
@ 2025-03-05 18:11                                         ` Matthew Wilcox
  2025-03-06  0:46                                           ` Cong Wang
                                                             ` (2 more replies)
  2025-03-06  9:15                                         ` Kernel oops with 6.14 when enabling TLS Vlastimil Babka
  1 sibling, 3 replies; 36+ messages in thread
From: Matthew Wilcox @ 2025-03-05 18:11 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Vlastimil Babka, Hannes Reinecke, Boris Pismenny, John Fastabend,
	Jakub Kicinski, Sagi Grimberg, linux-nvme, linux-block, linux-mm,
	Harry Yoo, netdev

On Wed, Mar 05, 2025 at 12:43:02PM +0100, Hannes Reinecke wrote:
> Oh, sure. But what annoys me: why do we have to care?
> 
> When doing I/O _all_ data is stuffed into bvecs via
> bio_add_page(), and after that information about the
> origin is lost; any iteration on the bio will be a bvec
> iteration.
> Previously we could just do a bvec iteration, get a reference
> for each page, and start processing.
> Now suddenly the caller has to check if it's a slab page and don't
> get a reference for that. Not only that, he also has to remember
> to _not_ drop the reference when he's done.
> And, of course, tracing get_page() and the corresponding put_page()
> calls through all the layers.

Networking needs to follow block's lead and STOP GETTING REFCOUNTS ON
PAGES.  That will speed up networking (eliminates two atomic operations per
page).  And of course, it will eliminate this hack in the MM.  I think
we do need to put this hack into the MM for now, but it needs to go away
again as quickly as possible.

What worries me is that nobody in networking has replied to this thread
yet.  Do they not care?  Let's see if a subject line change will help
with that.



* Re: Networking people smell funny and make poor life choices
  2025-03-05 18:11                                         ` Networking people smell funny and make poor life choices Matthew Wilcox
@ 2025-03-06  0:46                                           ` Cong Wang
  2025-03-12 15:09                                           ` Christoph Hellwig
  2025-03-13  9:43                                           ` David Laight
  2 siblings, 0 replies; 36+ messages in thread
From: Cong Wang @ 2025-03-06  0:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hannes Reinecke, Vlastimil Babka, Hannes Reinecke,
	Boris Pismenny, John Fastabend, Jakub Kicinski, Sagi Grimberg,
	linux-nvme, linux-block, linux-mm, Harry Yoo, netdev

On Wed, Mar 05, 2025 at 06:11:24PM +0000, Matthew Wilcox wrote:
> On Wed, Mar 05, 2025 at 12:43:02PM +0100, Hannes Reinecke wrote:
> > Oh, sure. But what annoys me: why do we have to care?
> > 
> > When doing I/O _all_ data is stuffed into bvecs via
> > bio_add_page(), and after that information about the
> > origin is lost; any iteration on the bio will be a bvec
> > iteration.
> > Previously we could just do a bvec iteration, get a reference
> > for each page, and start processing.
> > Now suddenly the caller has to check if it's a slab page and don't
> > get a reference for that. Not only that, he also has to remember
> > to _not_ drop the reference when he's done.
> > And, of course, tracing get_page() and the corresponding put_page()
> > calls through all the layers.
> 
> Networking needs to follow block's lead and STOP GETTING REFCOUNTS ON
> PAGES.  That will speed up networking (eliminates two atomic operations per
> page).  And of course, it will eliminate this hack in the MM.  I think
> we do need to put this hack into the MM for now, but it needs to go away
> again as quickly as possible.
> 
> What worries me is that nobody in networking has replied to this thread
> yet.  Do they not care?  Let's see if a subject line change will help
> with that.

Since it triggered a kernel crash, I am pretty sure people care. How
about sending out a patch to get more attention?

I am not sure which patterns you are suggesting we change w.r.t. page
refcounts, but an AI copilot or some other automation tool should
come in very handy for that.

Thanks.



* Re: Kernel oops with 6.14 when enabling TLS
  2025-03-05 11:43                                       ` Hannes Reinecke
  2025-03-05 18:11                                         ` Networking people smell funny and make poor life choices Matthew Wilcox
@ 2025-03-06  9:15                                         ` Vlastimil Babka
  1 sibling, 0 replies; 36+ messages in thread
From: Vlastimil Babka @ 2025-03-06  9:15 UTC (permalink / raw)
  To: Hannes Reinecke, Hannes Reinecke, Matthew Wilcox
  Cc: Boris Pismenny, John Fastabend, Jakub Kicinski, Sagi Grimberg,
	linux-nvme, linux-block, linux-mm, Harry Yoo, netdev,
	David Howells

On 3/5/25 12:43, Hannes Reinecke wrote:
> On 3/5/25 09:58, Vlastimil Babka wrote:
>> On 3/5/25 09:20, Hannes Reinecke wrote:
>>> On 3/4/25 20:44, Vlastimil Babka wrote:
>>>> On 3/4/25 20:39, Hannes Reinecke wrote:
>>> [ .. ]
>>>>>
>>>>> Good news and bad news ...
>>>>> Good news: TLS works again!
>>>>> Bad news: no errors.
>>>>
>>>> Wait, did you add a WARN_ON_ONCE() to the put_page() as I suggested? If yes
>>>> and there was no error, it would have to be leaking the page. Or the path
>>>> uses folio_put() and we'd need to put the warning there.
>>>>
>>> That triggers:
>> ...
>>> Not surprisingly, though, as the original code did a get_page(), so
>>> there had to be a corresponding put_page() somewhere.
>> 
>> Is it this one? If there's no more warning afterwards, that should be it.
>> 
>> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> index 61f3f3d4e528..b37d99cec069 100644
>> --- a/net/core/skmsg.c
>> +++ b/net/core/skmsg.c
>> @@ -182,9 +182,14 @@ static int sk_msg_free_elem(struct sock *sk, struct sk_msg *msg, u32 i,
>>   
>>          /* When the skb owns the memory we free it from consume_skb path. */
>>          if (!msg->skb) {
>> +               struct folio *folio;
>> +
>>                  if (charge)
>>                          sk_mem_uncharge(sk, len);
>> -               put_page(sg_page(sge));
>> +
>> +               folio = page_folio(sg_page(sge));
>> +               if (!folio_test_slab(folio))
>> +                       folio_put(folio);
>>          }
>>          memset(sge, 0, sizeof(*sge));
>>          return len;
>> 
>> 
> Oh, sure. But what annoys me: why do we have to care?
> 
> When doing I/O _all_ data is stuffed into bvecs via
> bio_add_page(), and after that information about the
> origin is lost; any iteration on the bio will be a bvec
> iteration.
> Previously we could just do a bvec iteration, get a reference
> for each page, and start processing.

AFAIU there's BIO_PAGE_PINNED that controls whether the pages are pinned, as
there are use cases where it makes sense to do that (userspace pages?). And
__bio_release_pages() can be removing the last pin and freeing the pages.

But this is a case where the buffer is a kmalloc() allocation, so somebody
has to do the corresponding kfree() when the messages are processed. A pin
on the slab folio where the kmalloc() resides doesn't help at all and, as
willy says, it's just unnecessary overhead of atomic operations.

> Now suddenly the caller has to check if it's a slab page and don't
> get a reference for that. Not only that, he also has to remember
> to _not_ drop the reference when he's done.

The caller did kmalloc() and will have to do kfree(). I guess it's about
telling the intermediate layers, via something similar to BIO_PAGE_PINNED,
whether the pages should be pinned or not.
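
As a rough illustration of that idea, here is a user-space sketch in which
a per-message flag tells the intermediate layer whether it owns page
references at all. Everything in it is an assumption made for illustration:
MODEL_MSG_REFFED, struct model_msg and the model_msg_*() helpers are
invented names, not existing kernel interfaces, and the refcount is a plain
int. The point is only that the same flag gates both taking and dropping
the reference, so the two sides cannot get out of step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag, in the spirit of BIO_PAGE_PINNED. */
#define MODEL_MSG_REFFED 0x1u
#define MODEL_MAX_PAGES  4

struct model_page {
	int refcount;
};

struct model_msg {
	unsigned int flags;
	size_t nr_pages;
	struct model_page *pages[MODEL_MAX_PAGES];
};

static void model_msg_init(struct model_msg *msg, unsigned int flags)
{
	msg->flags = flags;
	msg->nr_pages = 0;
}

/* Attach a page; take a reference only if the message is marked REFFED. */
static bool model_msg_add_page(struct model_msg *msg, struct model_page *page)
{
	if (msg->nr_pages == MODEL_MAX_PAGES)
		return false;
	if (msg->flags & MODEL_MSG_REFFED)
		page->refcount++;
	msg->pages[msg->nr_pages++] = page;
	return true;
}

/* Detach all pages; drop references only if they were taken on add. */
static void model_msg_release(struct model_msg *msg)
{
	if (msg->flags & MODEL_MSG_REFFED) {
		for (size_t i = 0; i < msg->nr_pages; i++)
			msg->pages[i]->refcount--;
	}
	msg->nr_pages = 0;
}
```

A submitter passing a kmalloc()ed buffer would initialise the message with
flags == 0 and keep ownership of the memory itself; a submitter passing
refcounted pages would set MODEL_MSG_REFFED.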

> And, of course, tracing get_page() and the corresponding put_page()
> calls through all the layers.
> Really?
> 
> Cheers,
> 
> Hannes




* Re: Networking people smell funny and make poor life choices
  2025-03-05 18:11                                         ` Networking people smell funny and make poor life choices Matthew Wilcox
  2025-03-06  0:46                                           ` Cong Wang
@ 2025-03-12 15:09                                           ` Christoph Hellwig
  2025-03-12 18:28                                             ` James R. Bergsten
  2025-03-13  9:43                                           ` David Laight
  2 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2025-03-12 15:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hannes Reinecke, Vlastimil Babka, Hannes Reinecke,
	Boris Pismenny, John Fastabend, Jakub Kicinski, Sagi Grimberg,
	linux-nvme, linux-block, linux-mm, Harry Yoo, netdev

On Wed, Mar 05, 2025 at 06:11:24PM +0000, Matthew Wilcox wrote:
> Networking needs to follow block's lead and STOP GETTING REFCOUNTS ON
> PAGES.

The block layer never took references on pages.  The direct I/O helpers
that just happened to sit in block/ did hold references and abused some
field in the bio for it (and still do for the pinning), but the reference
was (and the pin now is) owned by the submitter.

The block layer model has always been that the submitter needs to ensure
memory stays allocated until the I/O has completed.  Which IMHO is the
only sane model for dealing with memory lifetimes vs I/O, and something
networking absolutely should follow.
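
A user-space sketch of that ownership model, under stated assumptions:
struct model_io, submitter_send(), submitter_done() and
model_transport_complete() are hypothetical names, and the "transport" is
reduced to a function that invokes the completion callback. The transport
never allocates, frees or refcounts the buffer; the submitter allocates it
before submission and releases it only from the completion callback.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical I/O descriptor owned by the submitter. */
struct model_io {
	void *buf;
	size_t len;
	int completed;
	void (*done)(struct model_io *io);
};

/* Submitter's completion handler: the only place the buffer is freed. */
static void submitter_done(struct model_io *io)
{
	free(io->buf);
	io->buf = NULL;
	io->completed = 1;
}

/* Submitter allocates the buffer and keeps it alive until completion. */
static int submitter_send(struct model_io *io, const void *data, size_t len)
{
	io->buf = malloc(len);
	if (!io->buf)
		return -1;
	memcpy(io->buf, data, len);
	io->len = len;
	io->completed = 0;
	io->done = submitter_done;
	return 0;
}

/* Transport side: uses the buffer, then signals completion. It takes
 * no reference of its own and never frees the memory itself. */
static void model_transport_complete(struct model_io *io)
{
	io->done(io);
}
```

Between submitter_send() and model_transport_complete() the buffer stays
valid because the submitter has not released it yet, which is the lifetime
guarantee described above.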


* RE: Networking people smell funny and make poor life choices
  2025-03-12 15:09                                           ` Christoph Hellwig
@ 2025-03-12 18:28                                             ` James R. Bergsten
  0 siblings, 0 replies; 36+ messages in thread
From: James R. Bergsten @ 2025-03-12 18:28 UTC (permalink / raw)
  To: 'Christoph Hellwig', 'Matthew Wilcox'
  Cc: 'Hannes Reinecke', 'Vlastimil Babka',
	'Hannes Reinecke', 'Boris Pismenny',
	'John Fastabend', 'Jakub Kicinski',
	'Sagi Grimberg',
	linux-nvme, linux-block, linux-mm, 'Harry Yoo',
	netdev

OK another "unnecessary" old-timer storage/network story/disruption to your otherwise relevant discussions (thank the subject line). If you're too busy, just don't read it. 😊

Around 1985, Gene Amdahl founded a company called Andor. Its original purpose (as with everything Gene did) was to build the smallest plug-compatible mainframe.  When it was designed, someone noted it had no physical room for the humongous "Bus and Tag" cables needed for peripherals, so Gene raised a bit more money and started a storage project too.

When the Loma Prieta earthquake happened in 1989, PG&E, the local utility, lost the datacenter containing all of the information needed to repair their utilities, so the service people had to do this from memory.  The Public Utilities Commission didn't find this terribly funny, so they said PG&E had to create a second datacenter out of the area immediately and have backups there within about 24 hours, shorter as time went on.  So, they shut down the primary site every night, dumped to tape, then drove it up to Sacramento where these were restored.  They named this CTAM for "Chevy Truck Access Method."

Somehow Gene and friends heard about this and, as they already had a processor, device simulation and devices, if they added some sort of networking interface, they could have a local unit and a remote unit doing this backup, eliminating the truck.  BTW the "front end" storage group all came from Memorex.  The "back end" group mostly from Amdahl.

This actually (somewhat) worked, and a couple of units were installed in beta sites.  Sadly, Gene ran out of money (or at least didn't accept the terms offered) and buggered off to start yet another mainframe company which never shipped anything.

I was the last Engineering VP at Andor, so when it folded, I grabbed a few of the people and started a similar company but for the open systems market instead.  We named it "Ark" at my wife's suggestion, as it was like Noah's Ark - "disaster recovery" and "two of everything."  We mostly bootstrapped, did ship product, and were acquired by LSI Logic who were getting beaten around the head as EMC had a remote solution, but LSI didn't. I got about a dozen US Patents Issued and enough money to finally buy a house in Silicon Valley.

Our (SCSI-based) device had front end ports for the host(s), back-end ones for the devices, and side ones for the networking.  Lots of features, some you folks are only doing recently.  Looked like devices to hosts, hosts to devices.

Anyway, the point of all of this is that when we sold it to customers, the storage people looked at the network ports with confusion and dismay (some hadn't even ordered the network lines and caused months of delay), while the network people looked at the device ports as if they were full of Tasmanian devils.

Turned out, combined network and storage expertise was a very rare commodity.  This was largely why most iSCSI startups failed: they did either a storage product or a networking product. We pulled this off because I am stupid but stubborn and wrote the RTOS myself (Linux was in its infancy and the other RTOSes sucked).  Seemed a good idea at the time.  Have white papers online if anybody is interested.

So, networking people may smell funny, but to them storage people come from another galaxy.  Working in this industry at all could be considered a poor life choice but that's for another time.

Sorry. You can go back to work now.
Jim B

-----Original Message-----
From: Linux-nvme <linux-nvme-bounces@lists.infradead.org> On Behalf Of Christoph Hellwig
Sent: Wednesday, March 12, 2025 8:09 AM
To: Matthew Wilcox <willy@infradead.org>
Cc: Hannes Reinecke <hare@suse.de>; Vlastimil Babka <vbabka@suse.cz>; Hannes Reinecke <hare@suse.com>; Boris Pismenny <borisp@nvidia.com>; John Fastabend <john.fastabend@gmail.com>; Jakub Kicinski <kuba@kernel.org>; Sagi Grimberg <sagi@grimberg.me>; linux-nvme@lists.infradead.org; linux-block@vger.kernel.org; linux-mm@kvack.org; Harry Yoo <harry.yoo@oracle.com>; netdev@vger.kernel.org
Subject: Re: Networking people smell funny and make poor life choices

On Wed, Mar 05, 2025 at 06:11:24PM +0000, Matthew Wilcox wrote:
> Networking needs to follow block's lead and STOP GETTING REFCOUNTS ON 
> PAGES.

The block layer never took references on pages.  The direct I/O helpers that just happened to sit in block/ did hold references and abused some field in the bio for it (and still do for the pinning), but the reference was (and the pin now is) owned by the submitter.

The block layer model has always been that the submitter needs to ensure memory stays allocated until the I/O has completed.  Which IMHO is the only sane model for dealing with memory lifetimes vs I/O, and something networking absolutely should follow.


* Re: Networking people smell funny and make poor life choices
  2025-03-05 18:11                                         ` Networking people smell funny and make poor life choices Matthew Wilcox
  2025-03-06  0:46                                           ` Cong Wang
  2025-03-12 15:09                                           ` Christoph Hellwig
@ 2025-03-13  9:43                                           ` David Laight
  2 siblings, 0 replies; 36+ messages in thread
From: David Laight @ 2025-03-13  9:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hannes Reinecke, Vlastimil Babka, Hannes Reinecke,
	Boris Pismenny, John Fastabend, Jakub Kicinski, Sagi Grimberg,
	linux-nvme, linux-block, linux-mm, Harry Yoo, netdev

On Wed, 5 Mar 2025 18:11:24 +0000
Matthew Wilcox <willy@infradead.org> wrote:


> What worries me is that nobody in networking has replied to this thread
> yet.  Do they not care?  Let's see if a subject line change will help
> with that.
> 

I like being smelly :-(



Thread overview: 36+ messages
     [not found] <08c29e4b-2f71-4b6d-8046-27e407214d8c@suse.com>
2025-03-03  7:48 ` Kernel oops with 6.14 when enabling TLS Hannes Reinecke
2025-03-03 11:06   ` Hannes Reinecke
2025-03-03 12:57     ` Hannes Reinecke
2025-03-03 13:57     ` Matthew Wilcox
2025-03-03 14:05       ` Hannes Reinecke
2025-03-03 14:27   ` Matthew Wilcox
2025-03-03 14:42     ` Matthew Wilcox
2025-03-03 15:12       ` Vlastimil Babka
2025-03-03 15:39       ` Hannes Reinecke
2025-03-03 15:48         ` Matthew Wilcox
2025-03-03 16:15           ` Vlastimil Babka
2025-03-03 22:02             ` Vlastimil Babka
2025-03-04  7:58               ` Hannes Reinecke
2025-03-04  8:18                 ` Vlastimil Babka
2025-03-04 10:20                   ` Hannes Reinecke
2025-03-04 10:26                     ` Vlastimil Babka
2025-03-04 15:11                       ` Hannes Reinecke
2025-03-04 15:29                       ` Vlastimil Babka
2025-03-04 16:20                         ` Hannes Reinecke
2025-03-04 16:14                       ` Matthew Wilcox
2025-03-04 16:32                         ` Hannes Reinecke
2025-03-04 16:53                           ` Matthew Wilcox
2025-03-04 18:05                             ` Matthew Wilcox
2025-03-04 18:31                               ` Vlastimil Babka
2025-03-04 19:39                               ` Hannes Reinecke
2025-03-04 19:44                                 ` Vlastimil Babka
2025-03-05  7:14                                   ` Hannes Reinecke
2025-03-05  8:20                                   ` Hannes Reinecke
2025-03-05  8:58                                     ` Vlastimil Babka
2025-03-05 11:43                                       ` Hannes Reinecke
2025-03-05 18:11                                         ` Networking people smell funny and make poor life choices Matthew Wilcox
2025-03-06  0:46                                           ` Cong Wang
2025-03-12 15:09                                           ` Christoph Hellwig
2025-03-12 18:28                                             ` James R. Bergsten
2025-03-13  9:43                                           ` David Laight
2025-03-06  9:15                                         ` Kernel oops with 6.14 when enabling TLS Vlastimil Babka
