* [PATCH RFC 0/1] fix for large folio split race in page cache
@ 2026-03-05 18:34 Chris J Arges
  2026-03-05 18:34 ` [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups Chris J Arges
  0 siblings, 1 reply; 11+ messages in thread

From: Chris J Arges @ 2026-03-05 18:34 UTC (permalink / raw)
  To: willy, akpm, william.kucharski
  Cc: linux-fsdevel, linux-mm, linux-kernel, kernel-team, Chris J Arges

In production we've seen crashes on 6.18.7+ with the signature below. These
machines are under high memory pressure, use XFS filesystems, and journalctl
was generally the comm at the time of the oops. After some crash-dump
analysis we determined that this is a race condition.

We tried to create a more self-contained reproducer for this issue, but
unfortunately were unable to do so. This patch will be applied internally as
a mitigation, but full validation (ensuring we don't see crashes over a
longer period) will take time. We are looking for feedback on whether this
could be a valid fix or whether there are other approaches we should look
into.

An earlier email I posted with some analysis is here:
https://lore.kernel.org/lkml/aYN3JC_Kdgw5G2Ik@861G6M3/T/#u

Thanks,
--chris

Call Trace:
```
aops:xfs_address_space_operations ino:5000126 dentry name(?):"system@d737aaecce5449038a638f9e18bbf5f5-0000000004e06fa7-00064"
flags: 0xeffff8000001ad(locked|waiters|referenced|uptodate|lru|active|node=3|zone=2|lastcpupid=0x1ffff)
raw: 00effff8000001ad ffaa3c6b85b73ec8 ffaa3c6b85b73e08 ff4e378b0e95dea8
raw: 000000000000737a 0000000000000000 00000002ffffffff ff4e379527691b00
page dumped because: VM_BUG_ON_FOLIO(!folio_contains(folio, index))
------------[ cut here ]------------
kernel BUG at mm/filemap.c:3519!
Oops: invalid opcode: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 49159 Comm: journalctl Kdump: loaded Tainted: G W O 6.18.7-cloudflare-2026.1.15 #1 PREEMPT(voluntary)
Tainted: [W]=WARN, [O]=OOT_MODULE
Hardware name: MiTAC TC55-B8051-G12/S8051GM, BIOS V1.08 09/16/2025
RIP: 0010:filemap_fault+0xa61/0x1410
Code: 48 8b 4c 24 10 4c 8b 44 24 08 48 85 c9 0f 84 82 fa ff ff 49 89 cd e9 bc f9 ff ff 48 c7 c6 20 44 d0 86 4c 89 c7 e8 3f 1c 04 00 <0f> 0b 48 8d 7b 18 4c 89 44 24 08 4c 89 1c 24 e8 0b 97 e3 ff 4c 8b
RSP: 0000:ff6fd043bed0fcb0 EFLAGS: 00010246
RAX: 0000000000000043 RBX: ff4e378b0e95dea8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ff4e375cef81c4c0
RBP: 000000000000737b R08: 0000000000000000 R09: ff6fd043bed0fb48
R10: ff4e37b4ecc3ffa8 R11: 0000000000000003 R12: 0000000000000000
R13: ff4e375c4fa17680 R14: ff4e378b0e95dd38 R15: ff6fd043bed0fde8
FS:  00007f6c5b8b4980(0000) GS:ff4e375d67864000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6c48b7b050 CR3: 0000005065d34006 CR4: 0000000000771ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? mod_memcg_state+0x80/0x1c0
 __do_fault+0x31/0xd0
 do_fault+0x2e6/0x710
 __handle_mm_fault+0x7b3/0xe50
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? anon_pipe_write+0x27e/0x670
 handle_mm_fault+0xaa/0x2a0
 do_user_addr_fault+0x208/0x660
 exc_page_fault+0x77/0x170
 asm_exc_page_fault+0x26/0x30
RIP: 0033:0x7f6c5b67c3dc
Code: e2 ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 55 41 54 55 53 48 83 ec 18 48 85 ff 0f 84 bd 01 00 00 48 85 f6 0f 84 d4 01 00 00 <48> 8b 5e 08 48 89 cd 48 85 db 74 60 48 83 fb 0f 0f 86 86 00 00 00
RSP: 002b:00007ffe78c072e0 EFLAGS: 00010206
RAX: 0000000000000000 RBX: 000000000737b048 RCX: 000000000737b048
RDX: 0000000000000003 RSI: 00007f6c48b7b048 RDI: 000055bc3b28dee0
RBP: 000055bc3b28dee0 R08: 0000000000000010 R09: 000055bc3b28df18
R10: 0000000000000001 R11: 00007f6c5b679fa0 R12: 0000000000000003
R13: 00007ffe78c07450 R14: 00007ffe78c07450 R15: 00007f6c48b7b048
 </TASK>
```

Chris J Arges (1):
  mm/filemap: handle large folio split race in page cache lookups

 mm/filemap.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

-- 
2.43.0

^ permalink raw reply	[flat|nested] 11+ messages in thread
* [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-05 18:34 [PATCH RFC 0/1] fix for large folio split race in page cache Chris J Arges
@ 2026-03-05 18:34 ` Chris J Arges
  2026-03-05 19:24   ` Matthew Wilcox
  0 siblings, 1 reply; 11+ messages in thread

From: Chris J Arges @ 2026-03-05 18:34 UTC (permalink / raw)
  To: willy, akpm, william.kucharski
  Cc: linux-fsdevel, linux-mm, linux-kernel, kernel-team, Chris J Arges

We have been hitting VM_BUG_ON_FOLIO(!folio_contains(folio, index)) in
production environments. These machines are using XFS with large folio
support enabled and are under high memory pressure.

From reading the code it seems plausible that folio splits due to memory
reclaim are racing with filemap_fault() serving mmap page faults.

The existing code checks for truncation (folio->mapping != mapping) and
retries, but there does not appear to be equivalent handling for the
split case. The result is:

  kernel BUG at mm/filemap.c:3519!
  VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio)

This RFC patch extends the existing truncation retry checks to also
cover the case where the folio no longer contains the target index.

Fixes: e292e6d644ce ("filemap: Convert filemap_fault to folio")
Signed-off-by: Chris J Arges <carges@cloudflare.com>
---
 mm/filemap.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6cd7974d4ada..334d3f700beb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1954,13 +1954,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
 			folio_lock(folio);
 		}
 
-		/* Has the page been truncated? */
-		if (unlikely(folio->mapping != mapping)) {
+		/* Has the page been truncated or split? */
+		if (unlikely(folio->mapping != mapping) ||
+		    unlikely(!folio_contains(folio, index))) {
 			folio_unlock(folio);
 			folio_put(folio);
 			goto repeat;
 		}
-		VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
 	}
 
 	if (fgp_flags & FGP_ACCESSED)
@@ -2179,10 +2179,9 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
 			if (!folio_trylock(folio))
 				goto put;
 			if (folio->mapping != mapping ||
-			    folio_test_writeback(folio))
+			    folio_test_writeback(folio) ||
+			    !folio_contains(folio, xas.xa_index))
 				goto unlock;
-			VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index),
-					folio);
 		} else {
 			nr = 1 << xas_get_order(&xas);
 			base = xas.xa_index & ~(nr - 1);
@@ -3570,13 +3569,13 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	if (!lock_folio_maybe_drop_mmap(vmf, folio, &fpin))
 		goto out_retry;
 
-	/* Did it get truncated? */
-	if (unlikely(folio->mapping != mapping)) {
+	/* Did it get truncated or split? */
+	if (unlikely(folio->mapping != mapping) ||
+	    unlikely(!folio_contains(folio, index))) {
 		folio_unlock(folio);
 		folio_put(folio);
 		goto retry_find;
 	}
-	VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
 
 	/*
 	 * We have a locked folio in the page cache, now we need to check
-- 
2.43.0

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-05 18:34 ` [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups Chris J Arges
@ 2026-03-05 19:24   ` Matthew Wilcox
  2026-03-06 14:13     ` Kiryl Shutsemau
  0 siblings, 1 reply; 11+ messages in thread

From: Matthew Wilcox @ 2026-03-05 19:24 UTC (permalink / raw)
  To: Chris J Arges
  Cc: akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On Thu, Mar 05, 2026 at 12:34:33PM -0600, Chris J Arges wrote:
> We have been hitting VM_BUG_ON_FOLIO(!folio_contains(folio, index)) in
> production environments. These machines are using XFS with large folio
> support enabled and are under high memory pressure.
>
> From reading the code it seems plausible that folio splits due to memory
> reclaim are racing with filemap_fault() serving mmap page faults.
>
> The existing code checks for truncation (folio->mapping != mapping) and
> retries, but there does not appear to be equivalent handling for the
> split case. The result is:
>
> kernel BUG at mm/filemap.c:3519!
> VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio)

This didn't occur to me as a possibility because filemap_get_entry()
is _supposed_ to take care of it. But if this patch fixes it, then
we need to understand why it works.

folio_split() needs to be sure that it's the only one holding a reference
to the folio. To that end, it calculates the expected refcount of the
folio, and freezes it (sets the refcount to 0 if the refcount is the
expected value). Once filemap_get_entry() has incremented the refcount,
freezing will fail.

But of course, we can race. filemap_get_entry() can load a folio first,
the entire folio_split can happen, then it calls folio_try_get() and
succeeds, but it no longer covers the index we were looking for. That's
what the xas_reload() is trying to prevent -- if the index is for a
folio which has changed, then the xas_reload() should come back with a
different folio and we goto repeat.

So how did we get through this with a reference to the wrong folio?

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-05 19:24   ` Matthew Wilcox
@ 2026-03-06 14:13     ` Kiryl Shutsemau
  2026-03-06 16:28       ` Matthew Wilcox
  0 siblings, 1 reply; 11+ messages in thread

From: Kiryl Shutsemau @ 2026-03-06 14:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Chris J Arges, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On Thu, Mar 05, 2026 at 07:24:38PM +0000, Matthew Wilcox wrote:
> On Thu, Mar 05, 2026 at 12:34:33PM -0600, Chris J Arges wrote:
> > We have been hitting VM_BUG_ON_FOLIO(!folio_contains(folio, index)) in
> > production environments. These machines are using XFS with large folio
> > support enabled and are under high memory pressure.
> >
> > From reading the code it seems plausible that folio splits due to memory
> > reclaim are racing with filemap_fault() serving mmap page faults.
> >
> > The existing code checks for truncation (folio->mapping != mapping) and
> > retries, but there does not appear to be equivalent handling for the
> > split case. The result is:
> >
> > kernel BUG at mm/filemap.c:3519!
> > VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio)
>
> This didn't occur to me as a possibility because filemap_get_entry()
> is _supposed_ to take care of it. But if this patch fixes it, then
> we need to understand why it works.
>
> folio_split() needs to be sure that it's the only one holding a reference
> to the folio. To that end, it calculates the expected refcount of the
> folio, and freezes it (sets the refcount to 0 if the refcount is the
> expected value). Once filemap_get_entry() has incremented the refcount,
> freezing will fail.
>
> But of course, we can race. filemap_get_entry() can load a folio first,
> the entire folio_split can happen, then it calls folio_try_get() and
> succeeds, but it no longer covers the index we were looking for. That's
> what the xas_reload() is trying to prevent -- if the index is for a
> folio which has changed, then the xas_reload() should come back with a
> different folio and we goto repeat.
>
> So how did we get through this with a reference to the wrong folio?

What would xas_reload() return if we raced with a split and the index
pointed to a tail page before the split?

Wouldn't it return the folio that was the head, so the check would pass?

-- 
Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-06 14:13     ` Kiryl Shutsemau
@ 2026-03-06 16:28       ` Matthew Wilcox
  2026-03-06 18:36         ` Kiryl Shutsemau
  2026-03-06 20:11         ` Chris Arges
  0 siblings, 2 replies; 11+ messages in thread

From: Matthew Wilcox @ 2026-03-06 16:28 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Chris J Arges, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On Fri, Mar 06, 2026 at 02:13:26PM +0000, Kiryl Shutsemau wrote:
> On Thu, Mar 05, 2026 at 07:24:38PM +0000, Matthew Wilcox wrote:
> > folio_split() needs to be sure that it's the only one holding a reference
> > to the folio. To that end, it calculates the expected refcount of the
> > folio, and freezes it (sets the refcount to 0 if the refcount is the
> > expected value). Once filemap_get_entry() has incremented the refcount,
> > freezing will fail.
> >
> > But of course, we can race. filemap_get_entry() can load a folio first,
> > the entire folio_split can happen, then it calls folio_try_get() and
> > succeeds, but it no longer covers the index we were looking for. That's
> > what the xas_reload() is trying to prevent -- if the index is for a
> > folio which has changed, then the xas_reload() should come back with a
> > different folio and we goto repeat.
> >
> > So how did we get through this with a reference to the wrong folio?
>
> What would xas_reload() return if we raced with a split and the index
> pointed to a tail page before the split?
>
> Wouldn't it return the folio that was the head, so the check would pass?

It's not supposed to return the head in this case. But, check the code:

	if (!node)
		return xa_head(xas->xa);
	if (IS_ENABLED(CONFIG_XARRAY_MULTI)) {
		offset = (xas->xa_index >> node->shift) & XA_CHUNK_MASK;
		entry = xa_entry(xas->xa, node, offset);
		if (!xa_is_sibling(entry))
			return entry;
		offset = xa_to_sibling(entry);
	}
	return xa_entry(xas->xa, node, offset);

(obviously CONFIG_XARRAY_MULTI is enabled)

!node is almost certainly not true -- that's only the case if there's a
single entry at offset 0, and we're talking about a situation where we
have a large folio.

I think we have two cases to consider; one where we've allocated a new
node because we split an entry from order >= 6 to order < 6, and one where
we just split an entry that stays at the same level in the tree.

So let's say we're looking up an entry at index 1499 and first we got
a folio that is at index 1024, order 9. First, let's look at what
happens if it's split into two order-8 folios. We get a reference on the
first one, then we calculate offset as ((1499 >> 6) & 63), which is 23.
Unless folio splitting is buggy, the original folio is in slot 16 and
has sibling entries in 17, 18, 19, and the new folio is in slot 20 and has
sibling entries in 21, 22, 23. So we should find a sibling entry in slot
23 that points to 20, then return the new folio in slot 20, which would
mismatch the old folio that we got a refcount on.

Then let's consider what happens if we split the index at 1499 into an
order-0 folio. The folio split allocated a new node and put it at offset 23
(and populated the new node, but we don't need to be concerned with that
here). This time the lookup finds the new node and actually returns the
node instead of a folio. But that's OK, because we're just checking
for pointer equality, and there's no way this node compares equal to
any folio we found (not least because it has a low bit set to indicate
this is a node and not a pointer). So again the pointer equality check
fails, and we drop the speculative refcount we obtained and retry the loop.

Have I missed something? Maybe a memory ordering problem?

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-06 16:28       ` Matthew Wilcox
@ 2026-03-06 18:36         ` Kiryl Shutsemau
  2026-03-06 18:41           ` Matthew Wilcox
  0 siblings, 1 reply; 11+ messages in thread

From: Kiryl Shutsemau @ 2026-03-06 18:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Chris J Arges, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On Fri, Mar 06, 2026 at 04:28:19PM +0000, Matthew Wilcox wrote:
> On Fri, Mar 06, 2026 at 02:13:26PM +0000, Kiryl Shutsemau wrote:
> > On Thu, Mar 05, 2026 at 07:24:38PM +0000, Matthew Wilcox wrote:
> > > folio_split() needs to be sure that it's the only one holding a reference
> > > to the folio. To that end, it calculates the expected refcount of the
> > > folio, and freezes it (sets the refcount to 0 if the refcount is the
> > > expected value). Once filemap_get_entry() has incremented the refcount,
> > > freezing will fail.
> > >
> > > But of course, we can race. filemap_get_entry() can load a folio first,
> > > the entire folio_split can happen, then it calls folio_try_get() and
> > > succeeds, but it no longer covers the index we were looking for. That's
> > > what the xas_reload() is trying to prevent -- if the index is for a
> > > folio which has changed, then the xas_reload() should come back with a
> > > different folio and we goto repeat.
> > >
> > > So how did we get through this with a reference to the wrong folio?
> >
> > What would xas_reload() return if we raced with a split and the index
> > pointed to a tail page before the split?
> >
> > Wouldn't it return the folio that was the head, so the check would pass?
>
> It's not supposed to return the head in this case. But, check the code:
>
> 	if (!node)
> 		return xa_head(xas->xa);
> 	if (IS_ENABLED(CONFIG_XARRAY_MULTI)) {
> 		offset = (xas->xa_index >> node->shift) & XA_CHUNK_MASK;
> 		entry = xa_entry(xas->xa, node, offset);
> 		if (!xa_is_sibling(entry))
> 			return entry;
> 		offset = xa_to_sibling(entry);
> 	}
> 	return xa_entry(xas->xa, node, offset);
>
> (obviously CONFIG_XARRAY_MULTI is enabled)
>
> !node is almost certainly not true -- that's only the case if there's a
> single entry at offset 0, and we're talking about a situation where we
> have a large folio.
>
> I think we have two cases to consider; one where we've allocated a new
> node because we split an entry from order >= 6 to order < 6, and one where
> we just split an entry that stays at the same level in the tree.
>
> So let's say we're looking up an entry at index 1499 and first we got
> a folio that is at index 1024, order 9. First, let's look at what
> happens if it's split into two order-8 folios. We get a reference on the
> first one, then we calculate offset as ((1499 >> 6) & 63), which is 23.
> Unless folio splitting is buggy, the original folio is in slot 16 and
> has sibling entries in 17, 18, 19, and the new folio is in slot 20 and has
> sibling entries in 21, 22, 23. So we should find a sibling entry in slot
> 23 that points to 20, then return the new folio in slot 20, which would
> mismatch the old folio that we got a refcount on.
>
> Then let's consider what happens if we split the index at 1499 into an
> order-0 folio. The folio split allocated a new node and put it at offset 23
> (and populated the new node, but we don't need to be concerned with that
> here). This time the lookup finds the new node and actually returns the
> node instead of a folio. But that's OK, because we're just checking
> for pointer equality, and there's no way this node compares equal to
> any folio we found (not least because it has a low bit set to indicate
> this is a node and not a pointer). So again the pointer equality check
> fails, and we drop the speculative refcount we obtained and retry the loop.

Thanks for the analysis. It is very helpful. I don't understand the
xarray internals.

> Have I missed something? Maybe a memory ordering problem?

I also considered a reclaim/refault scenario, but I don't see anything.
Maybe memory ordering. Who knows. I guess we need more breadcrumbs.

The proposed change doesn't fix anything, but hides the problem.
It would be better to downgrade the VM_BUG_ON_FOLIO() to a warning +
retry.

-- 
Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-06 18:36         ` Kiryl Shutsemau
@ 2026-03-06 18:41           ` Matthew Wilcox
  2026-03-06 20:20             ` Kiryl Shutsemau
  0 siblings, 1 reply; 11+ messages in thread

From: Matthew Wilcox @ 2026-03-06 18:41 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Chris J Arges, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On Fri, Mar 06, 2026 at 06:36:30PM +0000, Kiryl Shutsemau wrote:
> The proposed change doesn't fix anything, but hides the problem.
> It would be better to downgrade the VM_BUG_ON_FOLIO() to a warning +
> retry.

The trouble is that a retry only happens to work in ... whatever scenario
this is. If there's a persistent corruption of the radix tree, a retry
might be an infinite loop, which isn't terribly helpful.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-06 18:41           ` Matthew Wilcox
@ 2026-03-06 20:20             ` Kiryl Shutsemau
  0 siblings, 0 replies; 11+ messages in thread

From: Kiryl Shutsemau @ 2026-03-06 20:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Chris J Arges, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On Fri, Mar 06, 2026 at 06:41:08PM +0000, Matthew Wilcox wrote:
> On Fri, Mar 06, 2026 at 06:36:30PM +0000, Kiryl Shutsemau wrote:
> > The proposed change doesn't fix anything, but hides the problem.
> > It would be better to downgrade the VM_BUG_ON_FOLIO() to a warning +
> > retry.
>
> The trouble is that a retry only happens to work in ... whatever scenario
> this is. If there's a persistent corruption of the radix tree, a retry
> might be an infinite loop, which isn't terribly helpful.

Knowing whether the problem is transient would be useful.

-- 
Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-06 16:28       ` Matthew Wilcox
  2026-03-06 18:36         ` Kiryl Shutsemau
@ 2026-03-06 20:11         ` Chris Arges
  2026-03-06 20:21           ` Kiryl Shutsemau
  1 sibling, 1 reply; 11+ messages in thread

From: Chris Arges @ 2026-03-06 20:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kiryl Shutsemau, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On 2026-03-06 16:28:19, Matthew Wilcox wrote:
> On Fri, Mar 06, 2026 at 02:13:26PM +0000, Kiryl Shutsemau wrote:
> > On Thu, Mar 05, 2026 at 07:24:38PM +0000, Matthew Wilcox wrote:
> > > folio_split() needs to be sure that it's the only one holding a reference
> > > to the folio. To that end, it calculates the expected refcount of the
> > > folio, and freezes it (sets the refcount to 0 if the refcount is the
> > > expected value). Once filemap_get_entry() has incremented the refcount,
> > > freezing will fail.
> > >
> > > But of course, we can race. filemap_get_entry() can load a folio first,
> > > the entire folio_split can happen, then it calls folio_try_get() and
> > > succeeds, but it no longer covers the index we were looking for. That's
> > > what the xas_reload() is trying to prevent -- if the index is for a
> > > folio which has changed, then the xas_reload() should come back with a
> > > different folio and we goto repeat.
> > >
> > > So how did we get through this with a reference to the wrong folio?
> >
> > What would xas_reload() return if we raced with a split and the index
> > pointed to a tail page before the split?
> >
> > Wouldn't it return the folio that was the head, so the check would pass?
>
> It's not supposed to return the head in this case. But, check the code:
>
> 	if (!node)
> 		return xa_head(xas->xa);
> 	if (IS_ENABLED(CONFIG_XARRAY_MULTI)) {
> 		offset = (xas->xa_index >> node->shift) & XA_CHUNK_MASK;
> 		entry = xa_entry(xas->xa, node, offset);
> 		if (!xa_is_sibling(entry))
> 			return entry;
> 		offset = xa_to_sibling(entry);
> 	}
> 	return xa_entry(xas->xa, node, offset);
>
> (obviously CONFIG_XARRAY_MULTI is enabled)
>
Yes, we have this CONFIG enabled.

Also FWIW, happy to run some additional experiments or more debugging. We
_can_ reproduce this, as a machine hits this about every day on a sample of
~128 machines. We also get crashdumps, so we can poke around there as needed.

I was going to deploy this patch onto a subset of machines, but reading
through this thread I'm a bit concerned that if a retry doesn't actually fix
the problem, we will just loop on this condition and hang.

--chris

> !node is almost certainly not true -- that's only the case if there's a
> single entry at offset 0, and we're talking about a situation where we
> have a large folio.
>
> I think we have two cases to consider; one where we've allocated a new
> node because we split an entry from order >= 6 to order < 6, and one where
> we just split an entry that stays at the same level in the tree.
>
> So let's say we're looking up an entry at index 1499 and first we got
> a folio that is at index 1024, order 9. First, let's look at what
> happens if it's split into two order-8 folios. We get a reference on the
> first one, then we calculate offset as ((1499 >> 6) & 63), which is 23.
> Unless folio splitting is buggy, the original folio is in slot 16 and
> has sibling entries in 17, 18, 19, and the new folio is in slot 20 and has
> sibling entries in 21, 22, 23. So we should find a sibling entry in slot
> 23 that points to 20, then return the new folio in slot 20, which would
> mismatch the old folio that we got a refcount on.
>
> Then let's consider what happens if we split the index at 1499 into an
> order-0 folio. The folio split allocated a new node and put it at offset 23
> (and populated the new node, but we don't need to be concerned with that
> here). This time the lookup finds the new node and actually returns the
> node instead of a folio. But that's OK, because we're just checking
> for pointer equality, and there's no way this node compares equal to
> any folio we found (not least because it has a low bit set to indicate
> this is a node and not a pointer). So again the pointer equality check
> fails, and we drop the speculative refcount we obtained and retry the loop.
>
> Have I missed something? Maybe a memory ordering problem?

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-06 20:11         ` Chris Arges
@ 2026-03-06 20:21           ` Kiryl Shutsemau
  2026-03-06 20:58             ` Chris Arges
  0 siblings, 1 reply; 11+ messages in thread

From: Kiryl Shutsemau @ 2026-03-06 20:21 UTC (permalink / raw)
  To: Chris Arges
  Cc: Matthew Wilcox, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On Fri, Mar 06, 2026 at 02:11:22PM -0600, Chris Arges wrote:
> On 2026-03-06 16:28:19, Matthew Wilcox wrote:
> > On Fri, Mar 06, 2026 at 02:13:26PM +0000, Kiryl Shutsemau wrote:
> > > On Thu, Mar 05, 2026 at 07:24:38PM +0000, Matthew Wilcox wrote:
> > > > folio_split() needs to be sure that it's the only one holding a reference
> > > > to the folio. To that end, it calculates the expected refcount of the
> > > > folio, and freezes it (sets the refcount to 0 if the refcount is the
> > > > expected value). Once filemap_get_entry() has incremented the refcount,
> > > > freezing will fail.
> > > >
> > > > But of course, we can race. filemap_get_entry() can load a folio first,
> > > > the entire folio_split can happen, then it calls folio_try_get() and
> > > > succeeds, but it no longer covers the index we were looking for. That's
> > > > what the xas_reload() is trying to prevent -- if the index is for a
> > > > folio which has changed, then the xas_reload() should come back with a
> > > > different folio and we goto repeat.
> > > >
> > > > So how did we get through this with a reference to the wrong folio?
> > >
> > > What would xas_reload() return if we raced with a split and the index
> > > pointed to a tail page before the split?
> > >
> > > Wouldn't it return the folio that was the head, so the check would pass?
> >
> > It's not supposed to return the head in this case. But, check the code:
> >
> > 	if (!node)
> > 		return xa_head(xas->xa);
> > 	if (IS_ENABLED(CONFIG_XARRAY_MULTI)) {
> > 		offset = (xas->xa_index >> node->shift) & XA_CHUNK_MASK;
> > 		entry = xa_entry(xas->xa, node, offset);
> > 		if (!xa_is_sibling(entry))
> > 			return entry;
> > 		offset = xa_to_sibling(entry);
> > 	}
> > 	return xa_entry(xas->xa, node, offset);
> >
> > (obviously CONFIG_XARRAY_MULTI is enabled)
> >
> Yes, we have this CONFIG enabled.
>
> Also FWIW, happy to run some additional experiments or more debugging. We
> _can_ reproduce this, as a machine hits this about every day on a sample of
> ~128 machines. We also get crashdumps, so we can poke around there as needed.
>
> I was going to deploy this patch onto a subset of machines, but reading
> through this thread I'm a bit concerned that if a retry doesn't actually fix
> the problem, we will just loop on this condition and hang.

It would be useful to know if the condition is persistent or if a retry
"fixes" the problem.

-- 
Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-06 20:21           ` Kiryl Shutsemau
@ 2026-03-06 20:58             ` Chris Arges
  0 siblings, 0 replies; 11+ messages in thread

From: Chris Arges @ 2026-03-06 20:58 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Matthew Wilcox, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On 2026-03-06 20:21:59, Kiryl Shutsemau wrote:
> On Fri, Mar 06, 2026 at 02:11:22PM -0600, Chris Arges wrote:
> > On 2026-03-06 16:28:19, Matthew Wilcox wrote:
> > > On Fri, Mar 06, 2026 at 02:13:26PM +0000, Kiryl Shutsemau wrote:
> > > > On Thu, Mar 05, 2026 at 07:24:38PM +0000, Matthew Wilcox wrote:
> > > > > folio_split() needs to be sure that it's the only one holding a reference
> > > > > to the folio. To that end, it calculates the expected refcount of the
> > > > > folio, and freezes it (sets the refcount to 0 if the refcount is the
> > > > > expected value). Once filemap_get_entry() has incremented the refcount,
> > > > > freezing will fail.
> > > > >
> > > > > But of course, we can race. filemap_get_entry() can load a folio first,
> > > > > the entire folio_split can happen, then it calls folio_try_get() and
> > > > > succeeds, but it no longer covers the index we were looking for. That's
> > > > > what the xas_reload() is trying to prevent -- if the index is for a
> > > > > folio which has changed, then the xas_reload() should come back with a
> > > > > different folio and we goto repeat.
> > > > >
> > > > > So how did we get through this with a reference to the wrong folio?
> > > >
> > > > What would xas_reload() return if we raced with a split and the index
> > > > pointed to a tail page before the split?
> > > >
> > > > Wouldn't it return the folio that was the head, so the check would pass?
> > >
> > > It's not supposed to return the head in this case. But, check the code:
> > >
> > > 	if (!node)
> > > 		return xa_head(xas->xa);
> > > 	if (IS_ENABLED(CONFIG_XARRAY_MULTI)) {
> > > 		offset = (xas->xa_index >> node->shift) & XA_CHUNK_MASK;
> > > 		entry = xa_entry(xas->xa, node, offset);
> > > 		if (!xa_is_sibling(entry))
> > > 			return entry;
> > > 		offset = xa_to_sibling(entry);
> > > 	}
> > > 	return xa_entry(xas->xa, node, offset);
> > >
> > > (obviously CONFIG_XARRAY_MULTI is enabled)
> > >
> > Yes, we have this CONFIG enabled.
> >
> > Also FWIW, happy to run some additional experiments or more debugging. We
> > _can_ reproduce this, as a machine hits this about every day on a sample of
> > ~128 machines. We also get crashdumps, so we can poke around there as needed.
> >
> > I was going to deploy this patch onto a subset of machines, but reading
> > through this thread I'm a bit concerned that if a retry doesn't actually fix
> > the problem, we will just loop on this condition and hang.
>
> It would be useful to know if the condition is persistent or if a retry
> "fixes" the problem.

Fair enough. I suppose it's either crashing or locking up. Will deploy
early next week and see what happens.

--chris

^ permalink raw reply	[flat|nested] 11+ messages in thread
end of thread, other threads:[~2026-03-06 20:58 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-05 18:34 [PATCH RFC 0/1] fix for large folio split race in page cache Chris J Arges
2026-03-05 18:34 ` [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups Chris J Arges
2026-03-05 19:24   ` Matthew Wilcox
2026-03-06 14:13     ` Kiryl Shutsemau
2026-03-06 16:28       ` Matthew Wilcox
2026-03-06 18:36         ` Kiryl Shutsemau
2026-03-06 18:41           ` Matthew Wilcox
2026-03-06 20:20             ` Kiryl Shutsemau
2026-03-06 20:11         ` Chris Arges
2026-03-06 20:21           ` Kiryl Shutsemau
2026-03-06 20:58             ` Chris Arges