* [PATCH RFC 0/1] fix for large folio split race in page cache
@ 2026-03-05 18:34 Chris J Arges
  2026-03-05 18:34 ` [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups Chris J Arges
  0 siblings, 1 reply; 11+ messages in thread

From: Chris J Arges @ 2026-03-05 18:34 UTC (permalink / raw)
  To: willy, akpm, william.kucharski
  Cc: linux-fsdevel, linux-mm, linux-kernel, kernel-team, Chris J Arges

In production we've seen crashes on 6.18.7+ with the signature below. These
machines are under high memory pressure, use XFS filesystems, and journalctl
was generally the comm at the time of the oops. After some crash-dump
analysis we determined that this is a race condition.

We tried to create a more self-contained reproducer for this issue, but
unfortunately were unable to do so. This patch will be applied internally as
a mitigation, but full validation (ensuring we don't see crashes over a
longer period) will take time. We are looking for feedback on whether this
could be a valid fix or whether there are other approaches we should look
into.

An earlier email I posted with some analysis is here:
https://lore.kernel.org/lkml/aYN3JC_Kdgw5G2Ik@861G6M3/T/#u

Thanks,
--chris

Call Trace:
```
aops:xfs_address_space_operations ino:5000126 dentry name(?):"system@d737aaecce5449038a638f9e18bbf5f5-0000000004e06fa7-00064"
flags: 0xeffff8000001ad(locked|waiters|referenced|uptodate|lru|active|node=3|zone=2|lastcpupid=0x1ffff)
raw: 00effff8000001ad ffaa3c6b85b73ec8 ffaa3c6b85b73e08 ff4e378b0e95dea8
raw: 000000000000737a 0000000000000000 00000002ffffffff ff4e379527691b00
page dumped because: VM_BUG_ON_FOLIO(!folio_contains(folio, index))
------------[ cut here ]------------
kernel BUG at mm/filemap.c:3519!
Oops: invalid opcode: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 49159 Comm: journalctl Kdump: loaded Tainted: G W O 6.18.7-cloudflare-2026.1.15 #1 PREEMPT(voluntary)
Tainted: [W]=WARN, [O]=OOT_MODULE
Hardware name: MiTAC TC55-B8051-G12/S8051GM, BIOS V1.08 09/16/2025
RIP: 0010:filemap_fault+0xa61/0x1410
Code: 48 8b 4c 24 10 4c 8b 44 24 08 48 85 c9 0f 84 82 fa ff ff 49 89 cd e9 bc f9 ff ff 48 c7 c6 20 44 d0 86 4c 89 c7 e8 3f 1c 04 00 <0f> 0b 48 8d 7b 18 4c 89 44 24 08 4c 89 1c 24 e8 0b 97 e3 ff 4c 8b
RSP: 0000:ff6fd043bed0fcb0 EFLAGS: 00010246
RAX: 0000000000000043 RBX: ff4e378b0e95dea8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ff4e375cef81c4c0
RBP: 000000000000737b R08: 0000000000000000 R09: ff6fd043bed0fb48
R10: ff4e37b4ecc3ffa8 R11: 0000000000000003 R12: 0000000000000000
R13: ff4e375c4fa17680 R14: ff4e378b0e95dd38 R15: ff6fd043bed0fde8
FS:  00007f6c5b8b4980(0000) GS:ff4e375d67864000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6c48b7b050 CR3: 0000005065d34006 CR4: 0000000000771ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? mod_memcg_state+0x80/0x1c0
 __do_fault+0x31/0xd0
 do_fault+0x2e6/0x710
 __handle_mm_fault+0x7b3/0xe50
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? anon_pipe_write+0x27e/0x670
 handle_mm_fault+0xaa/0x2a0
 do_user_addr_fault+0x208/0x660
 exc_page_fault+0x77/0x170
 asm_exc_page_fault+0x26/0x30
RIP: 0033:0x7f6c5b67c3dc
Code: e2 ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 55 41 54 55 53 48 83 ec 18 48 85 ff 0f 84 bd 01 00 00 48 85 f6 0f 84 d4 01 00 00 <48> 8b 5e 08 48 89 cd 48 85 db 74 60 48 83 fb 0f 0f 86 86 00 00 00
RSP: 002b:00007ffe78c072e0 EFLAGS: 00010206
RAX: 0000000000000000 RBX: 000000000737b048 RCX: 000000000737b048
RDX: 0000000000000003 RSI: 00007f6c48b7b048 RDI: 000055bc3b28dee0
RBP: 000055bc3b28dee0 R08: 0000000000000010 R09: 000055bc3b28df18
R10: 0000000000000001 R11: 00007f6c5b679fa0 R12: 0000000000000003
R13: 00007ffe78c07450 R14: 00007ffe78c07450 R15: 00007f6c48b7b048
 </TASK>
```

Chris J Arges (1):
  mm/filemap: handle large folio split race in page cache lookups

 mm/filemap.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

-- 
2.43.0

^ permalink raw reply	[flat|nested] 11+ messages in thread
* [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-05 18:34 [PATCH RFC 0/1] fix for large folio split race in page cache Chris J Arges
@ 2026-03-05 18:34 ` Chris J Arges
  2026-03-05 19:24   ` Matthew Wilcox
  0 siblings, 1 reply; 11+ messages in thread

From: Chris J Arges @ 2026-03-05 18:34 UTC (permalink / raw)
  To: willy, akpm, william.kucharski
  Cc: linux-fsdevel, linux-mm, linux-kernel, kernel-team, Chris J Arges

We have been hitting VM_BUG_ON_FOLIO(!folio_contains(folio, index)) in
production environments. These machines are using XFS with large folio
support enabled and are under high memory pressure.

From reading the code it seems plausible that folio splits due to memory
reclaim are racing with filemap_fault() serving mmap page faults.

The existing code checks for truncation (folio->mapping != mapping) and
retries, but there does not appear to be equivalent handling for the
split case. The result is:

  kernel BUG at mm/filemap.c:3519!
  VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio)

This RFC patch extends the existing truncation retry checks to also
cover the case where the folio no longer contains the target index.

Fixes: e292e6d644ce ("filemap: Convert filemap_fault to folio")
Signed-off-by: Chris J Arges <carges@cloudflare.com>
---
 mm/filemap.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6cd7974d4ada..334d3f700beb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1954,13 +1954,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
 			folio_lock(folio);
 		}
 
-		/* Has the page been truncated? */
-		if (unlikely(folio->mapping != mapping)) {
+		/* Has the page been truncated or split? */
+		if (unlikely(folio->mapping != mapping) ||
+		    unlikely(!folio_contains(folio, index))) {
 			folio_unlock(folio);
 			folio_put(folio);
 			goto repeat;
 		}
-		VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
 	}
 
 	if (fgp_flags & FGP_ACCESSED)
@@ -2179,10 +2179,9 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
 			if (!folio_trylock(folio))
 				goto put;
 			if (folio->mapping != mapping ||
-			    folio_test_writeback(folio))
+			    folio_test_writeback(folio) ||
+			    !folio_contains(folio, xas.xa_index))
 				goto unlock;
-			VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index),
-					folio);
 		} else {
 			nr = 1 << xas_get_order(&xas);
 			base = xas.xa_index & ~(nr - 1);
@@ -3570,13 +3569,13 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	if (!lock_folio_maybe_drop_mmap(vmf, folio, &fpin))
 		goto out_retry;
 
-	/* Did it get truncated? */
-	if (unlikely(folio->mapping != mapping)) {
+	/* Did it get truncated or split? */
+	if (unlikely(folio->mapping != mapping) ||
+	    unlikely(!folio_contains(folio, index))) {
 		folio_unlock(folio);
 		folio_put(folio);
 		goto retry_find;
 	}
-	VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
 
 	/*
 	 * We have a locked folio in the page cache, now we need to check
-- 
2.43.0

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-05 18:34 ` [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups Chris J Arges
@ 2026-03-05 19:24   ` Matthew Wilcox
  2026-03-06 14:13     ` Kiryl Shutsemau
  0 siblings, 1 reply; 11+ messages in thread

From: Matthew Wilcox @ 2026-03-05 19:24 UTC (permalink / raw)
  To: Chris J Arges
  Cc: akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On Thu, Mar 05, 2026 at 12:34:33PM -0600, Chris J Arges wrote:
> We have been hitting VM_BUG_ON_FOLIO(!folio_contains(folio, index)) in
> production environments. These machines are using XFS with large folio
> support enabled and are under high memory pressure.
>
> From reading the code it seems plausible that folio splits due to memory
> reclaim are racing with filemap_fault() serving mmap page faults.
>
> The existing code checks for truncation (folio->mapping != mapping) and
> retries, but there does not appear to be equivalent handling for the
> split case. The result is:
>
> kernel BUG at mm/filemap.c:3519!
> VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio)

This didn't occur to me as a possibility because filemap_get_entry()
is _supposed_ to take care of it. But if this patch fixes it, then
we need to understand why it works.

folio_split() needs to be sure that it's the only one holding a reference
to the folio. To that end, it calculates the expected refcount of the
folio, and freezes it (sets the refcount to 0 if the refcount is the
expected value). Once filemap_get_entry() has incremented the refcount,
freezing will fail.

But of course, we can race. filemap_get_entry() can load a folio first,
the entire folio_split can happen, then it calls folio_try_get() and
succeeds, but it no longer covers the index we were looking for. That's
what the xas_reload() is trying to prevent -- if the index is for a
folio which has changed, then the xas_reload() should come back with a
different folio and we goto repeat.

So how did we get through this with a reference to the wrong folio?

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-05 19:24   ` Matthew Wilcox
@ 2026-03-06 14:13     ` Kiryl Shutsemau
  2026-03-06 16:28       ` Matthew Wilcox
  0 siblings, 1 reply; 11+ messages in thread

From: Kiryl Shutsemau @ 2026-03-06 14:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Chris J Arges, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On Thu, Mar 05, 2026 at 07:24:38PM +0000, Matthew Wilcox wrote:
> On Thu, Mar 05, 2026 at 12:34:33PM -0600, Chris J Arges wrote:
> > We have been hitting VM_BUG_ON_FOLIO(!folio_contains(folio, index)) in
> > production environments. These machines are using XFS with large folio
> > support enabled and are under high memory pressure.
> >
> > From reading the code it seems plausible that folio splits due to memory
> > reclaim are racing with filemap_fault() serving mmap page faults.
> >
> > The existing code checks for truncation (folio->mapping != mapping) and
> > retries, but there does not appear to be equivalent handling for the
> > split case. The result is:
> >
> > kernel BUG at mm/filemap.c:3519!
> > VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio)
>
> This didn't occur to me as a possibility because filemap_get_entry()
> is _supposed_ to take care of it. But if this patch fixes it, then
> we need to understand why it works.
>
> folio_split() needs to be sure that it's the only one holding a reference
> to the folio. To that end, it calculates the expected refcount of the
> folio, and freezes it (sets the refcount to 0 if the refcount is the
> expected value). Once filemap_get_entry() has incremented the refcount,
> freezing will fail.
>
> But of course, we can race. filemap_get_entry() can load a folio first,
> the entire folio_split can happen, then it calls folio_try_get() and
> succeeds, but it no longer covers the index we were looking for. That's
> what the xas_reload() is trying to prevent -- if the index is for a
> folio which has changed, then the xas_reload() should come back with a
> different folio and we goto repeat.
>
> So how did we get through this with a reference to the wrong folio?

What would xas_reload() return if we raced with a split and the index
pointed to a tail page before the split?

Wouldn't it return the folio that was the head, so the check would pass?

-- 
Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-06 14:13     ` Kiryl Shutsemau
@ 2026-03-06 16:28       ` Matthew Wilcox
  2026-03-06 18:36         ` Kiryl Shutsemau
  2026-03-06 20:11         ` Chris Arges
  0 siblings, 2 replies; 11+ messages in thread

From: Matthew Wilcox @ 2026-03-06 16:28 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Chris J Arges, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On Fri, Mar 06, 2026 at 02:13:26PM +0000, Kiryl Shutsemau wrote:
> On Thu, Mar 05, 2026 at 07:24:38PM +0000, Matthew Wilcox wrote:
> > folio_split() needs to be sure that it's the only one holding a reference
> > to the folio. To that end, it calculates the expected refcount of the
> > folio, and freezes it (sets the refcount to 0 if the refcount is the
> > expected value). Once filemap_get_entry() has incremented the refcount,
> > freezing will fail.
> >
> > But of course, we can race. filemap_get_entry() can load a folio first,
> > the entire folio_split can happen, then it calls folio_try_get() and
> > succeeds, but it no longer covers the index we were looking for. That's
> > what the xas_reload() is trying to prevent -- if the index is for a
> > folio which has changed, then the xas_reload() should come back with a
> > different folio and we goto repeat.
> >
> > So how did we get through this with a reference to the wrong folio?
>
> What would xas_reload() return if we raced with a split and the index
> pointed to a tail page before the split?
>
> Wouldn't it return the folio that was the head, so the check would pass?

It's not supposed to return the head in this case. But, check the code:

	if (!node)
		return xa_head(xas->xa);
	if (IS_ENABLED(CONFIG_XARRAY_MULTI)) {
		offset = (xas->xa_index >> node->shift) & XA_CHUNK_MASK;
		entry = xa_entry(xas->xa, node, offset);
		if (!xa_is_sibling(entry))
			return entry;
		offset = xa_to_sibling(entry);
	}
	return xa_entry(xas->xa, node, offset);

(obviously CONFIG_XARRAY_MULTI is enabled)

!node is almost certainly not true -- that's only the case if there's a
single entry at offset 0, and we're talking about a situation where we
have a large folio.

I think we have two cases to consider; one where we've allocated a new
node because we split an entry from order >= 6 to order < 6, and one where
we just split an entry that stays at the same level in the tree.

So let's say we're looking up an entry at index 1499 and first we got
a folio that is at index 1024, order 9. First, let's look at what
happens if it's split into two order-8 folios. We get a reference on the
first one, then we calculate offset as ((1499 >> 6) & 63), which is 23.
Unless folio splitting is buggy, the original folio is in slot 16 and
has sibling entries in 17, 18, 19, and the new folio is in slot 20 and has
sibling entries in 21, 22, 23. So we should find a sibling entry in slot
23 that points to 20, then return the new folio in slot 20, which would
mismatch the old folio that we got a refcount on.

Then let's consider what happens if we split the index at 1499 into an
order-0 folio. The folio split allocated a new node and put it at offset 23
(and populated the new node, but we don't need to be concerned with that
here). This time the lookup finds the new node and actually returns the
node instead of a folio. But that's OK, because we're just checking
for pointer equality, and there's no way this node compares equal to
any folio we found (not least because it has a low bit set to indicate
this is a node and not a pointer). So again the pointer equality check
fails, and we drop the speculative refcount we obtained and retry the loop.

Have I missed something? Maybe a memory ordering problem?

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-06 16:28       ` Matthew Wilcox
@ 2026-03-06 18:36         ` Kiryl Shutsemau
  2026-03-06 18:41           ` Matthew Wilcox
  0 siblings, 1 reply; 11+ messages in thread

From: Kiryl Shutsemau @ 2026-03-06 18:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Chris J Arges, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On Fri, Mar 06, 2026 at 04:28:19PM +0000, Matthew Wilcox wrote:
> On Fri, Mar 06, 2026 at 02:13:26PM +0000, Kiryl Shutsemau wrote:
> > On Thu, Mar 05, 2026 at 07:24:38PM +0000, Matthew Wilcox wrote:
> > > folio_split() needs to be sure that it's the only one holding a reference
> > > to the folio. To that end, it calculates the expected refcount of the
> > > folio, and freezes it (sets the refcount to 0 if the refcount is the
> > > expected value). Once filemap_get_entry() has incremented the refcount,
> > > freezing will fail.
> > >
> > > But of course, we can race. filemap_get_entry() can load a folio first,
> > > the entire folio_split can happen, then it calls folio_try_get() and
> > > succeeds, but it no longer covers the index we were looking for. That's
> > > what the xas_reload() is trying to prevent -- if the index is for a
> > > folio which has changed, then the xas_reload() should come back with a
> > > different folio and we goto repeat.
> > >
> > > So how did we get through this with a reference to the wrong folio?
> >
> > What would xas_reload() return if we raced with a split and the index
> > pointed to a tail page before the split?
> >
> > Wouldn't it return the folio that was the head, so the check would pass?
>
> It's not supposed to return the head in this case. But, check the code:
>
> 	if (!node)
> 		return xa_head(xas->xa);
> 	if (IS_ENABLED(CONFIG_XARRAY_MULTI)) {
> 		offset = (xas->xa_index >> node->shift) & XA_CHUNK_MASK;
> 		entry = xa_entry(xas->xa, node, offset);
> 		if (!xa_is_sibling(entry))
> 			return entry;
> 		offset = xa_to_sibling(entry);
> 	}
> 	return xa_entry(xas->xa, node, offset);
>
> (obviously CONFIG_XARRAY_MULTI is enabled)
>
> !node is almost certainly not true -- that's only the case if there's a
> single entry at offset 0, and we're talking about a situation where we
> have a large folio.
>
> I think we have two cases to consider; one where we've allocated a new
> node because we split an entry from order >= 6 to order < 6, and one where
> we just split an entry that stays at the same level in the tree.
>
> So let's say we're looking up an entry at index 1499 and first we got
> a folio that is at index 1024, order 9. First, let's look at what
> happens if it's split into two order-8 folios. We get a reference on the
> first one, then we calculate offset as ((1499 >> 6) & 63), which is 23.
> Unless folio splitting is buggy, the original folio is in slot 16 and
> has sibling entries in 17, 18, 19, and the new folio is in slot 20 and has
> sibling entries in 21, 22, 23. So we should find a sibling entry in slot
> 23 that points to 20, then return the new folio in slot 20, which would
> mismatch the old folio that we got a refcount on.
>
> Then let's consider what happens if we split the index at 1499 into an
> order-0 folio. The folio split allocated a new node and put it at offset 23
> (and populated the new node, but we don't need to be concerned with that
> here). This time the lookup finds the new node and actually returns the
> node instead of a folio. But that's OK, because we're just checking
> for pointer equality, and there's no way this node compares equal to
> any folio we found (not least because it has a low bit set to indicate
> this is a node and not a pointer). So again the pointer equality check
> fails, and we drop the speculative refcount we obtained and retry the loop.

Thanks for the analysis. It is very helpful. I don't understand the
xarray internals.

> Have I missed something? Maybe a memory ordering problem?

I also considered a reclaim/refault scenario, but I don't see anything.
Maybe memory ordering. Who knows. I guess we need more breadcrumbs.

The proposed change doesn't fix anything, but hides the problem.
It would be better to downgrade the VM_BUG_ON_FOLIO() to a warning +
retry.

-- 
Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-06 18:36         ` Kiryl Shutsemau
@ 2026-03-06 18:41           ` Matthew Wilcox
  2026-03-06 20:20             ` Kiryl Shutsemau
  0 siblings, 1 reply; 11+ messages in thread

From: Matthew Wilcox @ 2026-03-06 18:41 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Chris J Arges, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On Fri, Mar 06, 2026 at 06:36:30PM +0000, Kiryl Shutsemau wrote:
> The proposed change doesn't fix anything, but hides the problem.
> It would be better to downgrade the VM_BUG_ON_FOLIO() to a warning +
> retry.

The trouble is that a retry only happens to work in ... whatever scenario
this is. If there's a persistent corruption of the radix tree, a retry
might be an infinite loop, which isn't terribly helpful.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-06 18:41           ` Matthew Wilcox
@ 2026-03-06 20:20             ` Kiryl Shutsemau
  0 siblings, 0 replies; 11+ messages in thread

From: Kiryl Shutsemau @ 2026-03-06 20:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Chris J Arges, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On Fri, Mar 06, 2026 at 06:41:08PM +0000, Matthew Wilcox wrote:
> On Fri, Mar 06, 2026 at 06:36:30PM +0000, Kiryl Shutsemau wrote:
> > The proposed change doesn't fix anything, but hides the problem.
> > It would be better to downgrade the VM_BUG_ON_FOLIO() to a warning +
> > retry.
>
> The trouble is that a retry only happens to work in ... whatever scenario
> this is. If there's a persistent corruption of the radix tree, a retry
> might be an infinite loop, which isn't terribly helpful.

Knowing whether the problem is transient would be useful.

-- 
Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-06 16:28       ` Matthew Wilcox
  2026-03-06 18:36         ` Kiryl Shutsemau
@ 2026-03-06 20:11         ` Chris Arges
  2026-03-06 20:21           ` Kiryl Shutsemau
  1 sibling, 1 reply; 11+ messages in thread

From: Chris Arges @ 2026-03-06 20:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kiryl Shutsemau, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On 2026-03-06 16:28:19, Matthew Wilcox wrote:
> On Fri, Mar 06, 2026 at 02:13:26PM +0000, Kiryl Shutsemau wrote:
> > On Thu, Mar 05, 2026 at 07:24:38PM +0000, Matthew Wilcox wrote:
> > > folio_split() needs to be sure that it's the only one holding a reference
> > > to the folio. To that end, it calculates the expected refcount of the
> > > folio, and freezes it (sets the refcount to 0 if the refcount is the
> > > expected value). Once filemap_get_entry() has incremented the refcount,
> > > freezing will fail.
> > >
> > > But of course, we can race. filemap_get_entry() can load a folio first,
> > > the entire folio_split can happen, then it calls folio_try_get() and
> > > succeeds, but it no longer covers the index we were looking for. That's
> > > what the xas_reload() is trying to prevent -- if the index is for a
> > > folio which has changed, then the xas_reload() should come back with a
> > > different folio and we goto repeat.
> > >
> > > So how did we get through this with a reference to the wrong folio?
> >
> > What would xas_reload() return if we raced with a split and the index
> > pointed to a tail page before the split?
> >
> > Wouldn't it return the folio that was the head, so the check would pass?
>
> It's not supposed to return the head in this case. But, check the code:
>
> 	if (!node)
> 		return xa_head(xas->xa);
> 	if (IS_ENABLED(CONFIG_XARRAY_MULTI)) {
> 		offset = (xas->xa_index >> node->shift) & XA_CHUNK_MASK;
> 		entry = xa_entry(xas->xa, node, offset);
> 		if (!xa_is_sibling(entry))
> 			return entry;
> 		offset = xa_to_sibling(entry);
> 	}
> 	return xa_entry(xas->xa, node, offset);
>
> (obviously CONFIG_XARRAY_MULTI is enabled)
>
Yes, we have this CONFIG enabled.

Also FWIW, happy to run some additional experiments or more debugging. We
_can_ reproduce this, as a machine hits this about every day on a sample of
~128 machines. We also get crashdumps, so we can poke around there as needed.

I was going to deploy this patch onto a subset of machines, but reading
through this thread I'm a bit concerned that if a retry doesn't actually fix
the problem, we will just loop on this condition and hang.

--chris

> !node is almost certainly not true -- that's only the case if there's a
> single entry at offset 0, and we're talking about a situation where we
> have a large folio.
>
> I think we have two cases to consider; one where we've allocated a new
> node because we split an entry from order >= 6 to order < 6, and one where
> we just split an entry that stays at the same level in the tree.
>
> So let's say we're looking up an entry at index 1499 and first we got
> a folio that is at index 1024, order 9. First, let's look at what
> happens if it's split into two order-8 folios. We get a reference on the
> first one, then we calculate offset as ((1499 >> 6) & 63), which is 23.
> Unless folio splitting is buggy, the original folio is in slot 16 and
> has sibling entries in 17, 18, 19, and the new folio is in slot 20 and has
> sibling entries in 21, 22, 23. So we should find a sibling entry in slot
> 23 that points to 20, then return the new folio in slot 20, which would
> mismatch the old folio that we got a refcount on.
>
> Then let's consider what happens if we split the index at 1499 into an
> order-0 folio. The folio split allocated a new node and put it at offset 23
> (and populated the new node, but we don't need to be concerned with that
> here). This time the lookup finds the new node and actually returns the
> node instead of a folio. But that's OK, because we're just checking
> for pointer equality, and there's no way this node compares equal to
> any folio we found (not least because it has a low bit set to indicate
> this is a node and not a pointer). So again the pointer equality check
> fails, and we drop the speculative refcount we obtained and retry the loop.
>
> Have I missed something? Maybe a memory ordering problem?

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-06 20:11         ` Chris Arges
@ 2026-03-06 20:21           ` Kiryl Shutsemau
  2026-03-06 20:58             ` Chris Arges
  0 siblings, 1 reply; 11+ messages in thread

From: Kiryl Shutsemau @ 2026-03-06 20:21 UTC (permalink / raw)
  To: Chris Arges
  Cc: Matthew Wilcox, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On Fri, Mar 06, 2026 at 02:11:22PM -0600, Chris Arges wrote:
> On 2026-03-06 16:28:19, Matthew Wilcox wrote:
> > On Fri, Mar 06, 2026 at 02:13:26PM +0000, Kiryl Shutsemau wrote:
> > > On Thu, Mar 05, 2026 at 07:24:38PM +0000, Matthew Wilcox wrote:
> > > > folio_split() needs to be sure that it's the only one holding a reference
> > > > to the folio. To that end, it calculates the expected refcount of the
> > > > folio, and freezes it (sets the refcount to 0 if the refcount is the
> > > > expected value). Once filemap_get_entry() has incremented the refcount,
> > > > freezing will fail.
> > > >
> > > > But of course, we can race. filemap_get_entry() can load a folio first,
> > > > the entire folio_split can happen, then it calls folio_try_get() and
> > > > succeeds, but it no longer covers the index we were looking for. That's
> > > > what the xas_reload() is trying to prevent -- if the index is for a
> > > > folio which has changed, then the xas_reload() should come back with a
> > > > different folio and we goto repeat.
> > > >
> > > > So how did we get through this with a reference to the wrong folio?
> > >
> > > What would xas_reload() return if we raced with a split and the index
> > > pointed to a tail page before the split?
> > >
> > > Wouldn't it return the folio that was the head, so the check would pass?
> >
> > It's not supposed to return the head in this case. But, check the code:
> >
> > 	if (!node)
> > 		return xa_head(xas->xa);
> > 	if (IS_ENABLED(CONFIG_XARRAY_MULTI)) {
> > 		offset = (xas->xa_index >> node->shift) & XA_CHUNK_MASK;
> > 		entry = xa_entry(xas->xa, node, offset);
> > 		if (!xa_is_sibling(entry))
> > 			return entry;
> > 		offset = xa_to_sibling(entry);
> > 	}
> > 	return xa_entry(xas->xa, node, offset);
> >
> > (obviously CONFIG_XARRAY_MULTI is enabled)
> >
> Yes, we have this CONFIG enabled.
>
> Also FWIW, happy to run some additional experiments or more debugging. We
> _can_ reproduce this, as a machine hits this about every day on a sample of
> ~128 machines. We also get crashdumps, so we can poke around there as needed.
>
> I was going to deploy this patch onto a subset of machines, but reading
> through this thread I'm a bit concerned that if a retry doesn't actually fix
> the problem, we will just loop on this condition and hang.

It would be useful to know if the condition is persistent or if a retry
"fixes" the problem.

-- 
Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups
  2026-03-06 20:21           ` Kiryl Shutsemau
@ 2026-03-06 20:58             ` Chris Arges
  0 siblings, 0 replies; 11+ messages in thread

From: Chris Arges @ 2026-03-06 20:58 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Matthew Wilcox, akpm, william.kucharski, linux-fsdevel, linux-mm, linux-kernel, kernel-team

On 2026-03-06 20:21:59, Kiryl Shutsemau wrote:
> On Fri, Mar 06, 2026 at 02:11:22PM -0600, Chris Arges wrote:
> > On 2026-03-06 16:28:19, Matthew Wilcox wrote:
> > > On Fri, Mar 06, 2026 at 02:13:26PM +0000, Kiryl Shutsemau wrote:
> > > > On Thu, Mar 05, 2026 at 07:24:38PM +0000, Matthew Wilcox wrote:
> > > > > folio_split() needs to be sure that it's the only one holding a reference
> > > > > to the folio. To that end, it calculates the expected refcount of the
> > > > > folio, and freezes it (sets the refcount to 0 if the refcount is the
> > > > > expected value). Once filemap_get_entry() has incremented the refcount,
> > > > > freezing will fail.
> > > > >
> > > > > But of course, we can race. filemap_get_entry() can load a folio first,
> > > > > the entire folio_split can happen, then it calls folio_try_get() and
> > > > > succeeds, but it no longer covers the index we were looking for. That's
> > > > > what the xas_reload() is trying to prevent -- if the index is for a
> > > > > folio which has changed, then the xas_reload() should come back with a
> > > > > different folio and we goto repeat.
> > > > >
> > > > > So how did we get through this with a reference to the wrong folio?
> > > >
> > > > What would xas_reload() return if we raced with a split and the index
> > > > pointed to a tail page before the split?
> > > >
> > > > Wouldn't it return the folio that was the head, so the check would pass?
> > >
> > > It's not supposed to return the head in this case. But, check the code:
> > >
> > > 	if (!node)
> > > 		return xa_head(xas->xa);
> > > 	if (IS_ENABLED(CONFIG_XARRAY_MULTI)) {
> > > 		offset = (xas->xa_index >> node->shift) & XA_CHUNK_MASK;
> > > 		entry = xa_entry(xas->xa, node, offset);
> > > 		if (!xa_is_sibling(entry))
> > > 			return entry;
> > > 		offset = xa_to_sibling(entry);
> > > 	}
> > > 	return xa_entry(xas->xa, node, offset);
> > >
> > > (obviously CONFIG_XARRAY_MULTI is enabled)
> > >
> > Yes, we have this CONFIG enabled.
> >
> > Also FWIW, happy to run some additional experiments or more debugging. We
> > _can_ reproduce this, as a machine hits this about every day on a sample of
> > ~128 machines. We also get crashdumps, so we can poke around there as needed.
> >
> > I was going to deploy this patch onto a subset of machines, but reading
> > through this thread I'm a bit concerned that if a retry doesn't actually fix
> > the problem, we will just loop on this condition and hang.
>
> It would be useful to know if the condition is persistent or if a retry
> "fixes" the problem.

Fair enough. I suppose it's either crashing or locking up. Will deploy
early next week and see what happens.

--chris

^ permalink raw reply	[flat|nested] 11+ messages in thread
end of thread, other threads:[~2026-03-06 20:58 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-05 18:34 [PATCH RFC 0/1] fix for large folio split race in page cache Chris J Arges
2026-03-05 18:34 ` [PATCH RFC 1/1] mm/filemap: handle large folio split race in page cache lookups Chris J Arges
2026-03-05 19:24   ` Matthew Wilcox
2026-03-06 14:13     ` Kiryl Shutsemau
2026-03-06 16:28       ` Matthew Wilcox
2026-03-06 18:36         ` Kiryl Shutsemau
2026-03-06 18:41           ` Matthew Wilcox
2026-03-06 20:20             ` Kiryl Shutsemau
2026-03-06 20:11         ` Chris Arges
2026-03-06 20:21           ` Kiryl Shutsemau
2026-03-06 20:58             ` Chris Arges