On 9/18/24 2:37 AM, Jens Axboe wrote:
> On 9/17/24 7:25 AM, Matthew Wilcox wrote:
>> On Tue, Sep 17, 2024 at 01:13:05PM +0200, Chris Mason wrote:
>>> On 9/17/24 5:32 AM, Matthew Wilcox wrote:
>>>> On Mon, Sep 16, 2024 at 10:47:10AM +0200, Chris Mason wrote:
>>>>> I've got a bunch of assertions around incorrect folio->mapping and I'm
>>>>> trying to bash on the ENOMEM for readahead case. There's a GFP_NOWARN
>>>>> on those, and our systems do run pretty short on ram, so it feels right
>>>>> at least. We'll see.
>>>>
>>>> I've been running with some variant of this patch the whole way across
>>>> the Atlantic, and not hit any problems. But maybe with the right
>>>> workload ...?
>>>>
>>>> There are two things being tested here. One is whether we have a
>>>> cross-linked node (ie a node that's in two trees at the same time).
>>>> The other is whether the slab allocator is giving us a node that already
>>>> contains non-NULL entries.
>>>>
>>>> If you could throw this on top of your kernel, we might stand a chance
>>>> of catching the problem sooner. If it is one of these problems and not
>>>> something weirder.
>>>>
>>>
>>> This fires in roughly 10 seconds for me on top of v6.11. Since array seems
>>> to always be 1, I'm not sure if the assertion is right, but hopefully you
>>> can trigger yourself.
>>
>> Whoops.
>>
>> $ git grep XA_RCU_FREE
>> lib/xarray.c:#define XA_RCU_FREE ((struct xarray *)1)
>> lib/xarray.c:	node->array = XA_RCU_FREE;
>>
>> so you walked into a node which is currently being freed by RCU. Which
>> isn't a problem, of course. I don't know why I do that; it doesn't seem
>> like anyone tests it. The jetlag is seriously kicking in right now,
>> so I'm going to refrain from saying anything more because it probably
>> won't be coherent.
>
> Based on a modified reproducer from Chris (N threads reading from a
> file, M threads dropping pages), I can pretty quickly reproduce the
> xas_descend() spin on 6.9 in a vm with 128 cpus. Here's some debugging
> output with a modified version of your patch too, that ignores
> XA_RCU_FREE:

Jens and I are running slightly different versions of reader.c, but we're
seeing the same thing. v6.11 lasts all night long, and reverting those two
commits falls over in about 5 minutes or less.

I switched from a VM to bare metal, and managed to hit an assertion I'd
added to filemap_get_read_batch() (should look familiar):

	{
		/* every folio found in this mapping's tree must point back at it */
		struct address_space *fmapping = READ_ONCE(folio->mapping);

		BUG_ON(fmapping && fmapping != mapping);
	}

Walking the xarray in the crashdump shows that it's probably the same
corruption I saw in 5.19. drgn is printing like so:

print("0x%x mapping 0x%x radix index %d page index %d flags 0x%x (%s) size %d" %
      (page.address_of_(), page.mapping.value_(), index, page.index,
       page.flags, decode_page_flags(page), folio._folio_nr_pages))

And I attached radixcheck.py if you want to see the full script.
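In case the attachment doesn't survive everyone's mail setup: the script is
essentially a drgn walk of the mapping's i_pages xarray. Here's a rough
sketch of the idea, not the actual radixcheck.py; the mapping address is a
placeholder you'd pull out of the crashdump yourself:

#!/usr/bin/env drgn
# Minimal sketch, not the attached radixcheck.py. Run it against the
# crashdump with drgn, which provides `prog`. The mapping address below
# is a placeholder.
from drgn import Object
from drgn.helpers.linux.mm import decode_page_flags
from drgn.helpers.linux.xarray import xa_for_each

mapping = Object(prog, "struct address_space *", value=0xffff88a22a9614e8)

for index, entry in xa_for_each(mapping.i_pages.address_of_()):
    if entry.value_() & 3:
        continue  # skip value/internal entries (shadow entries etc.)
    page = Object(prog, "struct page", address=entry.value_())
    folio = Object(prog, "struct folio", address=entry.value_())
    print("0x%x mapping 0x%x radix index %d page index %d flags 0x%x (%s) size %d" %
          (page.address_of_(), page.mapping.value_(), index, page.index,
           page.flags, decode_page_flags(page), folio._folio_nr_pages))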
These are all from the correct mapping:

0xffffea0088b17200 mapping 0xffff88a22a9614e8 radix index 53 page index 53 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 59472
0xffffea008773e940 mapping 0xffff88a22a9614e8 radix index 54 page index 54 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 4244589144
0xffffea0084ad1d00 mapping 0xffff88a22a9614e8 radix index 55 page index 55 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 4040059330
0xffffea0088c9d840 mapping 0xffff88a22a9614e8 radix index 56 page index 56 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 5958
0xffffea00879c6300 mapping 0xffff88a22a9614e8 radix index 57 page index 57 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 112
0xffffea0086630980 mapping 0xffff88a22a9614e8 radix index 58 page index 58 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 4025236287
0xffffea0008eb6580 mapping 0xffff88a22a9614e8 radix index 59 page index 59 flags 0x5ffff000000012c (PG_referenced|PG_uptodate|PG_lru|PG_active|PG_reported) size 269
0xffffea00072db000 mapping 0xffff88a22a9614e8 radix index 60 page index 60 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 4
0xffffea000919b600 mapping 0xffff88a22a9614e8 radix index 64 page index 64 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 4

These last 3 are not:

0xffffea0008fa7000 mapping 0xffff888124910768 radix index 208 page index 192 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 64
0xffffea0008fa7000 mapping 0xffff888124910768 radix index 224 page index 192 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 64
0xffffea0008fa7000 mapping 0xffff888124910768 radix index 240 page index 192 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 64

I think the bug was in __filemap_add_folio()'s usage of xas_split_alloc()
and the tree changing before taking the lock. It's just a guess, but that
was always my biggest suspect.

To reproduce, I used:

mkfs.xfs -f some_device
mount some_device /xfs
for x in `seq 1 8` ; do
	fallocate -l100m /xfs/file$x
	./reader /xfs/file$x &
done

New reader.c attached. Jens changed his so that every reader thread was
using its own offset in the file, and he found that reproduced more
consistently.

-chris
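P.S. If you just want the shape of the reproducer without digging out the
attachment, the pattern is roughly the following. This is only an
illustration sketched in Python; the thread count, buffer size, and the use
of POSIX_FADV_DONTNEED to drop pages are stand-ins, and the attached C
program is what we actually ran:

#!/usr/bin/env python3
# Not the attached reader.c -- just an illustration of the access pattern:
# several reader threads each pread()ing at their own offset (Jens's
# variant) while one thread keeps dropping the file's pages from the page
# cache. Thread count, buffer size, and the fadvise-based page dropping
# are placeholders.
import os
import sys
import threading

NR_READERS = 16
BUF_SIZE = 64 * 1024

def reader(fd, size, tid):
    # each reader sticks to its own region of the file
    start = (size // NR_READERS) * tid
    off = start
    while True:
        os.pread(fd, BUF_SIZE, off)
        off += BUF_SIZE
        if off >= size:
            off = start

def dropper(fd, size):
    while True:
        # kick the file's pages back out of the page cache
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_DONTNEED)

def main():
    fd = os.open(sys.argv[1], os.O_RDONLY)
    size = os.fstat(fd).st_size
    for i in range(NR_READERS):
        threading.Thread(target=reader, args=(fd, size, i), daemon=True).start()
    dropper(fd, size)

if __name__ == "__main__":
    main()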