linux-mm.kvack.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: Alison Schofield <alison.schofield@intel.com>,
	Alistair Popple <apopple@nvidia.com>
Cc: linux-mm@kvack.org, nvdimm@lists.linux.dev
Subject: Re: [BUG Report] 6.15-rc1 RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
Date: Wed, 9 Apr 2025 10:40:17 +0200
Message-ID: <322e93d6-3fe2-48e9-84a9-c387cef41013@redhat.com>
In-Reply-To: <Z_W9Oeg-D9FhImf3@aschofie-mobl2.lan>

On 09.04.25 02:20, Alison Schofield wrote:
> Hi David, because this bisected to a patch you posted
> Hi Alistair,  because vmf_insert_page_mkwrite() is in the path

Hi!

> 
> A DAX unit test began failing on 6.15-rc1. I chased it as described below, but
> need XFS and/or your Folio/tail page accounting knowledge to take it further.
> 
> A DAX XFS mapping that is SHARED and R/W fails when the folio is
> unexpectedly NULL. Note that XFS PRIVATE always succeeds and XFS SHARED,
> READ_ONLY works fine. Also note that it works in all cases with EXT4.
> 

Huh, but why is the folio NULL?

insert_page_into_pte_locked() does "folio = page_folio(page)" and then 
even calls folio_get(folio) before calling folio_add_file_rmap_pte().

folio_add_file_rmap_ptes()->__folio_add_file_rmap() just passes the 
folio pointer along.

The RIP seems to be in __lruvec_stat_mod_folio(), so I assume we end up 
in __folio_mod_stat()->__lruvec_stat_mod_folio().

There, we call folio_memcg(folio). Likely we're not getting NULL back, 
which we could handle, but instead a small bogus pointer, so that we end 
up faulting at "0000000000000b00".

So maybe it's the memcg we get that is "almost NULL", and not the folio?
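
As a rough cross-check of the "almost NULL" theory: the faulting instruction in the Code: line of the oops quoted below (the bytes starting at the <48> marker) seems to decode to a load from rbx + rax*8 + 0x900. The Python sketch below hand-decodes those bytes and plugs in RAX and RBX from the register dump. The decoding is my own reading of the bytes, and treating RBX as the memcg pointer is an assumption; the trace does not say so directly.

```python
# Bytes at the faulting RIP, copied from the "Code:" line (from the <48> marker).
insn = bytes.fromhex("488b9cc300090000")

rex, opcode, modrm, sib = insn[0], insn[1], insn[2], insn[3]
assert rex == 0x48 and opcode == 0x8B  # REX.W prefix + MOV r64, r/m64

mod = modrm >> 6            # 0b10: a 32-bit displacement follows
rm = modrm & 0x7            # 0b100: a SIB byte follows
scale = 1 << (sib >> 6)     # SIB scale field
index = (sib >> 3) & 0x7    # 0 selects rax
base = sib & 0x7            # 3 selects rbx
disp = int.from_bytes(insn[4:8], "little")

# Register values taken from the oops.
RAX = 0x0
RBX = 0x200  # assumed to hold the memcg pointer at this point

fault_addr = RBX + RAX * scale + disp
print(hex(disp), hex(fault_addr))
```

The computed address matches CR2 (0000000000000b00), which would fit a memcg pointer of 0x200 rather than NULL.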

> [  417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
> [  417.796982] #PF: supervisor read access in kernel mode
> [  417.797540] #PF: error_code(0x0000) - not-present page
> [  417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
> [  417.798690] Oops: Oops: 0000 [#1] SMP NOPTI
> [  417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: G           O        6.15.0-rc1-dirty #158 PREEMPT(voluntary)
> [  417.800150] Tainted: [O]=OOT_MODULE
> [  417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> [  417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
> [  417.801948] Code: 85 97 00 00 00 48 8b 43 38 48 89 c3 48 83 e3 f8 a8 02 0f 85 1a 01 00 00 48 85 db 0f 84 28 01 00 00 66 90 49 63 86 80 3e 00 00 <48> 8b 9c c3 00 09 00 00 48 83 c3 40 4c 3b b3 c0 00 00 00 0f 85 68
> [  417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
> [  417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
> [  417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
> [  417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
> [  417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
> [  417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
> [  417.807801] FS:  00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000
> [  417.808570] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
> [  417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  417.811353] Call Trace:
> [  417.811709]  <TASK>
> [  417.812038]  folio_add_file_rmap_ptes+0x143/0x230
> [  417.812566]  insert_page_into_pte_locked+0x1ee/0x3c0
> [  417.813132]  insert_page+0x78/0xf0
> [  417.813558]  vmf_insert_page_mkwrite+0x55/0xa0
> [  417.814088]  dax_fault_iter+0x484/0x7b0
> [  417.814542]  dax_iomap_pte_fault+0x1ca/0x620
> [  417.815055]  dax_iomap_fault+0x39/0x40
> [  417.815499]  __xfs_write_fault+0x139/0x380
> [  417.815995]  ? __handle_mm_fault+0x5e5/0x1a60
> [  417.816483]  xfs_write_fault+0x41/0x50
> [  417.816966]  xfs_filemap_fault+0x3b/0xe0
> [  417.817424]  __do_fault+0x31/0x180
> [  417.817859]  __handle_mm_fault+0xee1/0x1a60
> [  417.818325]  ? debug_smp_processor_id+0x17/0x20
> [  417.818844]  handle_mm_fault+0xe1/0x2b0
> [  417.819286]  do_user_addr_fault+0x217/0x630
> [  417.819747]  ? rcu_is_watching+0x11/0x50
> [  417.820185]  exc_page_fault+0x6c/0x210
> [  417.820599]  asm_exc_page_fault+0x27/0x30
> [  417.821080] RIP: 0033:0x40130c
> [  417.821461] Code: 89 7d d8 48 89 75 d0 e8 94 ff ff ff 48 c7 45 f8 00 00 00 00 48 8b 45 d8 48 89 45 f0 eb 18 48 8b 45 f0 48 8d 50 08 48 89 55 f0 <48> c7 00 01 00 00 00 48 83 45 f8 01 48 8b 45 d0 48 c1 e8 03 48 39
> [  417.823156] RSP: 002b:00007ffcc82a8cb0 EFLAGS: 00010287
> [  417.823703] RAX: 00007f336f5f5000 RBX: 00007ffcc82a8f08 RCX: 0000000067f5a1da
> [  417.824382] RDX: 00007f336f5f5008 RSI: 0000000000000000 RDI: 0000000000036a98
> [  417.825096] RBP: 00007ffcc82a8ce0 R08: 00007f339fa84000 R09: 00000000004040b0
> [  417.825769] R10: 00007f339fa8a200 R11: 00007f339fa8a7b0 R12: 0000000000000000
> [  417.826438] R13: 00007ffcc82a8f28 R14: 0000000000403e18 R15: 00007f339fac3000
> [  417.827148]  </TASK>
> [  417.827461] Modules linked in: nd_pmem(O) dax_pmem(O) nd_btt(O) nfit(O) nd_e820(O) libnvdimm(O) nfit_test_iomap(O)
> [  417.828404] CR2: 0000000000000b00
> [  417.828807] ---[ end trace 0000000000000000 ]---
> [  417.829293] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
> 
> 
> And then, looking at the page passed to vmf_insert_page_mkwrite():
> 
> [   55.468109] flags: 0x300000000002009(locked|uptodate|reserved|node=0|zone=3)

reserved might indicate ZONE_DEVICE. But zone=3 might or might not be 
ZONE_DEVICE (depending on the kernel config).
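
To illustrate why zone=3 is ambiguous, here is a small Python sketch that rebuilds the zone enum order for a given config. The order follows enum zone_type in include/linux/mmzone.h as I remember it, and the helper name and the config combinations are made up for illustration; I don't know which config was used for the failing kernel.

```python
def zone_names(zone_dma=True, zone_dma32=True, highmem=False, zone_device=True):
    """Return zone names in enum order for a hypothetical kernel config."""
    zones = []
    if zone_dma:
        zones.append("ZONE_DMA")
    if zone_dma32:
        zones.append("ZONE_DMA32")
    zones.append("ZONE_NORMAL")
    if highmem:
        zones.append("ZONE_HIGHMEM")
    zones.append("ZONE_MOVABLE")
    if zone_device:
        zones.append("ZONE_DEVICE")
    return zones

with_dma = zone_names()                   # ZONE_DMA and ZONE_DMA32 both enabled
without_dma = zone_names(zone_dma=False)  # ZONE_DMA disabled

print(with_dma[3], without_dma[3])
```

With both DMA zones enabled, index 3 would be ZONE_MOVABLE; with ZONE_DMA disabled, index 3 would be ZONE_DEVICE.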

> [   55.468674] raw: 0300000000002009 ffff888028c27b20 00000000ffffffff ffff888033b69b88
> [   55.469270] raw: 000000000000fff5 0000000000000000 00000001ffffffff 0000000000000200
> [   55.469835] page dumped because: ALISON dump locked & uptodate pages

Do you have the other (earlier) output from __dump_page(), especially if 
this page is part of a large folio etc?

Trying to decipher:

0300000000002009 -> "unsigned long flags"
ffff888028c27b20 -> big union

As the big union overlays "unsigned long compound_head", and the last 
bit is not set, this should be a *small folio*.

That would mean that "0000000000000200" would be "unsigned long memcg_data".

0x200 might have been the folio_nr_pages before the large folio was 
split. Likely, we are not clearing that when splitting the large folio, 
resulting in a false-positive "memcg_data" after the split.
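
The decoding above can be sketched in a few lines of Python. The word positions follow my reading of the dump (word 0 is flags, word 1 overlays compound_head, word 7 is memcg_data), and the 4 KiB page size is an assumption for this x86-64 guest.

```python
# The two "raw:" lines from the page dump, as eight 64-bit words.
raw_words = (
    "0300000000002009 ffff888028c27b20 00000000ffffffff ffff888033b69b88 "
    "000000000000fff5 0000000000000000 00000001ffffffff 0000000000000200"
)
raw = [int(w, 16) for w in raw_words.split()]

flags = raw[0]
compound_head = raw[1]   # first word of the big union
memcg_data = raw[7]      # last word, assumed to be page->memcg_data

# Bit 0 of compound_head set would mean "this is a tail page".
looks_like_tail = bool(compound_head & 1)

PAGE_SIZE = 4096  # assumption: 4 KiB base pages
stale_bytes = memcg_data * PAGE_SIZE

print(looks_like_tail, hex(memcg_data), stale_bytes // (1 << 20), "MiB")
```

If 0x200 really is a leftover page count, it would correspond to 512 pages, i.e. a 2 MiB (PMD-sized) folio.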

> 
> ^ That's different:  locked|uptodate. Other page flags arriving here are
> not locked | uptodate.
> 
> Git bisect says this is first bad patch (6.14 --> 6.15-rc1)
> 4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
> 
> Experimenting a bit with the patch, UN-defining NR_PAGES_IN_LARGE_FOLIO,
> avoids the problem.
> 
> The way that patch is reusing memory in tail pages and the fact that it
> only fails in XFS (not ext4) suggests the XFS is depending on tail pages
> in a way that ext4 does not.

IIRC, XFS supports large folios but ext4 does not. I don't really know 
whether the same applies with DAX, though. Ordinary XFS large folio tests 
seem to work just fine, so the question is what DAX-specific behavior we 
are hitting here.

When we free large folios back to the buddy, we set "folio->_nr_pages = 
0", to make the "page->memcg_data" check in page_bad_reason() happy. 
Also, just before the large folio split for ordinary large folios, we 
set "folio->_nr_pages = 0".
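
A toy model of why that clearing matters, using a ctypes union to mimic two names aliasing the same word. The class, field widths and the split() helper are purely hypothetical and simplified; this is not the kernel's layout or its split code, just a sketch of the aliasing.

```python
import ctypes

class FirstTailWord(ctypes.Union):
    # Toy stand-in for the overlay: both fields alias the same 8 bytes.
    # "nr_pages" models folio->_nr_pages, "memcg_data" models page->memcg_data.
    _fields_ = [
        ("nr_pages", ctypes.c_uint64),
        ("memcg_data", ctypes.c_uint64),
    ]

def split(word, clear_first):
    """Pretend to split a large folio; return what memcg_data reads afterwards."""
    if clear_first:
        word.nr_pages = 0  # what the ordinary free/split paths do
    return word.memcg_data

# A 512-page large folio whose page count lives in the first tail page.
w = FirstTailWord()
w.nr_pages = 0x200
stale = split(w, clear_first=False)

w = FirstTailWord()
w.nr_pages = 0x200
clean = split(w, clear_first=True)

print(hex(stale), hex(clean))
```

Without the clearing step, the former tail page reads back 0x200 as its memcg_data; with it, the field is 0 as expected.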

Maybe there is something missing in ZONE_DEVICE freeing/splitting code 
of large folios, where we should do the same, to make sure that all 
page->memcg_data is actually 0?

I assume so. Let me dig.

-- 
Cheers,

David / dhildenb



