linux-mm.kvack.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: Hugh Dickins <hughd@google.com>,
	"Liam R. Howlett" <Liam.Howlett@Oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	Sanan Hasanov <sanan.hasanov@knights.ucf.edu>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"contact@pgazz.com" <contact@pgazz.com>,
	"syzkaller@googlegroups.com" <syzkaller@googlegroups.com>,
	Huang Ying <ying.huang@intel.com>
Subject: Re: kernel BUG in page_add_anon_rmap
Date: Mon, 30 Jan 2023 10:03:45 +0100
Message-ID: <67dfd817-073e-9abb-316f-689ba8193965@redhat.com>
In-Reply-To: <9d8fb9c-1b81-67cd-e55b-34517388e1ab@google.com>

>>>
>>> I reproduced on next-20230127 (did not try upstream yet).
> 
> Upstream's fine; on next-20230127 (with David's repro) it bisects to
> 5ddaec50023e ("mm/mmap: remove __vma_adjust()").  I think I'd better
> hand on to Liam, rather than delay you by puzzling over it further myself.
> 

Thanks for identifying the problematic commit! ...

>>>
>>> I think two key things are that a) THP are set to "always" and b) we have a
>>> NUMA setup [I assume].
>>>
>>> The relevant bits:
>>>
>>> [  439.886738] page:00000000c4de9000 refcount:513 mapcount:2 mapping:0000000000000000 index:0x20003 pfn:0x14ee03
>>> [  439.893758] head:000000003d5b75a4 order:9 entire_mapcount:0 nr_pages_mapped:511 pincount:0
>>> [  439.899611] memcg:ffff986dc4689000
>>> [  439.902207] anon flags: 0x17ffffc009003f(locked|referenced|uptodate|dirty|lru|active|head|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
>>> [  439.910737] raw: 0017ffffc0020000 ffffe952c53b8001 ffffe952c53b80c8 dead000000000400
>>> [  439.916268] raw: 0000000000000000 0000000000000000 0000000000000001 0000000000000000
>>> [  439.921773] head: 0017ffffc009003f ffffe952c538b108 ffff986de35a0010 ffff98714338a001
>>> [  439.927360] head: 0000000000020000 0000000000000000 00000201ffffffff ffff986dc4689000
>>> [  439.932341] page dumped because: VM_BUG_ON_PAGE(!first && (flags & (( rmap_t)((((1UL))) << (0)))))
>>>
>>>
>>> Indeed, the mapcount of the subpage is 2 instead of 1. The subpage is only
>>> mapped into a single page table (no fork() or similar).
> 
> Yes, that mapcount:2 is weird; and what's also weird is the index:0x20003:
> what is remove_migration_pte(), in an mbind(0x20002000,...), doing with
> index:0x20003?

I was assuming the whole folio would get migrated. As you raise below,
it's all a bit unclear what should happen once THP get involved in
mbind() and page migration.

>>>
>>> I created this reduced reproducer that triggers 100%:
> 
> Very helpful, thank you.
> 
>>>
>>>
>>> #include <stdint.h>
>>> #include <unistd.h>
>>> #include <sys/mman.h>
>>> #include <numaif.h>
>>>
>>> int main(void)
>>> {
>>> 	mmap((void*)0x20000000ul, 0x1000000ul, PROT_READ|PROT_WRITE|PROT_EXEC,
>>> 	     MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0ul);
>>> 	madvise((void*)0x20000000ul, 0x1000000ul, MADV_HUGEPAGE);
>>>
>>> 	*(uint32_t*)0x20000080 = 0x80000;
>>> 	mlock((void*)0x20001000ul, 0x2000ul);
>>> 	mlock((void*)0x20000000ul, 0x3000ul);
> 
> It's not an mlock() issue in particular: quickly established by
> substituting madvise(,, MADV_NOHUGEPAGE) for those mlock() calls.
> Looks like a vma splitting issue now.

Gah, should have tried something like that first before suspecting it's 
mlock related. :)

> 
>>> 	mbind((void*)0x20002000ul, 0x1000ul, MPOL_LOCAL, NULL, 0x7fful,
>>> 	MPOL_MF_MOVE);
> 
> I guess it will turn out not to be relevant to this particular syzbug,
> but what do we expect an mbind() of just 0x1000 of a THP to do?
> 
> It's a subject I've wrestled with unsuccessfully in the past: I found
> myself arriving at one conclusion (split THP) in one place, and a contrary
> conclusion (widen range) in another place, and never had time to work out
> one unified answer.

I'm aware of a similar issue with long-term page pinning: we might want
to pin a 4k portion of a THP, but will end up blocking the whole THP
from getting migrated/swapped/split/freed/ ... until we unpin (ever?). I
wrote a reproducer [1] a while ago to show how you can effectively steal
most THP in the system with a comparatively small memlock limit using
io_uring ...

In theory, we could split the THP before long-term pinning only a
subregion ... but what if we cannot split the THP because it's already
pinned (by a previous pinning request that covered the whole THP)? Copying
instead of splitting would also not be possible if the page is already
pinned ... so we'd never want to allow long-term pinning of a THP ... but
that means that we would have to fail pinning if splitting the THP fails,
and that there would be performance consequences for THP users :/

Non-trivial ... just like mlocking only a part of a THP or mbinding 
different parts of a THP to different nodes ...

[1] https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/io_uring_thp.c

-- 
Thanks,

David / dhildenb




Thread overview: 13+ messages
2023-01-25 23:59 Sanan Hasanov
2023-01-26  0:13 ` Andrew Morton
2023-01-26 18:57 ` Matthew Wilcox
2023-01-26 19:00   ` Sanan Hasanov
2023-01-27 11:44   ` David Hildenbrand
2023-01-27 17:02     ` Hugh Dickins
2023-01-29  6:49       ` Hugh Dickins
2023-01-30  9:03         ` David Hildenbrand [this message]
2023-01-30  9:26           ` David Hildenbrand
2023-01-30 16:11         ` Matthew Wilcox
2023-01-31  1:16           ` Hillf Danton
2023-01-30 19:20         ` Yang Shi
2023-01-30 19:26         ` Liam R. Howlett
