Subject: Re: [PATCH 0/5] Remove some races around folio_test_hugetlb
To: "Matthew Wilcox (Oracle)"
CC: Oscar Salvador, Linux-MM
References: <20240301214712.2853147-1-willy@infradead.org>
From: Miaohe Lin
Date: Mon, 4 Mar 2024 17:09:58 +0800
In-Reply-To: <20240301214712.2853147-1-willy@infradead.org>

On 2024/3/2 5:47, Matthew Wilcox (Oracle) wrote:
> Oscar and I have been exchanging a bit of email recently about the
> bug reported here:
> https://lore.kernel.org/all/ZXNhGsX32y19a2Xv@casper.infradead.org

Thanks for your patch.

>
> I've come to the conclusion that folio_test_hugetlb() is just too fragile
> as it can give both false positives and false negatives, as well as
> resulting in the above bug.  With this patch series, it becomes a lot
> more robust.  In the memory-failure case, we always hold the hugetlb_lock
> so it's perfectly reliable.  In the compaction case, it's unreliable, but
> the failures are acceptable and we recheck after taking the hugetlb_lock.

I encountered a similar issue with the PageSwapCache check when doing
memory-failure testing:

[66258.945079] page:00000000135e1205 refcount:1 mapcount:0 mapping:0000000000000000 index:0x9b pfn:0xa04e9a
[66258.949096] head:0000000038449724 order:9 entire_mapcount:1 nr_pages_mapped:0 pincount:0
[66258.949485] memcg:ffff95fb43379000
[66258.950334] anon flags: 0x6fffc00000a0068(uptodate|lru|head|mappedtodisk|swapbacked|node=1|zone=2|lastcpupid=0x3fff)
[66258.951212] page_type: 0xffffffff()
[66258.951882] raw: 06fffc0000000000 ffffc89628138001 dead000000000122 dead000000000400
[66258.952273] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[66258.952884] head: 06fffc00000a0068 ffffc896218a8008 ffffc89621680008 ffff95fb4349c439
[66258.953239] head: 0000000700000600 0000000000000000 00000001ffffffff ffff95fb43379000
[66258.953725] page dumped because: VM_BUG_ON_PAGE(PageTail(page))
[66258.954497] ------------[ cut here ]------------
[66258.954937] kernel BUG at include/linux/page-flags.h:313!
[66258.956502] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[66258.957001] CPU: 14 PID: 174237 Comm: page-types Kdump: loaded Not tainted 6.8.0-rc1-00162-gd162e170f118 #11
[66258.957001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[66258.958415] RIP: 0010:folio_flags.constprop.0+0x1c/0x50
[66258.958415] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 8b 57 08 48 89 f8 83 e2 01 74 12 48 c7 c6 a0 59 34 a7 48 89 c7 e8 b5 60 e8 ff 90 <0f> 0b 66 90 c3 cc cc cc cc f7 c7 ff 0f 00 00 75 1a 48 8b 17 83 e2
[66258.958415] RSP: 0018:ffffa0f38ae53e00 EFLAGS: 00000282
[66258.958415] RAX: 0000000000000033 RBX: 0000000000000000 RCX: ffff96031fd9c9c8
[66258.958415] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff96031fd9c9c0
[66258.958415] RBP: ffffc8962813a680 R08: ffffffffa7756f88 R09: 0000000000009ffb
[66258.962155] R10: 000000000000054a R11: ffffffffa7726fa0 R12: 06fffc0000000000
[66258.962155] R13: 0000000000000000 R14: 00007fff93bf1348 R15: 0000000000a04e9a
[66258.962155] FS: 00007f47cc5c4740(0000) GS:ffff96031fd80000(0000) knlGS:0000000000000000
[66258.962155] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[66258.962155] CR2: 00007fff93c7b000 CR3: 0000000850c28000 CR4: 00000000000006f0
[66258.962155] Call Trace:
[66258.962155]  <TASK>
[66258.965730]  ? die+0x32/0x90
[66258.965730]  ? do_trap+0xdf/0x110
[66258.965730]  ? folio_flags.constprop.0+0x1c/0x50
[66258.965730]  ? do_error_trap+0x8b/0x110
[66258.965730]  ? folio_flags.constprop.0+0x1c/0x50
[66258.965730]  ? folio_flags.constprop.0+0x1c/0x50
[66258.965730]  ? exc_invalid_op+0x53/0x70
[66258.965730]  ? folio_flags.constprop.0+0x1c/0x50
[66258.965730]  ? asm_exc_invalid_op+0x1a/0x20
[66258.965730]  ? folio_flags.constprop.0+0x1c/0x50
[66258.965730]  stable_page_flags+0x210/0x940
[66258.965730]  kpageflags_read+0x97/0xf0
[66258.965730]  vfs_read+0xa0/0x370
[66258.965730]  __x64_sys_pread64+0x90/0xc0
[66258.965730]  do_syscall_64+0xcd/0x1e0
[66258.965730]  entry_SYSCALL_64_after_hwframe+0x6f/0x77
[66258.965730] RIP: 0033:0x7f47cc31274a
[66258.969711] Code: 44 24 78 00 00 00 00 e9 2b f1 ff ff 0f 1f 40 00 f3 0f 1e fa 49 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[66258.969711] RSP: 002b:00007fff93af1298 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
[66258.969711] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f47cc31274a
[66258.969711] RDX: 0000000000000008 RSI: 00007fff93bf1340 RDI: 0000000000000004
[66258.969711] RBP: 00007fff93af12e0 R08: 0000000000000001 R09: 8100000000a04e99
[66258.969711] R10: 00000000050274d0 R11: 0000000000000246 R12: 00007fff93cf1588
[66258.972680] R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f47cc609040
[66258.972680]  </TASK>
[66258.972680] Modules linked in: mce_inject hwpoison_inject

After debugging, I think the race below leads to the above panic:

CPU1                                              CPU2
kpageflags_read
 stable_page_flags
  PageSwapCache() check on a 4k page without a page refcount held
   folio_test_swapcache(page_folio(page));
    folio_test_swapbacked(folio) && /* page is swapbacked. */
                                                  page is freed into buddy and merged into larger order.
                                                  page is allocated as THP tail page.
    test_bit(PG_swapcache, folio_flags(folio, 0)); /* BUG_ON PageTail check in folio_flags. It's a tail page now! */

So the PageSwapCache test is fragile too. Any thoughts on how to fix this
'similar' issue?

Thanks.

>
> The cost of this reliability is that we now consume the word I recently
> freed in folio->page[1].  I think this is acceptable; we've still gained
> a completely reliable folio_test_hugetlb() (which we didn't have before
> I started messing around with the folio dtors).
> Non-hugetlb users can use large_id as a pointer to something else
> entirely, or even as a non-pointer, as long as they can guarantee it
> can't conflict (ie don't use it as a bitfield).
>
> So far, this is working for me.  Some stress testing would be appreciated.
>
> Matthew Wilcox (Oracle) (5):
>   hugetlb: Make folio_test_hugetlb safer to call
>   hugetlb: Add hugetlb_pfn_folio
>   memory-failure: Use hugetlb_pfn_folio
>   memory-failure: Reorganise get_huge_page_for_hwpoison()
>   compaction: Use hugetlb_pfn_folio in isolate_migratepages_block
>
>  include/linux/hugetlb.h    | 13 ++-----
>  include/linux/mm.h         |  8 -----
>  include/linux/mm_types.h   |  4 ++-
>  include/linux/page-flags.h | 25 +++----------
>  kernel/vmcore_info.c       |  3 +-
>  mm/compaction.c            | 16 ++++-----
>  mm/huge_memory.c           | 10 ++----
>  mm/hugetlb.c               | 72 +++++++++++++++++++++++++++++---------
>  mm/memory-failure.c        | 14 +++++---
>  9 files changed, 87 insertions(+), 78 deletions(-)
>
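
The CPU1/CPU2 interleaving described in the reply above can be modelled in
userspace. What follows is only a toy sketch in C, not kernel code:
struct toy_page, the FLAG_* values, race_hook and both *_test_swapcache()
helpers are hypothetical names invented for illustration, and the "snapshot"
variant is just one possible way to avoid asserting on live state, not a fix
proposed in this thread. The hook argument stands in for CPU2 and runs at the
point where the page can be freed and reallocated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for page state; not kernel definitions. */
enum { FLAG_SWAPBACKED = 1u << 0, FLAG_SWAPCACHE = 1u << 1 };

struct toy_page {
	unsigned int flags;
	bool is_tail;	/* models PageTail(): true once the page is a THP tail */
};

/* Models the window in which another CPU may free and reallocate the page. */
typedef void (*race_hook)(struct toy_page *page);

enum check_result { NOT_SWAPCACHE, IS_SWAPCACHE, WOULD_BUG };

/*
 * Two-step test with the same shape as the race in the mail: the first read
 * sees a non-tail, swap-backed page; the second read goes through a tail
 * check, which is where VM_BUG_ON_PAGE() fired in the report.
 */
static enum check_result racy_test_swapcache(struct toy_page *page, race_hook hook)
{
	if (!(page->flags & FLAG_SWAPBACKED))
		return NOT_SWAPCACHE;
	if (hook)
		hook(page);		/* "CPU2" runs here */
	if (page->is_tail)
		return WOULD_BUG;	/* stands in for the PageTail assertion */
	return (page->flags & FLAG_SWAPCACHE) ? IS_SWAPCACHE : NOT_SWAPCACHE;
}

/*
 * Snapshot variant: copy the state once and decide from the copy only.
 * The answer may be stale, but no assertion on live state can fire.
 */
static enum check_result snapshot_test_swapcache(struct toy_page *page, race_hook hook)
{
	struct toy_page snap = *page;	/* single read of the live page */

	if (hook)
		hook(page);		/* later changes no longer matter */
	if (snap.is_tail || !(snap.flags & FLAG_SWAPBACKED))
		return NOT_SWAPCACHE;
	return (snap.flags & FLAG_SWAPCACHE) ? IS_SWAPCACHE : NOT_SWAPCACHE;
}

/* Models "freed into buddy, merged, then reallocated as a THP tail page". */
static void become_thp_tail(struct toy_page *page)
{
	assert(page != NULL);
	page->flags = 0;
	page->is_tail = true;
}
```

With a page that starts out swap-backed and non-tail, passing
become_thp_tail as the hook makes racy_test_swapcache() report WOULD_BUG,
while snapshot_test_swapcache() returns a possibly stale but harmless answer.
A real kernel snapshot would also have to cope with torn reads of the page
structure, which this model ignores.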