Message-ID: <367a14f7-340e-4b29-90ae-bc3fcefdd5f4@arm.com>
Date: Wed, 6 Mar 2024 13:42:06 +0000
Subject: Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
From: Ryan Roberts
To: "Matthew Wilcox (Oracle)", Andrew Morton
Cc: linux-mm@kvack.org
In-Reply-To: <20240227174254.710559-11-willy@infradead.org>
References: <20240227174254.710559-1-willy@infradead.org> <20240227174254.710559-11-willy@infradead.org>
Hi Matthew,

Afraid I have another bug for you...

On 27/02/2024 17:42, Matthew Wilcox (Oracle) wrote:
> Hugetlb folios still get special treatment, but normal large folios
> can now be freed by free_unref_folios(). This should have a reasonable
> performance impact, TBD.
>
> Signed-off-by: Matthew Wilcox (Oracle)
> Reviewed-by: Ryan Roberts

When running some swap tests with this change (which is in mm-stable) present, I see BadThings(TM). Usually I see a "bad page state" followed by a delay of a few seconds, followed by an oops or NULL pointer deref. Bisect points to this change, and if I revert it, the problem goes away.

Here is one example, running against mm-unstable (a7f399ae964e):

[   76.239466] BUG: Bad page state in process usemem  pfn:2554a0
[   76.240196] kernel BUG at include/linux/mm.h:1120!
[   76.240198] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[   76.240724] dump_backtrace+0x98/0xf8
[   76.241523] Modules linked in:
[   76.241943] show_stack+0x20/0x38
[   76.242282]
[   76.242680] dump_stack_lvl+0x48/0x60
[   76.242855] CPU: 2 PID: 62 Comm: kcompactd0 Not tainted 6.8.0-rc5-00456-ga7f399ae964e #16
[   76.243278] dump_stack+0x18/0x28
[   76.244138] Hardware name: linux,dummy-virt (DT)
[   76.244510] bad_page+0x88/0x128
[   76.244995] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   76.245370] free_page_is_bad_report+0xa4/0xb8
[   76.246101] pc : migrate_folio_done+0x140/0x150
[   76.246572] __free_pages_ok+0x370/0x4b0
[   76.247048] lr : migrate_folio_done+0x140/0x150
[   76.247489] destroy_large_folio+0x94/0x108
[   76.247971] sp : ffff800083f5b8d0
[   76.248451] __folio_put_large+0x70/0xc0
[   76.248807] x29: ffff800083f5b8d0
[   76.249256] __folio_put+0xac/0xc0
[   76.249260] deferred_split_scan+0x234/0x340
[   76.249607] x28: 0000000000000000
[   76.249997] do_shrink_slab+0x144/0x460
[   76.250444] x27: ffff800083f5bb30
[   76.250829] shrink_slab+0x2e0/0x4e0
[   76.251234]
[   76.251604] shrink_node+0x204/0x8a0
[   76.251979] x26: 0000000000000001
[   76.252147] do_try_to_free_pages+0xd0/0x568
[   76.252527] x25: 0000000000000010
[   76.252881] try_to_free_mem_cgroup_pages+0x128/0x2d0
[   76.253337] x24: fffffc0008552800
[   76.253687] try_charge_memcg+0x12c/0x650
[   76.254219]
[   76.254583] __mem_cgroup_charge+0x6c/0xd0
[   76.255013] x23: ffff0000e6f353a8
[   76.255181] __handle_mm_fault+0xe90/0x16a8
[   76.255624] x22: ffff0013f5fa59c0
[   76.255977] handle_mm_fault+0x70/0x2b0
[   76.256413] x21: 0000000000000000
[   76.256756] do_page_fault+0x100/0x4c0
[   76.257177]
[   76.257540] do_translation_fault+0xb4/0xd0
[   76.257932] x20: 0000000000000007
[   76.258095] do_mem_abort+0x4c/0xa8
[   76.258532] x19: fffffc0008552800
[   76.258883] el0_da+0x2c/0x78
[   76.259263] x18: 0000000000000010
[   76.259616] el0t_64_sync_handler+0xe4/0x158
[   76.259933]
[   76.260286] el0t_64_sync+0x190/0x198
[   76.260729] x17: 3030303030303020 x16: 6666666666666666 x15: 3030303030303030
[   76.262010] x14: 0000000000000000 x13: 7465732029732867 x12: 616c662045455246
[   76.262746] x11: 5f54415f4b434548 x10: ffff800082e8bff8 x9 : ffff8000801276ac
[   76.263462] x8 : 00000000ffffefff x7 : ffff800082e8bff8 x6 : 0000000000000000
[   76.264182] x5 : ffff0013f5eb9d08 x4 : 0000000000000000 x3 : 0000000000000000
[   76.264903] x2 : 0000000000000000 x1 : ffff0000c105d640 x0 : 000000000000003e
[   76.265604] Call trace:
[   76.265865]  migrate_folio_done+0x140/0x150
[   76.266278]  migrate_pages_batch+0x9ec/0xff0
[   76.266716]  migrate_pages+0xd20/0xe20
[   76.267103]  compact_zone+0x7b4/0x1000
[   76.267460]  kcompactd_do_work+0x174/0x4d8
[   76.267869]  kcompactd+0x26c/0x418
[   76.268175]  kthread+0x120/0x130
[   76.268517]  ret_from_fork+0x10/0x20
[   76.268892] Code: aa1303e0 b000d161 9100c021 97fe0465 (d4210000)
[   76.269447] ---[ end trace 0000000000000000 ]---
[   76.269893] note: kcompactd0[62] exited with irqs disabled
[   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000 index:0xffffbd0a0 pfn:0x2554a0
[   76.270483] note: kcompactd0[62] exited with preempt_count 1
[   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
[   76.272521] flags: 0xbfffc0000080058(uptodate|dirty|head|swapbacked|node=0|zone=2|lastcpupid=0xffff)
[   76.273265] page_type: 0xffffffff()
[   76.273542] raw: 0bfffc0000080058 dead000000000100 dead000000000122 0000000000000000
[   76.274368] raw: 0000000ffffbd0a0 0000000000000000 00000000ffffffff 0000000000000000
[   76.275043] head: 0bfffc0000080058 dead000000000100 dead000000000122 0000000000000000
[   76.275651] head: 0000000ffffbd0a0 0000000000000000 00000000ffffffff 0000000000000000
[   76.276407] head: 0bfffc0000000000 0000000000000000 fffffc0008552848 0000000000000000
[   76.277064] head: 0000001000000000 0000000000000000 00000000ffffffff 0000000000000000
[   76.277784] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
[   76.278502] ------------[ cut here ]------------
[   76.278893] kernel BUG at include/linux/mm.h:1120!
[   76.279269] Internal error: Oops - BUG: 00000000f2000800 [#2] PREEMPT SMP
[   76.280144] Modules linked in:
[   76.280401] CPU: 6 PID: 1337 Comm: usemem Tainted: G B D 6.8.0-rc5-00456-ga7f399ae964e #16
[   76.281214] Hardware name: linux,dummy-virt (DT)
[   76.281635] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   76.282256] pc : deferred_split_scan+0x2f0/0x340
[   76.282698] lr : deferred_split_scan+0x2f0/0x340
[   76.283082] sp : ffff80008681b830
[   76.283426] x29: ffff80008681b830 x28: ffff0000cd4fb3c0 x27: fffffc0008552800
[   76.284113] x26: 0000000000000001 x25: 00000000ffffffff x24: 0000000000000001
[   76.284914] x23: 0000000000000000 x22: fffffc0008552800 x21: ffff0000e9df7820
[   76.285590] x20: ffff80008681b898 x19: ffff0000e9df7818 x18: 0000000000000000
[   76.286271] x17: 0000000000000001 x16: 0000000000000001 x15: ffff0000c0617210
[   76.286927] x14: ffff0000c10b6558 x13: 0000000000000040 x12: 0000000000000228
[   76.287543] x11: 0000000000000040 x10: 0000000000000a90 x9 : ffff800080220ed8
[   76.288176] x8 : ffff0000cd4fbeb0 x7 : 0000000000000000 x6 : 0000000000000000
[   76.288842] x5 : ffff0013f5f35d08 x4 : 0000000000000000 x3 : 0000000000000000
[   76.289538] x2 : 0000000000000000 x1 : ffff0000cd4fb3c0 x0 : 000000000000003e
[   76.290201] Call trace:
[   76.290432]  deferred_split_scan+0x2f0/0x340
[   76.290856]  do_shrink_slab+0x144/0x460
[   76.291221]  shrink_slab+0x2e0/0x4e0
[   76.291513]  shrink_node+0x204/0x8a0
[   76.291831]  do_try_to_free_pages+0xd0/0x568
[   76.292192]  try_to_free_mem_cgroup_pages+0x128/0x2d0
[   76.292599]  try_charge_memcg+0x12c/0x650
[   76.292926]  __mem_cgroup_charge+0x6c/0xd0
[   76.293289]  __handle_mm_fault+0xe90/0x16a8
[   76.293713]  handle_mm_fault+0x70/0x2b0
[   76.294031]  do_page_fault+0x100/0x4c0
[   76.294343]  do_translation_fault+0xb4/0xd0
[   76.294694]  do_mem_abort+0x4c/0xa8
[   76.294968]  el0_da+0x2c/0x78
[   76.295202]  el0t_64_sync_handler+0xe4/0x158
[   76.295565]  el0t_64_sync+0x190/0x198
[   76.295860] Code: aa1603e0 d000d0e1 9100c021 97fdc715 (d4210000)
[   76.296429] ---[ end trace 0000000000000000 ]---
[   76.296805] note: usemem[1337] exited with irqs disabled
[   76.297261] note: usemem[1337] exited with preempt_count 1

My test case is intended to stress swap:

- Running in a VM (on Ampere Altra) with 70 vCPUs and 80G RAM
- Have a 35G block ram device (CONFIG_BLK_DEV_RAM & "brd.rd_nr=1 brd.rd_size=36700160"); the ramdisk is configured as the swap backend
- Run the test case in a memcg constrained to 40G (to force mem pressure)
- Test case has 70 processes, each allocating and writing 1G of RAM

swapoff -a
mkswap /dev/ram0
swapon -f /dev/ram0

cgcreate -g memory:/mmperfcgroup
echo 40G > /sys/fs/cgroup/mmperfcgroup/memory.max
cgexec -g memory:mmperfcgroup sudo -u $(whoami) bash

Then inside that second bash shell, run this script:

--8<---
function run_usemem_once {
    ./usemem -n 70 -O 1G | grep -v "free memory"
}

function run_usemem_multi {
    size=${1}
    for i in {1..2}; do
        echo "${size} THP ${i}"
        run_usemem_once
    done
}

echo never > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled

run_usemem_multi "64K"
--8<---

It will usually get through the first iteration of the loop in run_usemem_multi() and fail on the second; I've never seen it get all the way through both iterations. "usemem" is from the vm-scalability suite. It just allocates and writes loads of anonymous memory (70 is the number of concurrent processes, 1G is the amount of memory per process). The memory pressure from the cgroup then causes lots of swap to happen.
> ---
>  mm/swap.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index dce5ea67ae05..6b697d33fa5b 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1003,12 +1003,13 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
>  		if (!folio_ref_sub_and_test(folio, nr_refs))
>  			continue;
>
> -		if (folio_test_large(folio)) {
> +		/* hugetlb has its own memcg */
> +		if (folio_test_hugetlb(folio)) {

This still looks reasonable to me after re-review, so I have no idea what the problem is. I recall seeing some weird crashes when I looked at the original RFC, but didn't have time to debug them; I wonder if the root cause is the same.

If you find a smoking gun, I'm happy to test it, if the above is too painful to reproduce.

Thanks,
Ryan

>  			if (lruvec) {
>  				unlock_page_lruvec_irqrestore(lruvec, flags);
>  				lruvec = NULL;
>  			}
> -			__folio_put_large(folio);
> +			free_huge_folio(folio);
>  			continue;
>  		}
>