From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id DAEE0C0015E for ; Wed, 26 Jul 2023 16:19:49 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 5654E8D0001; Wed, 26 Jul 2023 12:19:49 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 515636B0072; Wed, 26 Jul 2023 12:19:49 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 3DDE68D0001; Wed, 26 Jul 2023 12:19:49 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 2932F6B0071 for ; Wed, 26 Jul 2023 12:19:49 -0400 (EDT) Received: from smtpin09.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id DB8C9C02E1 for ; Wed, 26 Jul 2023 16:19:48 +0000 (UTC) X-FDA: 81054274056.09.7765EC1 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf05.hostedemail.com (Postfix) with ESMTP id CC69710001E for ; Wed, 26 Jul 2023 16:19:46 +0000 (UTC) Authentication-Results: imf05.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=t3n+mrku; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf05.hostedemail.com: domain of nathan@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=nathan@kernel.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1690388386; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=+iitGf/SLGoojFqjhHha8Mr8JzImIWK1oPkv4xNDQig=; b=NFR9uHKAP+3JU9+Bdf9vyLXcxLC/xvrtMHZZDVV4m4W+wGeLQMfi/tCLbt/9sOfoAs6t6O h9icPt1s3aDWCdtqoGeeV3hW1EtS7HM1+e9n7GZrYbusDoXzwCvPeF3uJqjT1CI/bpcDCG Dw9MIS5awyTdQy52reFbPfDfBk/rJig= ARC-Authentication-Results: i=1; imf05.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=t3n+mrku; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf05.hostedemail.com: domain of nathan@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=nathan@kernel.org ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1690388386; a=rsa-sha256; cv=none; b=i3LVNica8VBDP9QXQ+SfeqFDXyagJL2iZ2/UGxWscSfkyeGGDCvaDz98JwWnRpVNPXqaQz vHAXClerj1qt0aa+kMDj0w4JYBV21GXbvrz2DXQP1KHe3whkC2/uzVfddIll3EXLL5s2wk uA7+aA7wCuS8HYvb5LOtqiL6Pr0xSdY= Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id B5EBE61A39; Wed, 26 Jul 2023 16:19:45 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 7F430C433C8; Wed, 26 Jul 2023 16:19:44 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1690388385; bh=q0XNGgzUjSJEz9MnAbeQ62rwy2DQXpbgmuxGrKUCKJ0=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=t3n+mrkudX9kGHBZNHm+2fmuD/KmIxToahj+uwjhkD1khveZil+gpp5keMnzShQsI 50VFInEN8b5uIzJyLWlDQSATYGJ+t/3+2Tgx54clw5NMX/I51gXUslNSk6Hoa6LjrE 9kLsuxnF1tiGfkVep48AQOmJ5k8aoV53Z3P61s5QSgEo50Ny0oGb9k92DeiHpRhYzp iK9ffk0U6vT79ONkBrEF6cktrXNeOTlhkIEkb+QcqAzzr65D/TLwuK1qBhaEEm/iiz uk4Xl2OIf85wnJnuFz8gNF7EuZ8LZodVZ0xAjNrnegz36Qam8C5YC4RchvNTa6zpcd cGo7blDEiQHZQ== Date: Wed, 26 Jul 2023 09:19:42 -0700 From: Nathan Chancellor To: Ryan Roberts Cc: Andrew Morton , Matthew Wilcox , Yin Fengwei , David Hildenbrand , Yu Zhao , Yang Shi , "Huang, Ying" , Zi Yan , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH v3 3/3] mm: Batch-zap large anonymous folio PTE mappings Message-ID: <20230726161942.GA1123863@dev-arch.thelio-3990X> References: <20230720112955.643283-1-ryan.roberts@arm.com> <20230720112955.643283-4-ryan.roberts@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20230720112955.643283-4-ryan.roberts@arm.com> X-Rspamd-Queue-Id: CC69710001E X-Rspam-User: X-Rspamd-Server: rspam02 X-Stat-Signature: chr1sxwz6s357nsjr89nqqs4j8xnftzc X-HE-Tag: 1690388386-814744 X-HE-Meta: U2FsdGVkX1+VH/DCRX+C/+ym/jaxD6I7bCbyHVGQi9PrDUYtgVaQRtDAkjpTopCqdDkUzZipFgbVV2ZoEp0Mu4W5oU/AbzcfXZZBcDZLtA6AvVvZYFSq5d03c9+f1riAZNwKqOGNI9kUjUxPD2y6GGIJhvk5gxzrQWyxmyP7psgApK+ba0D5bMzsfcE5IdhNa4UFoSJ/Qp9okOhXmK6rNPq9q0jSk0bki1Ulld8vtoUdLyuhORrH4dKazH4GyfX9NaEpDyEncErdOwTdTYu8etA+cnIE03cB0PKvWlpX5wFp2jHsZWBTz2aG377r80GeZkmGyUztJsUf8VZfnW8n4yGpe4fTk2vMF/EzlwoBfjQbBJcuYpi4rZkA1MLvCP/UGDhYfmgWhVAYaPBtk61CNIEH7FSav4XnXNiU+yKORHbRpMgKqiimF9944UxrV6hZUPFUvSqYGOIQ2UotNA2RETFHICcYBCZWLn73bDGqxBbQn8+7DCkPmpXOwZyMSYUBuXh1PvVBT5JBkuVPT8nDP//qcu2yCFLBBjt5rmm2n0aQ25qRhXW5lXc0GE4R9PyiqI7XAN81F4LC6BTMYh6ogXlsfn/jsF6z9Dql4VPhMFDLl8n1SlhO8mTyo7QQkYwxcV0pJmO8vHBtrzZAy3llO6mqtDSZJLLZ2V3CXfnr+y4HKKDi4tPBcPaU5TYYGSrNcAMfYIxGfBkvXL2Ce62ENNO3gC/RuTZHsyXQmIUoK1PYhldUE3R6/bZF58qsFGs5CHsKdRThRm2ecmoGSFkDjBdbs0HN0LXLnMPGltvda+hhKvOJx4INUPIaIYWtwH2SPhvznR52uvmK8L210Cva8bfglAReNyCv1iJxrGkiomCP19y8ougKQCuEl6JigwOQfm3FUGjwmr7Pww/d1Zs8SXPBSiRDh8vKySj5x70r+DJQzZJsu/UgQHfau61AG4rFYsaBE9/Xm6GQO6gY/Fg AgUcQSOA z+ETRwrnFD/ckqf5XJxUPTZGxXtuea3Znp8nvIe5OipqgCjsb+g58oOyd4ed3DgoR6KcqlD9fgxnX6rvO3qHDn/pYVIuFFepSGjV1EhA8hQ2tAgpQYQ1BjhY39nLpgalwek+lCk3bmQYHE9I2+fz59xodwbFsqptg6Sk8Iq5eWrl6zXGsI/lmermrVQqK3P482yZLKYsc8X0J3NGT5pNfpFXAFw+FpzJWqYBrl9Ztk2K0uELHqK0RXzr+ARRVzghnuqdCqS4aW/YEL1bm2J0+/N7Mhx60OqqbPKrouizQp0oY2KBWpp6NNtqfat8S0s+qnP0hydnkz9qsyxN8xgl04e0+v9Tn4xs8rpXXho0gba7BHB855BEpFpTcsYf6smMkF7AC7bHQrbRx3OaXVtKFV+P0zFVYcVdGooyf4j6Iulh2trHAbXVVSPEORA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hi Ryan, On Thu, Jul 20, 2023 at 12:29:55PM +0100, Ryan Roberts wrote: > This allows batching the rmap removal with folio_remove_rmap_range(), > which means we avoid spuriously adding a partially unmapped folio to the > deferred split queue in the common case, which reduces split queue lock > contention. > > Previously each page was removed from the rmap individually with > page_remove_rmap(). If the first page belonged to a large folio, this > would cause page_remove_rmap() to conclude that the folio was now > partially mapped and add the folio to the deferred split queue. But > subsequent calls would cause the folio to become fully unmapped, meaning > there is no value to adding it to the split queue. > > Signed-off-by: Ryan Roberts > --- > mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 120 insertions(+) > > diff --git a/mm/memory.c b/mm/memory.c > index 01f39e8144ef..189b1cfd823d 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1391,6 +1391,94 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, > pte_install_uffd_wp_if_needed(vma, addr, pte, pteval); > } > > +static inline unsigned long page_cont_mapped_vaddr(struct page *page, > + struct page *anchor, unsigned long anchor_vaddr) > +{ > + unsigned long offset; > + unsigned long vaddr; > + > + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT; > + vaddr = anchor_vaddr + offset; > + > + if (anchor > page) { > + if (vaddr > anchor_vaddr) > + return 0; > + } else { > + if (vaddr < anchor_vaddr) > + return ULONG_MAX; > + } > + > + return vaddr; > +} > + > +static int folio_nr_pages_cont_mapped(struct folio *folio, > + struct page *page, pte_t *pte, > + unsigned long addr, unsigned long end) > +{ > + pte_t ptent; > + int floops; > + int i; > + unsigned long pfn; > + struct page *folio_end; > + > + if (!folio_test_large(folio)) > + return 1; > + > + folio_end = &folio->page + folio_nr_pages(folio); > + end = min(page_cont_mapped_vaddr(folio_end, page, addr), end); > + floops = (end - addr) >> PAGE_SHIFT; > + pfn = page_to_pfn(page); > + pfn++; > + pte++; > + > + for (i = 1; i < floops; i++) { > + ptent = ptep_get(pte); > + > + if (!pte_present(ptent) || pte_pfn(ptent) != pfn) > + break; > + > + pfn++; > + pte++; > + } > + > + return i; > +} > + > +static unsigned long try_zap_anon_pte_range(struct mmu_gather *tlb, > + struct vm_area_struct *vma, > + struct folio *folio, > + struct page *page, pte_t *pte, > + unsigned long addr, int nr_pages, > + struct zap_details *details) > +{ > + struct mm_struct *mm = tlb->mm; > + pte_t ptent; > + bool full; > + int i; > + > + for (i = 0; i < nr_pages;) { > + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); > + tlb_remove_tlb_entry(tlb, pte, addr); > + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); > + full = __tlb_remove_page(tlb, page, 0); > + > + if (unlikely(page_mapcount(page) < 1)) > + print_bad_pte(vma, addr, ptent, page); > + > + i++; > + page++; > + pte++; > + addr += PAGE_SIZE; > + > + if (unlikely(full)) > + break; > + } > + > + folio_remove_rmap_range(folio, page - i, i, vma); > + > + return i; > +} > + > static unsigned long zap_pte_range(struct mmu_gather *tlb, > struct vm_area_struct *vma, pmd_t *pmd, > unsigned long addr, unsigned long end, > @@ -1428,6 +1516,38 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, > page = vm_normal_page(vma, addr, ptent); > if (unlikely(!should_zap_page(details, page))) > continue; > + > + /* > + * Batch zap large anonymous folio mappings. This allows > + * batching the rmap removal, which means we avoid > + * spuriously adding a partially unmapped folio to the > + * deferrred split queue in the common case, which > + * reduces split queue lock contention. > + */ > + if (page && PageAnon(page)) { > + struct folio *folio = page_folio(page); > + int nr_pages_req, nr_pages; > + > + nr_pages_req = folio_nr_pages_cont_mapped( > + folio, page, pte, addr, end); > + > + nr_pages = try_zap_anon_pte_range(tlb, vma, > + folio, page, pte, addr, > + nr_pages_req, details); > + > + rss[mm_counter(page)] -= nr_pages; > + nr_pages--; > + pte += nr_pages; > + addr += nr_pages << PAGE_SHIFT; > + > + if (unlikely(nr_pages < nr_pages_req)) { > + force_flush = 1; > + addr += PAGE_SIZE; > + break; > + } > + continue; > + } > + > ptent = ptep_get_and_clear_full(mm, addr, pte, > tlb->fullmm); > tlb_remove_tlb_entry(tlb, pte, addr); > -- > 2.25.1 > After this change in -next as commit 904d9713b3b0 ("mm: batch-zap large anonymous folio PTE mappings"), I see the following splats several times when booting Debian's s390x configuration (which I have mirrored at [1]) in QEMU (bisect log below): $ qemu-system-s390x \ -display none \ -nodefaults \ -M s390-ccw-virtio \ -kernel arch/s390/boot/bzImage \ -initrd rootfs.cpio \ -m 512m \ -serial mon:stdio KASLR disabled: CPU has no PRNG KASLR disabled: CPU has no PRNG [ 2.502282] Linux version 6.5.0-rc3+ (nathan@dev-arch.thelio-3990X) (s390-linux-gcc (GCC) 13.1.0, GNU ld (GNU Binutils) 2.40) #1 SMP Wed Jul 26 09:14:20 MST 2023 ... [ 3.406011] Freeing initrd memory: 7004K [ 3.492739] BUG: Bad page state in process modprobe pfn:01b18 [ 3.492909] page:00000000233d9f2f refcount:0 mapcount:1 mapping:0000000000000000 index:0xdb pfn:0x1b18 [ 3.492998] flags: 0xa0004(uptodate|mappedtodisk|swapbacked|zone=0) [ 3.493195] page_type: 0x0() [ 3.493457] raw: 00000000000a0004 0000000000000100 0000000000000122 0000000000000000 [ 3.493492] raw: 00000000000000db 0000000000000000 0000000000000000 0000000000000000 [ 3.493525] page dumped because: nonzero mapcount [ 3.493549] Modules linked in: [ 3.493719] CPU: 0 PID: 38 Comm: modprobe Not tainted 6.5.0-rc3+ #1 [ 3.493814] Hardware name: QEMU 8561 QEMU (KVM/Linux) [ 3.493892] Call Trace: [ 3.494117] [<0000000000add35a>] dump_stack_lvl+0x62/0x88 [ 3.494333] [<00000000003d565a>] bad_page+0x8a/0x130 [ 3.494355] [<00000000003d6728>] free_unref_page_prepare+0x268/0x3d8 [ 3.494375] [<00000000003d9408>] free_unref_page+0x48/0x140 [ 3.494394] [<00000000003ad99c>] unmap_page_range+0x924/0x1388 [ 3.494412] [<00000000003ae54c>] unmap_vmas+0x14c/0x200 [ 3.494429] [<00000000003be2f2>] exit_mmap+0xba/0x3a0 [ 3.494447] [<0000000000147000>] __mmput+0x50/0x180 [ 3.494466] [<0000000000152a00>] do_exit+0x320/0xb40 [ 3.494484] [<0000000000153450>] do_group_exit+0x40/0xb8 [ 3.494502] [<00000000001534f6>] __s390x_sys_exit_group+0x2e/0x30 [ 3.494520] [<0000000000b05080>] __do_syscall+0x1e8/0x210 [ 3.494539] [<0000000000b15970>] system_call+0x70/0x98 [ 3.494663] Disabling lock debugging due to kernel taint [ 3.494809] BUG: Bad page map in process modprobe pte:01b1831f pmd:1fff9000 [ 3.494833] page:00000000233d9f2f refcount:0 mapcount:0 mapping:0000000000000000 index:0xdb pfn:0x1b18 [ 3.494852] flags: 0xa0004(uptodate|mappedtodisk|swapbacked|zone=0) [ 3.494866] page_type: 0xffffffff() [ 3.494882] raw: 00000000000a0004 0000000000000100 0000000000000122 0000000000000000 [ 3.494898] raw: 00000000000000db 0000000000000000 ffffffff00000000 0000000000000000 [ 3.494908] page dumped because: bad pte [ 3.494923] addr:000002aa1d75c000 vm_flags:08100071 anon_vma:000000001fffc340 mapping:000000000286d6b8 index:db [ 3.494983] file:busybox fault:shmem_fault mmap:shmem_mmap read_folio:0x0 [ 3.495247] CPU: 0 PID: 38 Comm: modprobe Tainted: G B 6.5.0-rc3+ #1 [ 3.495267] Hardware name: QEMU 8561 QEMU (KVM/Linux) [ 3.495277] Call Trace: [ 3.495285] [<0000000000add35a>] dump_stack_lvl+0x62/0x88 [ 3.495307] [<00000000003ab30e>] print_bad_pte+0x176/0x2c8 [ 3.495324] [<00000000003ae098>] unmap_page_range+0x1020/0x1388 [ 3.495341] [<00000000003ae54c>] unmap_vmas+0x14c/0x200 [ 3.495357] [<00000000003be2f2>] exit_mmap+0xba/0x3a0 [ 3.495375] [<0000000000147000>] __mmput+0x50/0x180 [ 3.495394] [<0000000000152a00>] do_exit+0x320/0xb40 [ 3.495411] [<0000000000153450>] do_group_exit+0x40/0xb8 [ 3.495429] [<00000000001534f6>] __s390x_sys_exit_group+0x2e/0x30 [ 3.495447] [<0000000000b05080>] __do_syscall+0x1e8/0x210 [ 3.495465] [<0000000000b15970>] system_call+0x70/0x98 ... The rootfs is available at [2] if it is relevant. I am happy to provide any additional information or test patches as necessary. Cheers, Nathan [1]: https://github.com/nathanchance/llvm-kernel-testing/blob/79aa4ab2edc595979366c427cb49f477c7f31c68/configs/debian/s390x.config [2]: https://github.com/ClangBuiltLinux/boot-utils/releases/download/20230707-182910/s390-rootfs.cpio.zst # bad: [0ba5d07205771c50789fd9063950aa75e7f1183f] Add linux-next specific files for 20230726 # good: [18b44bc5a67275641fb26f2c54ba7eef80ac5950] ovl: Always reevaluate the file signature for IMA git bisect start '0ba5d07205771c50789fd9063950aa75e7f1183f' '18b44bc5a67275641fb26f2c54ba7eef80ac5950' # bad: [8fe1b33ece8f8fe1377082e839817886cb8c0f81] Merge branch 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git git bisect bad 8fe1b33ece8f8fe1377082e839817886cb8c0f81 # bad: [932bd67958459da3dc755b5bea7758e9d951dee5] Merge branch 'ti-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ti/linux.git git bisect bad 932bd67958459da3dc755b5bea7758e9d951dee5 # bad: [a4abec0a3653fb9dfb3ea6cea2ad1d36f507ca97] Merge branch 'perf-tools-next' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git git bisect bad a4abec0a3653fb9dfb3ea6cea2ad1d36f507ca97 # bad: [5a52022bde252d090e051077af297dcfeff9fd0d] powerpc/book3s64/radix: add debug message to give more details of vmemmap allocation git bisect bad 5a52022bde252d090e051077af297dcfeff9fd0d # good: [671115657ee2403d18cb849061d7245687d9fdc5] mm/pgtable: notes on pte_offset_map[_lock]() git bisect good 671115657ee2403d18cb849061d7245687d9fdc5 # good: [26c3a4fe0eb027ff00ad42168c8732db0c0b40d7] arm64/smmu: use TLBI ASID when invalidating entire range git bisect good 26c3a4fe0eb027ff00ad42168c8732db0c0b40d7 # bad: [8585d0b53780f11cad8dad37997369949e3d5043] mm: memcg: use rstat for non-hierarchical stats git bisect bad 8585d0b53780f11cad8dad37997369949e3d5043 # bad: [9abfe35eb187c3f79af5bb07c2f9815a480c4965] mm/compaction: correct comment of candidate pfn in fast_isolate_freepages git bisect bad 9abfe35eb187c3f79af5bb07c2f9815a480c4965 # bad: [208f64c37a4e22b25b8037776c5713545eaf54fa] selftests: line buffer test program's stdout git bisect bad 208f64c37a4e22b25b8037776c5713545eaf54fa # good: [08356142587c28b86817646ff318317b5237fdeb] mmu_notifiers: rename invalidate_range notifier git bisect good 08356142587c28b86817646ff318317b5237fdeb # good: [652555287069f2c0bbbfaf262eb41638f5c87550] mm: allow deferred splitting of arbitrary large anon folios git bisect good 652555287069f2c0bbbfaf262eb41638f5c87550 # bad: [904d9713b3b0e64329b2f6d159966b5c737444ff] mm: batch-zap large anonymous folio PTE mappings git bisect bad 904d9713b3b0e64329b2f6d159966b5c737444ff # good: [9a7c14665a566fbc1adc2c35982898abc1546525] mm: implement folio_remove_rmap_range() git bisect good 9a7c14665a566fbc1adc2c35982898abc1546525 # first bad commit: [904d9713b3b0e64329b2f6d159966b5c737444ff] mm: batch-zap large anonymous folio PTE mappings