From: Jann Horn <jannh@google.com>
Date: Wed, 31 May 2023 19:25:01 +0200
Subject: Re: [PATCH 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
To: Hugh Dickins
Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, "Kirill A. Shutemov", Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King, "David S. Miller", Michael Ellerman, "Aneesh Kumar K.V", Heiko Carstens, Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev, linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Mon, May 29, 2023 at
8:26 AM Hugh Dickins wrote:
> Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> nor vma_start_write() nor i_mmap lock nor anon_vma lock. All racing
> paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

I think there's a weirdness in the existing code, and this change
probably turns that into a UAF bug.

collapse_pte_mapped_thp() can be called on an address that might not
be associated with a VMA anymore, and after this change, the page
tables for that address might be in the middle of page table teardown
in munmap(), right? The existing mmap_write_lock() guards against
concurrent munmap() (so in the old code we are guaranteed to either
see a normal VMA or not see the page tables anymore), but
mmap_read_lock() only guards against the part of munmap() up to the
mmap_write_downgrade() in do_vmi_align_munmap(), and unmap_region()
(including free_pgtables()) happens after that.
So we can now enter collapse_pte_mapped_thp() and race with concurrent
free_pgtables() such that a PUD disappears under us while we're
walking it or something like that:

int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
                            bool install_pmd)
{
        struct mmu_notifier_range range;
        unsigned long haddr = addr & HPAGE_PMD_MASK;
        struct vm_area_struct *vma = vma_lookup(mm, haddr); // <<< returns NULL
        struct page *hpage;
        pte_t *start_pte, *pte;
        pmd_t *pmd, pgt_pmd;
        spinlock_t *pml, *ptl;
        int nr_ptes = 0, result = SCAN_FAIL;
        int i;

        mmap_assert_locked(mm);

        /* Fast check before locking page if already PMD-mapped */
        result = find_pmd_or_thp_or_none(mm, haddr, &pmd); // <<< PUD UAF in here
        if (result == SCAN_PMD_MAPPED)
                return result;

        if (!vma || !vma->vm_file || // <<< bailout happens too late
            !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
                return SCAN_VMA_CHECK;

I guess the right fix here is to make sure that at least the basic VMA
revalidation stuff (making sure there still is a VMA covering this
range) happens before find_pmd_or_thp_or_none()? Like:

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 301c0e54a2ef..5db365587556 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1481,15 +1481,15 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,

        mmap_assert_locked(mm);

+       if (!vma || !vma->vm_file ||
+           !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
+               return SCAN_VMA_CHECK;
+
        /* Fast check before locking page if already PMD-mapped */
        result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
        if (result == SCAN_PMD_MAPPED)
                return result;

-       if (!vma || !vma->vm_file ||
-           !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
-               return SCAN_VMA_CHECK;
-
        /*
         * If we are here, we've succeeded in replacing all the native pages
         * in the page cache with a single hugepage. If a mm were to fault-in