From: Ryan Roberts <ryan.roberts@arm.com>
Howlett" , Lorenzo Stoakes , David Hildenbrand , Vlastimil Babka , Jann Horn , Mel Gorman Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org Subject: [PATCH v1] mm: Close theoretical race where stale TLB entries could linger Date: Fri, 6 Jun 2025 10:28:07 +0100 Message-ID: <20250606092809.4194056-1-ryan.roberts@arm.com> X-Mailer: git-send-email 2.43.0 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Rspam-User: X-Stat-Signature: 81cdtkwecpj3opui9rfizqxophap9zjr X-Rspamd-Queue-Id: 1A9E1C0009 X-Rspamd-Server: rspam11 X-HE-Tag: 1749202102-297288 X-HE-Meta: U2FsdGVkX1+OipFJAfIbtxZVoAmmtrbVSjcNQwYlHijR+imFAdwk+qlZPg1Aef1iWqkwXn63Dwyl9dPgfCTwCBcGJYS4KpMTVDlHnTu/XWA8e7q+HzRJmc9fyQxxUCVOrIAk8q8xX8TWHZQDp+CIrcQ9snfK9sVcxYtCSQucN4pY6sWXmEmTzc0v+V2GKbktGP9ZzEFM1RVNaHalL2gDw4rHuGYl0O0BEGyJ6HDyOrYCi9on+ZUE1fhyupIGBUR/2hSQBz+0LE1E2lu1HjMZxVewf5e3O+ujTWzxyVe9WTvEr7SmVn3OJQmv0VIfieG8xb2soNdolXjX/xG+4WRlABmFYJnMlFOG5Ag4F4crwRVSTu//88EBWKAUjRNi3uQRDyr/q451ksl+/5ENo8fBPVpygFAGJ6AWmUoBr2pLogUbRGLrFU7DErdiUyPjFqmoLH3O9dQwrTY4bHEDHMq5ClJxm1rNXTeFwF2Pr6ksqbXCGlj154CkrZ5LoLUktFZGX6NGvsHcMmm8ueLi/Xeg0ITiLmu0jaG9MfYo27tYcSzRus+hfLwazUgYiXeIWuiXv/eCM8Kodvvxr3CUPVzQI/YlIH7svAQeuTAkB7/fx9i8/7qpTJCOmapcHk6RqIFWxjG+slaYTwM/LmaGnFh1YKdJOXqFc2ntots70qXemcjbaxz/Py3AK2sWsOvxLdDuRXmXwx/j60G2MQs5lACpxmz9hyg8bxrv3HDasTCGvI6YD3z/llQfVEuXxfKJx4WwZFGS47qwarq/OAnVJ9oXCYAVs5J2hWfAe7pLGvfl8ldIsuDFh9ZD9SGf5ZjkSl7QzhtqunfdM18fhGz4NkW3gJeuH7J/AbVY6BYYVF3IIsW1DBPXdZGZdqnPC5X3jFYU0LSur2aM1vOA9SyTdqjTEwHgD2JrX0D18FO2m9UIhPT9RGfBa4IT1zvQihQRZmEV3ccW9iLsOv9wATJArMe IChDSBgk eEPRozJ655R5m4EL/om0JmFpPQOh9yB1XigVXWBr3IQyXWfarFCn5FqIBvpZVnvM3NzMCZkaYryCU6a1PLdJ5h/sNttcqQuaWP8cC4j5etyc841dVKPrSbYYSwUxKVRKRB2NzKIWOSCs2+FH2IzoLXuZwa99n3HqZIruMe+OVaAOcVH0= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Commit 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries") described a theoretical race as such: """ Nadav Amit identified a theoritical race between page reclaim and mprotect due to TLB flushes being batched outside of the PTL being held. He described the race as follows: CPU0 CPU1 ---- ---- user accesses memory using RW PTE [PTE now cached in TLB] try_to_unmap_one() ==> ptep_get_and_clear() ==> set_tlb_ubc_flush_pending() mprotect(addr, PROT_READ) ==> change_pte_range() ==> [ PTE non-present - no flush ] user writes using cached RW PTE ... try_to_unmap_flush() The same type of race exists for reads when protecting for PROT_NONE and also exists for operations that can leave an old TLB entry behind such as munmap, mremap and madvise. """ The solution was to introduce flush_tlb_batched_pending() and call it under the PTL from mprotect/madvise/munmap/mremap to complete any pending tlb flushes. However, while madvise_free_pte_range() and madvise_cold_or_pageout_pte_range() were both retro-fitted to call flush_tlb_batched_pending() immediately after initially acquiring the PTL, they both temporarily release the PTL to split a large folio if they stumble upon one. In this case, where re-acquiring the PTL flush_tlb_batched_pending() must be called again, but it previously was not. Let's fix that. There are 2 Fixes: tags here: the first is the commit that fixed madvise_free_pte_range(). 
There are two Fixes: tags here: the first names the commit that
originally fixed madvise_free_pte_range(); the second names the commit
that added madvise_cold_or_pageout_pte_range(), which looks like it
copy/pasted the faulty pattern from madvise_free_pte_range().

This is a theoretical bug discovered during code review.

Cc: stable@vger.kernel.org
Fixes: 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries")
Fixes: 9c276cc65a58 ("mm: introduce MADV_COLD")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---

Applies on today's mm-unstable (3f676fe5c7a0). All mm selftests
continue to pass.

Thanks,
Ryan

 mm/madvise.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 5f7a66a1617e..1d44a35ae85c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -508,6 +508,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 				pte_offset_map_lock(mm, pmd, addr, &ptl);
 			if (!start_pte)
 				break;
+			flush_tlb_batched_pending(mm);
 			arch_enter_lazy_mmu_mode();
 			if (!err)
 				nr = 0;
@@ -741,6 +742,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			start_pte = pte;
 			if (!start_pte)
 				break;
+			flush_tlb_batched_pending(mm);
 			arch_enter_lazy_mmu_mode();
 			if (!err)
 				nr = 0;
--
2.43.0