From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20230125225358.2576151-1-zokeefe@google.com>
From: "Zach O'Keefe"
Date: Wed, 25 Jan 2023 16:37:33 -0800
Subject: Re: [PATCH v2] mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
To: Yang Shi
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton,
 Hugh Dickins, stable@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"

On Wed, Jan 25, 2023 at 4:24 PM Yang Shi wrote:
>
> On Wed, Jan 25, 2023 at 2:54 PM Zach O'Keefe wrote:
> >
> > In commit 34488399fa08 ("mm/madvise: add file and shmem support to
> > MADV_COLLAPSE") we make the following change to find_pmd_or_thp_or_none():
> >
> > -       if (!pmd_present(pmde))
> > -               return SCAN_PMD_NULL;
> > +       if (pmd_none(pmde))
> > +               return SCAN_PMD_NONE;
> >
> > This was for use by MADV_COLLAPSE file/shmem codepaths, where MADV_COLLAPSE
> > might identify a pte-mapped hugepage, only to have khugepaged race in, free
> > the pte table, and clear the pmd.  Such codepaths include:
> >
> > A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER
> >    already in the pagecache.
> > B) In retract_page_tables(), if we fail to grab mmap_lock for the target
> >    mm/address.
> >
> > In these cases, collapse_pte_mapped_thp() really does expect a none (not
> > just !present) pmd, and we want to suitably identify that case separate
> > from the case where no pmd is found, or it's a bad-pmd (of course, many
> > things could happen once we drop mmap_lock, and the pmd could plausibly
> > undergo multiple transitions due to intervening fault, split, etc).
> > Regardless, the code is prepared to install a huge-pmd only when the
> > existing pmd entry is either a genuine pte-table-mapping-pmd, or the
> > none-pmd.
> >
> > However, the commit introduces a logical hole; namely, that we've allowed
> > !none- && !huge- && !bad-pmds to be classified as genuine
> > pte-table-mapping-pmds.  One such example that could leak through is a
> > swap entry.  The pmd values aren't checked again before use in
> > pte_offset_map_lock(), which is expecting nothing less than a genuine
> > pte-table-mapping-pmd.
> >
> > We want to put back the !pmd_present() check (below the pmd_none() check),
> > but need to be careful to deal with subtleties in pmd transitions and
> > their treatment by various archs.
> >
> > The issue is that __split_huge_pmd_locked() temporarily clears the present
> > bit (or otherwise marks the entry as invalid), but pmd_present()
> > and pmd_trans_huge() still need to return true while the pmd is in this
> > transitory state.  For example, x86's pmd_present() also checks the
> > _PAGE_PSE bit, riscv's version also checks the _PAGE_LEAF bit, and arm64
> > also checks a PMD_PRESENT_INVALID bit.
> >
> > Covering all 4 cases for x86 (all checks done on the same pmd value):
> >
> > 1) pmd_present() && pmd_trans_huge()
> >    All we actually know here is that the PSE bit is set. Either:
> >    a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE
> >       is set.
> >       => huge-pmd
> >    b) We are currently racing with __split_huge_page().  The danger here
> >       is that we proceed as-if we have a huge-pmd, but really we are
> >       looking at a pte-mapping-pmd.  So, what is the risk of this
> >       danger?
> >
> >       The only relevant path is:
> >
> >         madvise_collapse() -> collapse_pte_mapped_thp()
> >
> >       Where we might just incorrectly report back "success", when really
> >       the memory isn't pmd-backed.  This is fine, since split could
> >       happen immediately after (actually) successful madvise_collapse().
> >       So, it should be safe to just assume huge-pmd here.
> >
> > 2) pmd_present() && !pmd_trans_huge()
> >    Either:
> >    a) PSE not set and either PRESENT or PROTNONE is.
> >       => pte-table-mapping pmd (or PROT_NONE)
> >    b) devmap.  This routine can be called immediately after
> >       unlocking/locking mmap_lock -- or called with no locks held (see
> >       khugepaged_scan_mm_slot()), so previous VMA checks have since been
> >       invalidated.
> >
> > 3) !pmd_present() && pmd_trans_huge()
> >    Not possible.
> >
> > 4) !pmd_present() && !pmd_trans_huge()
> >    Neither PRESENT nor PROTNONE set
> >    => not present
> >
> > I've checked all archs that implement pmd_trans_huge() (arm64, riscv,
> > powerpc, loongarch, x86, mips, s390) and this logic roughly translates
> > (though devmap treatment is unique to x86 and powerpc, and (3) doesn't
> > necessarily hold in general -- but that doesn't matter since !pmd_present()
> > always takes the failure path).
> >
> > Also, add a comment above find_pmd_or_thp_or_none() to help future
> > travelers reason about the validity of the code; namely, the possible
> > mutations that might happen out from under us, depending on how
> > mmap_lock is held (if at all).
> >
> > Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
> > Reported-by: Hugh Dickins
> > Signed-off-by: Zach O'Keefe
> > Cc: stable@vger.kernel.org
>
> Reviewed-by: Yang Shi

Thanks for your time as always, Yang!

Best,
Zach

> >
> > ---
> > Request that this be pulled into stable since it's theoretically
> > possible (though I have no reproducer) that while mmap_lock is dropped,
> > racing thp migration installs a pmd migration entry which then has a path to
> > be consumed, unchecked, by pte_offset_map().
> >
> > v1 -> v2: Fix typo
> > ---
> >  mm/khugepaged.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 9548644bdb56..1face2ae5877 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -943,6 +943,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >         return SCAN_SUCCEED;
> >  }
> >
> > +/*
> > + * See pmd_trans_unstable() for how the result may change out from
> > + * underneath us, even if we hold mmap_lock in read.
> > + */
> >  static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> >                                    unsigned long address,
> >                                    pmd_t **pmd)
> > @@ -961,8 +965,12 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> >  #endif
> >         if (pmd_none(pmde))
> >                 return SCAN_PMD_NONE;
> > +       if (!pmd_present(pmde))
> > +               return SCAN_PMD_NULL;
> >         if (pmd_trans_huge(pmde))
> >                 return SCAN_PMD_MAPPED;
> > +       if (pmd_devmap(pmde))
> > +               return SCAN_PMD_NULL;
> >         if (pmd_bad(pmde))
> >                 return SCAN_PMD_NULL;
> >         return SCAN_SUCCEED;
> > --
> > 2.39.1.456.gfc5497dd1b-goog
> >
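
For anyone following along, the check ordering in the quoted patch can be
exercised with a small userspace toy model. This is only a sketch under
stated assumptions: every toy_* name, the bit values, and TOY_SWAP (standing
in for a swap or migration entry) are made up for illustration and are not
kernel definitions. The predicates only mimic the x86 behaviour described in
the commit message, and "bad" is simplified to "unknown bits are set".

```c
/* Toy userspace model of the check ordering; NOT kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_pmd_t;

/* Hypothetical flag bits, chosen only for this illustration. */
#define TOY_PRESENT  (UINT64_C(1) << 0)
#define TOY_PSE      (UINT64_C(1) << 1) /* huge-page bit */
#define TOY_PROTNONE (UINT64_C(1) << 2)
#define TOY_DEVMAP   (UINT64_C(1) << 3)
#define TOY_SWAP     (UINT64_C(1) << 4) /* stand-in for a swap/migration entry */
#define TOY_KNOWN    (TOY_PRESENT | TOY_PSE | TOY_PROTNONE | TOY_DEVMAP | TOY_SWAP)

/* Result names follow the patch; the numeric values are arbitrary. */
enum toy_scan { SCAN_SUCCEED, SCAN_PMD_NULL, SCAN_PMD_NONE, SCAN_PMD_MAPPED };

bool toy_pmd_none(toy_pmd_t v) { return v == 0; }

/* PSE counts as present, so a pmd stays "present" in the middle of a split. */
bool toy_pmd_present(toy_pmd_t v)
{
	return (v & (TOY_PRESENT | TOY_PROTNONE | TOY_PSE)) != 0;
}

/* Huge means PSE set without DEVMAP (an assumption of this model). */
bool toy_pmd_trans_huge(toy_pmd_t v)
{
	return (v & (TOY_PSE | TOY_DEVMAP)) == TOY_PSE;
}

bool toy_pmd_devmap(toy_pmd_t v) { return (v & TOY_DEVMAP) != 0; }

/* In this model, "bad" just means bits we do not recognise are set. */
bool toy_pmd_bad(toy_pmd_t v) { return (v & ~TOY_KNOWN) != 0; }

/* Ordering before the patch: no present check and no devmap check. */
enum toy_scan toy_classify_before(toy_pmd_t v)
{
	if (toy_pmd_none(v))
		return SCAN_PMD_NONE;
	if (toy_pmd_trans_huge(v))
		return SCAN_PMD_MAPPED;
	if (toy_pmd_bad(v))
		return SCAN_PMD_NULL;
	return SCAN_SUCCEED;
}

/* Ordering after the patch: none, !present, huge, devmap, bad. */
enum toy_scan toy_classify_after(toy_pmd_t v)
{
	if (toy_pmd_none(v))
		return SCAN_PMD_NONE;
	if (!toy_pmd_present(v))
		return SCAN_PMD_NULL;
	if (toy_pmd_trans_huge(v))
		return SCAN_PMD_MAPPED;
	if (toy_pmd_devmap(v))
		return SCAN_PMD_NULL;
	if (toy_pmd_bad(v))
		return SCAN_PMD_NULL;
	return SCAN_SUCCEED;
}
```

In this model, toy_classify_before(TOY_SWAP) returns SCAN_SUCCEED, which is
the hole described in the commit message, while toy_classify_after(TOY_SWAP)
returns SCAN_PMD_NULL. A value with only TOY_PSE set (present bit cleared
mid-split) is still reported as SCAN_PMD_MAPPED, matching case 1b.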