From: Yang Shi
Date: Wed, 25 Jan 2023 16:24:10 -0800
Subject: Re: [PATCH v2] mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
To: "Zach O'Keefe"
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton, Hugh Dickins, stable@vger.kernel.org
In-Reply-To: <20230125225358.2576151-1-zokeefe@google.com>

On Wed, Jan 25, 2023 at 2:54 PM Zach O'Keefe wrote:
>
> In commit 34488399fa08 ("mm/madvise: add file and shmem support to
> MADV_COLLAPSE") we make the following change to find_pmd_or_thp_or_none():
>
>         -       if (!pmd_present(pmde))
>         -               return SCAN_PMD_NULL;
>         +       if (pmd_none(pmde))
>         +               return SCAN_PMD_NONE;
>
> This was for use by MADV_COLLAPSE file/shmem codepaths, where MADV_COLLAPSE
> might identify a pte-mapped hugepage, only to have khugepaged race in, free
> the pte table, and clear the pmd.  Such codepaths include:
>
> A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER
>    already in the pagecache.
> B) In retract_page_tables(), if we fail to grab mmap_lock for the target
>    mm/address.
>
> In these cases, collapse_pte_mapped_thp() really does expect a none (not
> just !present) pmd, and we want to suitably identify that case separate
> from the case where no pmd is found, or it's a bad-pmd (of course, many
> things could happen once we drop mmap_lock, and the pmd could plausibly
> undergo multiple transitions due to intervening fault, split, etc).
> Regardless, the code is prepared to install a huge-pmd only when the
> existing pmd entry is either a genuine pte-table-mapping-pmd, or the
> none-pmd.
>
> However, the commit introduces a logical hole; namely, that we've allowed
> !none- && !huge- && !bad-pmds to be classified as genuine
> pte-table-mapping-pmds.  Swap entries are one example that could leak
> through.  The pmd values aren't checked again before use in
> pte_offset_map_lock(), which is expecting nothing less than a genuine
> pte-table-mapping-pmd.
>
> We want to put back the !pmd_present() check (below the pmd_none() check),
> but need to be careful to deal with subtleties in pmd transitions and
> treatments by various archs.
>
> The issue is that __split_huge_pmd_locked() temporarily clears the present
> bit (or otherwise marks the entry as invalid), but pmd_present()
> and pmd_trans_huge() still need to return true while the pmd is in this
> transitory state.  For example, x86's pmd_present() also checks the
> _PAGE_PSE bit, riscv's version also checks the _PAGE_LEAF bit, and arm64
> also checks a PMD_PRESENT_INVALID bit.
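
To make the transitory state concrete, here is a minimal user-space sketch, assuming an x86-like scheme where "present" is true if any of three flag bits is set.  The model_* and MODEL_* names and the bit positions are invented for illustration; they are not the kernel's definitions, and other archs encode this differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for a user-space model; not the kernel's values. */
#define MODEL_PRESENT  (UINT64_C(1) << 0)
#define MODEL_PSE      (UINT64_C(1) << 1)  /* entry maps a huge page */
#define MODEL_PROTNONE (UINT64_C(1) << 2)

typedef uint64_t model_pmd_t;

/* "Present" also looks at PSE, so a huge pmd stays present mid-split. */
bool model_pmd_present(model_pmd_t pmd)
{
	return (pmd & (MODEL_PRESENT | MODEL_PROTNONE | MODEL_PSE)) != 0;
}

bool model_pmd_trans_huge(model_pmd_t pmd)
{
	return (pmd & MODEL_PSE) != 0;
}

/* Models the temporary invalidation during a split: PRESENT is dropped. */
model_pmd_t model_pmd_mkinvalid(model_pmd_t pmd)
{
	return pmd & ~MODEL_PRESENT;
}

/* Walks a huge pmd through the transitory state and checks both helpers. */
void model_split_demo(void)
{
	model_pmd_t huge = MODEL_PRESENT | MODEL_PSE;
	model_pmd_t mid_split = model_pmd_mkinvalid(huge);

	assert((mid_split & MODEL_PRESENT) == 0); /* raw bit is clear ...   */
	assert(model_pmd_present(mid_split));     /* ... yet still present  */
	assert(model_pmd_trans_huge(mid_split));  /* ... and still huge     */
}
```

Under this model, a check that tested only the raw present bit would misread a pmd that is being split as not present, which is the subtlety the paragraph above describes.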
>
> Covering all 4 cases for x86 (all checks done on the same pmd value):
>
> 1) pmd_present() && pmd_trans_huge()
>    All we actually know here is that the PSE bit is set.  Either:
>    a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE
>       is set.
>       => huge-pmd
>    b) We are currently racing with __split_huge_page().  The danger here
>       is that we proceed as if we have a huge-pmd, but really we are
>       looking at a pte-mapping-pmd.  So, what is the risk of this
>       danger?
>
>       The only relevant path is:
>
>         madvise_collapse() -> collapse_pte_mapped_thp()
>
>       Where we might just incorrectly report back "success", when really
>       the memory isn't pmd-backed.  This is fine, since split could
>       happen immediately after (actually) successful madvise_collapse().
>       So, it should be safe to just assume huge-pmd here.
>
> 2) pmd_present() && !pmd_trans_huge()
>    Either:
>    a) PSE not set and either PRESENT or PROTNONE is.
>       => pte-table-mapping pmd (or PROT_NONE)
>    b) devmap.  This routine can be called immediately after
>       unlocking/locking mmap_lock -- or called with no locks held (see
>       khugepaged_scan_mm_slot()), so previous VMA checks have since been
>       invalidated.
>
> 3) !pmd_present() && pmd_trans_huge()
>    Not possible.
>
> 4) !pmd_present() && !pmd_trans_huge()
>    Neither PRESENT nor PROTNONE set
>    => not present
>
> I've checked all archs that implement pmd_trans_huge() (arm64, riscv,
> powerpc, loongarch, x86, mips, s390) and this logic roughly translates
> (though devmap treatment is unique to x86 and powerpc, and (3) doesn't
> necessarily hold in general -- but that doesn't matter since !pmd_present()
> always takes the failure path).
>
> Also, add a comment above find_pmd_or_thp_or_none() to help future
> travelers reason about the validity of the code; namely, the possible
> mutations that might happen out from under us, depending on how
> mmap_lock is held (if at all).
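
The case analysis above can be sketched as a small user-space model of the check ordering (none, then !present, then huge, then devmap).  This is only an illustration under assumptions: the MODEL_* flag bits and model_* helpers are hypothetical stand-ins, the pmd_bad() check is left out, and the return names merely mirror the SCAN_* codes used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for a user-space model; not the kernel's values. */
#define MODEL_PRESENT  (UINT64_C(1) << 0)
#define MODEL_PSE      (UINT64_C(1) << 1)
#define MODEL_PROTNONE (UINT64_C(1) << 2)
#define MODEL_DEVMAP   (UINT64_C(1) << 3)
#define MODEL_SWAP     (UINT64_C(1) << 4)  /* stands in for a swap/migration payload */

enum model_scan {
	MODEL_SCAN_SUCCEED,
	MODEL_SCAN_PMD_NULL,
	MODEL_SCAN_PMD_NONE,
	MODEL_SCAN_PMD_MAPPED,
};

static bool model_pmd_present(uint64_t pmd)
{
	return (pmd & (MODEL_PRESENT | MODEL_PROTNONE | MODEL_PSE)) != 0;
}

static bool model_pmd_trans_huge(uint64_t pmd)
{
	/* Huge, but not a devmap entry. */
	return (pmd & (MODEL_PSE | MODEL_DEVMAP)) == MODEL_PSE;
}

/* Mirrors the order of checks in the patched function, minus pmd_bad(). */
enum model_scan model_classify_pmd(uint64_t pmd)
{
	if (pmd == 0)
		return MODEL_SCAN_PMD_NONE;
	if (!model_pmd_present(pmd))
		return MODEL_SCAN_PMD_NULL;    /* case 4: e.g. swap-like entry */
	if (model_pmd_trans_huge(pmd))
		return MODEL_SCAN_PMD_MAPPED;  /* case 1a and 1b */
	if (pmd & MODEL_DEVMAP)
		return MODEL_SCAN_PMD_NULL;    /* case 2b */
	return MODEL_SCAN_SUCCEED;             /* case 2a */
}

/* Exercises one representative value per case. */
void model_classify_demo(void)
{
	assert(model_classify_pmd(0) == MODEL_SCAN_PMD_NONE);
	assert(model_classify_pmd(MODEL_SWAP) == MODEL_SCAN_PMD_NULL);
	assert(model_classify_pmd(MODEL_PRESENT | MODEL_PSE) == MODEL_SCAN_PMD_MAPPED);
	assert(model_classify_pmd(MODEL_PRESENT) == MODEL_SCAN_SUCCEED);
}
```

In this model, dropping the !present check would let the swap-like value fall through to MODEL_SCAN_SUCCEED, which is the hole the patch closes.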
>
> Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
> Reported-by: Hugh Dickins
> Signed-off-by: Zach O'Keefe
> Cc: stable@vger.kernel.org

Reviewed-by: Yang Shi

> ---
> Request that this be pulled into stable since it's theoretically
> possible (though I have no reproducer) that while mmap_lock is dropped,
> racing thp migration installs a pmd migration entry which then has a path to
> be consumed, unchecked, by pte_offset_map().
>
> v1 -> v2: Fix typo
> ---
>  mm/khugepaged.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9548644bdb56..1face2ae5877 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -943,6 +943,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>         return SCAN_SUCCEED;
>  }
>
> +/*
> + * See pmd_trans_unstable() for how the result may change out from
> + * underneath us, even if we hold mmap_lock in read.
> + */
>  static int find_pmd_or_thp_or_none(struct mm_struct *mm,
>                                    unsigned long address,
>                                    pmd_t **pmd)
> @@ -961,8 +965,12 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
>  #endif
>         if (pmd_none(pmde))
>                 return SCAN_PMD_NONE;
> +       if (!pmd_present(pmde))
> +               return SCAN_PMD_NULL;
>         if (pmd_trans_huge(pmde))
>                 return SCAN_PMD_MAPPED;
> +       if (pmd_devmap(pmde))
> +               return SCAN_PMD_NULL;
>         if (pmd_bad(pmde))
>                 return SCAN_PMD_NULL;
>         return SCAN_SUCCEED;
> --
> 2.39.1.456.gfc5497dd1b-goog
>