linux-mm.kvack.org archive mirror
From: "Zach O'Keefe" <zokeefe@google.com>
To: Yang Shi <shy828301@gmail.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	 Andrew Morton <akpm@linux-foundation.org>,
	Hugh Dickins <hughd@google.com>,
	stable@vger.kernel.org
Subject: Re: [PATCH v2] mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
Date: Wed, 25 Jan 2023 16:37:33 -0800
Message-ID: <CAAa6QmSEYAGu+HCRLS8di6CF6aGPYf--MTDabh=_ixw-wpruCA@mail.gmail.com>
In-Reply-To: <CAHbLzkpJkMGF-HAZt-yqz-S9TEPW=4UWaZ0GJjs=tufrv2R8EQ@mail.gmail.com>

On Wed, Jan 25, 2023 at 4:24 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Jan 25, 2023 at 2:54 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > In commit 34488399fa08 ("mm/madvise: add file and shmem support to
> > MADV_COLLAPSE") we make the following change to find_pmd_or_thp_or_none():
> >
> >         -       if (!pmd_present(pmde))
> >         -               return SCAN_PMD_NULL;
> >         +       if (pmd_none(pmde))
> >         +               return SCAN_PMD_NONE;
> >
> > This was for use by MADV_COLLAPSE file/shmem codepaths, where MADV_COLLAPSE
> > might identify a pte-mapped hugepage, only to have khugepaged race in, free
> > the pte table, and clear the pmd.  Such codepaths include:
> >
> > A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER
> >    already in the pagecache.
> > B) In retract_page_tables(), if we fail to grab mmap_lock for the target
> >    mm/address.
> >
> > In these cases, collapse_pte_mapped_thp() really does expect a none (not
> > just !present) pmd, and we want to suitably identify that case separate
> > from the case where no pmd is found, or it's a bad-pmd (of course, many
> > things could happen once we drop mmap_lock, and the pmd could plausibly
> > undergo multiple transitions due to intervening fault, split, etc).
> > Regardless, the code is prepared to install a huge-pmd only when the existing
> > pmd entry is either a genuine pte-table-mapping-pmd, or the none-pmd.
> >
> > However, the commit introduces a logical hole; namely, that we've allowed
> > !none- && !huge- && !bad-pmds to be classified as genuine
> > pte-table-mapping-pmds.  One example that could leak through is a swap
> > entry.  The pmd values aren't checked again before use in
> > pte_offset_map_lock(), which is expecting nothing less than a genuine
> > pte-table-mapping-pmd.
> >
> > We want to put back the !pmd_present() check (below the pmd_none() check),
> > but need to be careful to deal with subtleties in pmd transitions and
> > treatments by various arch.
> >
> > The issue is that __split_huge_pmd_locked() temporarily clears the present
> > bit (or otherwise marks the entry as invalid), but pmd_present()
> > and pmd_trans_huge() still need to return true while the pmd is in this
> > transitory state.  For example, x86's pmd_present() also checks the
> > _PAGE_PSE bit, riscv's version also checks the _PAGE_LEAF bit, and arm64's
> > also checks a PMD_PRESENT_INVALID bit.
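
A minimal userspace sketch of the x86 behaviour described above, for
readers following along.  The model_* names and the bit positions are
assumptions made for illustration; they follow the shape of the x86
helpers but are not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits; the positions are assumptions, not kernel values. */
#define MODEL_PRESENT  (UINT64_C(1) << 0)
#define MODEL_PSE      (UINT64_C(1) << 7)
#define MODEL_PROTNONE (UINT64_C(1) << 8)

/* "Present" if any of PRESENT, PROTNONE or PSE is set. */
static bool model_pmd_present(uint64_t pmd)
{
	return (pmd & (MODEL_PRESENT | MODEL_PROTNONE | MODEL_PSE)) != 0;
}

/*
 * Model of the transitory state during a huge-pmd split: the present
 * bits are cleared while PSE is left untouched.
 */
static uint64_t model_pmd_mkinvalid(uint64_t pmd)
{
	return pmd & ~(MODEL_PRESENT | MODEL_PROTNONE);
}
```

With this model, a huge pmd (PRESENT | PSE) that has been passed through
model_pmd_mkinvalid() still satisfies model_pmd_present(), because PSE
alone is enough.  An all-zero value does not.
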
> >
> > Covering all 4 cases for x86 (all checks done on the same pmd value):
> >
> > 1) pmd_present() && pmd_trans_huge()
> >    All we actually know here is that the PSE bit is set. Either:
> >    a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE
> >       is set.
> >       => huge-pmd
> >    b) We are currently racing with __split_huge_page().  The danger here
> >       is that we proceed as-if we have a huge-pmd, but really we are
> >       looking at a pte-mapping-pmd.  So, what is the risk of this
> >       danger?
> >
> >       The only relevant path is:
> >
> >         madvise_collapse() -> collapse_pte_mapped_thp()
> >
> >       Where we might just incorrectly report back "success", when really
> >       the memory isn't pmd-backed.  This is fine, since split could
> >       happen immediately after (actually) successful madvise_collapse().
> >       So, it should be safe to just assume huge-pmd here.
> >
> > 2) pmd_present() && !pmd_trans_huge()
> >    Either:
> >    a) PSE not set and either PRESENT or PROTNONE is.
> >       => pte-table-mapping pmd (or PROT_NONE)
> >    b) devmap.  This routine can be called immediately after
> >       unlocking/locking mmap_lock -- or called with no locks held (see
> >       khugepaged_scan_mm_slot()), so previous VMA checks have since been
> >       invalidated.
> >
> > 3) !pmd_present() && pmd_trans_huge()
> >   Not possible.
> >
> > 4) !pmd_present() && !pmd_trans_huge()
> >   Neither PRESENT nor PROTNONE set
> >   => not present
> >
> > I've checked all archs that implement pmd_trans_huge() (arm64, riscv,
> > powerpc, loongarch, x86, mips, s390) and this logic roughly translates
> > (though devmap treatment is unique to x86 and powerpc, and (3) doesn't
> > necessarily hold in general -- but that doesn't matter since !pmd_present()
> > always takes failure path).
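
The four x86 cases can be checked exhaustively in a small userspace
model.  This is only a sketch: the model_* names are hypothetical, the
four flag bits are packed into the low bits for convenience (not their
real positions), and the two predicates are simplified stand-ins for
pmd_present() and pmd_trans_huge().

```c
#include <stdbool.h>
#include <stdint.h>

/* Compact, illustrative bit layout (an assumption, not the kernel's). */
#define MODEL_PRESENT  (UINT64_C(1) << 0)
#define MODEL_PSE      (UINT64_C(1) << 1)
#define MODEL_PROTNONE (UINT64_C(1) << 2)
#define MODEL_DEVMAP   (UINT64_C(1) << 3)
#define MODEL_NR_BITS  4

static bool model_pmd_present(uint64_t pmd)
{
	return (pmd & (MODEL_PRESENT | MODEL_PROTNONE | MODEL_PSE)) != 0;
}

/* Huge only if PSE is set and DEVMAP is not. */
static bool model_pmd_trans_huge(uint64_t pmd)
{
	return (pmd & (MODEL_PSE | MODEL_DEVMAP)) == MODEL_PSE;
}

/* Returns the case number (1-4) used in the list above. */
static int model_case(uint64_t pmd)
{
	bool present = model_pmd_present(pmd);
	bool huge = model_pmd_trans_huge(pmd);

	if (present)
		return huge ? 1 : 2;
	return huge ? 3 : 4;
}

/* Walks every combination of the four model bits looking for case 3. */
static bool model_case3_reachable(void)
{
	for (uint64_t pmd = 0; pmd < (UINT64_C(1) << MODEL_NR_BITS); pmd++)
		if (model_case(pmd) == 3)
			return true;
	return false;
}
```

In this model, case 3 is unreachable because the huge predicate requires
PSE, and PSE alone makes the entry present.  A devmap entry (PSE |
DEVMAP) lands in case 2, matching 2b above.
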
> >
> > Also, add a comment above find_pmd_or_thp_or_none() to help future
> > travelers reason about the validity of the code; namely, the possible
> > mutations that might happen out from under us, depending on how
> > mmap_lock is held (if at all).
> >
> > Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
> > Reported-by: Hugh Dickins <hughd@google.com>
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > Cc: stable@vger.kernel.org
>
> Reviewed-by: Yang Shi <shy828301@gmail.com>

Thanks for your time as always, Yang!

Best,
Zach

> >
> > ---
> > Request that this be pulled into stable since it's theoretically
> > possible (though I have no reproducer) that while mmap_lock is dropped,
> > racing thp migration installs a pmd migration entry which then has a path to
> > be consumed, unchecked, by pte_offset_map().
> >
> > v1 -> v2: Fix typo
> > ---
> >  mm/khugepaged.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 9548644bdb56..1face2ae5877 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -943,6 +943,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >         return SCAN_SUCCEED;
> >  }
> >
> > +/*
> > + * See pmd_trans_unstable() for how the result may change out from
> > + * underneath us, even if we hold mmap_lock in read.
> > + */
> >  static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> >                                    unsigned long address,
> >                                    pmd_t **pmd)
> > @@ -961,8 +965,12 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> >  #endif
> >         if (pmd_none(pmde))
> >                 return SCAN_PMD_NONE;
> > +       if (!pmd_present(pmde))
> > +               return SCAN_PMD_NULL;
> >         if (pmd_trans_huge(pmde))
> >                 return SCAN_PMD_MAPPED;
> > +       if (pmd_devmap(pmde))
> > +               return SCAN_PMD_NULL;
> >         if (pmd_bad(pmde))
> >                 return SCAN_PMD_NULL;
> >         return SCAN_SUCCEED;
> > --
> > 2.39.1.456.gfc5497dd1b-goog
> >
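
The effect of the two added checks in the hunk above can be sketched in
userspace as follows.  Everything here is a hypothetical model: struct
model_pmd simply records what each predicate would return for one pmd
value, and the enum mirrors the SCAN_* names from the diff.  Whether
pmd_bad() happens to reject a non-present entry is arch-dependent; the
sketch assumes it does not.

```c
#include <stdbool.h>

enum model_scan_result {
	MODEL_SCAN_SUCCEED,
	MODEL_SCAN_PMD_NULL,
	MODEL_SCAN_PMD_NONE,
	MODEL_SCAN_PMD_MAPPED,
};

/* Predicate results for a single pmd value (illustrative only). */
struct model_pmd {
	bool none;
	bool present;
	bool trans_huge;
	bool devmap;
	bool bad;
};

/*
 * Check order from the diff.  With patched == false the two added
 * checks are skipped, modelling the code before this change.
 */
static enum model_scan_result model_classify(struct model_pmd p, bool patched)
{
	if (p.none)
		return MODEL_SCAN_PMD_NONE;
	if (patched && !p.present)
		return MODEL_SCAN_PMD_NULL;
	if (p.trans_huge)
		return MODEL_SCAN_PMD_MAPPED;
	if (patched && p.devmap)
		return MODEL_SCAN_PMD_NULL;
	if (p.bad)
		return MODEL_SCAN_PMD_NULL;
	return MODEL_SCAN_SUCCEED;
}
```

For a non-none, non-present entry that is not flagged bad (the
swap/migration-entry shape discussed above), the unpatched order returns
MODEL_SCAN_SUCCEED while the patched order returns MODEL_SCAN_PMD_NULL.
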

