From: Yang Shi <shy828301@gmail.com>
To: Jiaqi Yan <jiaqiyan@google.com>
Cc: "Tony Luck" <tony.luck@intel.com>,
"HORIGUCHI NAOYA(堀口 直也)" <naoya.horiguchi@nec.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
"Miaohe Lin" <linmiaohe@huawei.com>, "Jue Wang" <juew@google.com>,
"Linux MM" <linux-mm@kvack.org>
Subject: Re: [RFC v1 1/2] mm: khugepaged: recover from poisoned anonymous memory
Date: Wed, 6 Apr 2022 09:02:08 -0700
Message-ID: <CAHbLzkq_CzPsP+POdqHbJg28y1_tVN3e5xuZrpz3DZ0MM1no9g@mail.gmail.com>
In-Reply-To: <CACw3F51AEvy4gpNhS4h=DtxSL7F41WhSv84747cDh5nz9xZ0fg@mail.gmail.com>
On Tue, Apr 5, 2022 at 1:48 PM Jiaqi Yan <jiaqiyan@google.com> wrote:
>
> On Thu, Mar 24, 2022 at 7:20 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Wed, Mar 23, 2022 at 4:29 PM Jiaqi Yan <jiaqiyan@google.com> wrote:
> > >
> > > Make __collapse_huge_page_copy return whether
> > > collapsing/copying anonymous pages succeeded,
> > > and make collapse_huge_page handle the return status.
> > >
> > > Break the existing PTE scan loop into two for-loops.
> > > The first loop copies source pages into the target huge page,
> > > and can fail gracefully when running into memory errors in
> > > source pages. Roll back the page table and page states
> > > in the 2nd loop when the copy fails:
> > > 1) re-establish the PTEs-to-PMD connection.
> > > 2) release pages back to their LRU list.
> > >
> > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > > ---
> > > include/linux/highmem.h | 19 ++++++
> > > mm/khugepaged.c | 136 ++++++++++++++++++++++++++++++----------
> > > 2 files changed, 122 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> > > index 39bb9b47fa9c..15d0aa4d349c 100644
> > > --- a/include/linux/highmem.h
> > > +++ b/include/linux/highmem.h
> > > @@ -281,6 +281,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from,
> > > kunmap_local(vfrom);
> > > }
> > >
> > > +/*
> > > + * Machine check exception handled version of copy_user_highpage.
> > > + * Return true if copying page content failed; otherwise false.
> > > + */
> > > +static inline bool copy_user_highpage_mc(struct page *to, struct page *from,
> > > + unsigned long vaddr, struct vm_area_struct *vma)
> >
> > Patch #2 defined copy_highpage_mc() which has the same implementation.
> > This function has two unused parameters: vaddr and vma. It seems we
> > just need to keep one.
> >
>
> Version 2 will delete copy_user_highpage_mc() from the patch series,
> and just keep copy_highpage_mc().
>
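For reference, since patch #2's copy_highpage_mc() has the same
implementation, I assume the retained helper in v2 will look roughly
like this (a sketch only, just dropping the unused vaddr/vma
parameters):

static inline bool copy_highpage_mc(struct page *to, struct page *from)
{
	char *vfrom, *vto;
	unsigned long ret;

	vfrom = kmap_local_page(from);
	vto = kmap_local_page(to);
	/* copy_mc_to_kernel() returns the number of bytes not copied */
	ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
	kunmap_local(vto);
	kunmap_local(vfrom);

	return ret > 0;
}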
> > > +{
> > > + char *vfrom, *vto;
> > > + unsigned long ret;
> > > +
> > > + vfrom = kmap_local_page(from);
> > > + vto = kmap_local_page(to);
> > > + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> > > + kunmap_local(vto);
> > > + kunmap_local(vfrom);
> > > +
> > > + return ret > 0;
> > > +}
> > > +
> > > #endif
> > >
> > > #ifndef __HAVE_ARCH_COPY_HIGHPAGE
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 131492fd1148..84ed177f56ff 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -739,44 +739,97 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > > return 0;
> > > }
> > >
> > > -static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > > - struct vm_area_struct *vma,
> > > - unsigned long address,
> > > - spinlock_t *ptl,
> > > - struct list_head *compound_pagelist)
> > > +/*
> > > + * __collapse_huge_page_copy - attempts to copy memory contents from normal
> > > + * pages to a hugepage. Clean up the normal pages if the copy succeeds;
> > > + * otherwise restore the original pmd page table.
> > > + *
> > > + * @pte: starting of the PTEs to copy from
> > > + * @page: the new hugepage to copy contents to
> > > + * @pmd: pointer to the new hugepage's PMD
> > > + * @rollback: the original normal PTEs' PMD
> > > + * @address: starting address to copy
> > > + * @pte_ptl: lock on normal pages' PTEs
> > > + * @compound_pagelist: list that stores compound pages
> > > + */
> > > +static bool __collapse_huge_page_copy(pte_t *pte,
> > > + struct page *page,
> > > + pmd_t *pmd,
> > > + pmd_t rollback,
> > > + struct vm_area_struct *vma,
> > > + unsigned long address,
> > > + spinlock_t *pte_ptl,
> > > + struct list_head *compound_pagelist)
> > > {
> > > struct page *src_page, *tmp;
> > > pte_t *_pte;
> > > - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > > - _pte++, page++, address += PAGE_SIZE) {
> > > - pte_t pteval = *_pte;
> > > + pte_t pteval;
> > > + unsigned long _address;
> > > + spinlock_t *pmd_ptl;
> > > + bool copy_succeeded = true;
> > >
> > > - if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > + /*
> > > + * Copying pages' contents is subject to memory poison at any iteration.
> > > + */
> > > + for (_pte = pte, _address = address;
> > > + _pte < pte + HPAGE_PMD_NR;
> > > + _pte++, page++, _address += PAGE_SIZE) {
> > > + pteval = *_pte;
> > > +
> > > + if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
> > > clear_user_highpage(page, address);
> > > - add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > > - if (is_zero_pfn(pte_pfn(pteval))) {
> > > - /*
> > > - * ptl mostly unnecessary.
> > > - */
> > > - spin_lock(ptl);
> > > - ptep_clear(vma->vm_mm, address, _pte);
> > > - spin_unlock(ptl);
> > > + else {
> > > + src_page = pte_page(pteval);
> > > + if (copy_user_highpage_mc(page, src_page, _address, vma)) {
> > > + copy_succeeded = false;
> > > + break;
> > > + }
> > > + }
> > > + }
> > > +
> > > + if (!copy_succeeded) {
> > > + /*
> > > + * Copying failed, re-establish the regular PMD that
> > > + * points to regular page table. Since PTEs are still
> > > + * isolated and locked, acquiring anon_vma_lock is unnecessary.
> > > + */
> > > + pmd_ptl = pmd_lock(vma->vm_mm, pmd);
> > > + pmd_populate(vma->vm_mm, pmd, pmd_pgtable(rollback));
> > > + spin_unlock(pmd_ptl);
> >
> > I think we could just jump to list_for_each_entry_safe, right? But it
> > seems that list_for_each_entry_safe is not enough, please see the
> > later comments.
>
> Yes, it is not enough; we also need to release_pte_page() for
> non-compound pages.
>
> If we were to use a jump/goto, we would need two for-loops over the
> PTEs: one for the copy-succeeded case and one for the failure case.
> What I have right now combines the two for-loops into one.
> I don't want to switch to a jump because: 1) goto statements hurt
> readability, and 2) one for-loop is more concise than two.
Makes sense to me; it is not worth two loops.
>
> >
> > > + }
> > > +
> > > + for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> > > + _pte++, _address += PAGE_SIZE) {
> > > + pteval = *_pte;
> > > + if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > + if (copy_succeeded) {
> >
> > With the above jump, you don't need to check if copy is succeeded or
> > not since this is the succeeded path only.
> >
> > > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > > + if (is_zero_pfn(pte_pfn(pteval))) {
> > > + /*
> > > + * ptl mostly unnecessary.
> > > + */
> > > + spin_lock(pte_ptl);
> > > + pte_clear(vma->vm_mm, _address, _pte);
> > > + spin_unlock(pte_ptl);
> > > + }
> > > }
> > > } else {
> > > src_page = pte_page(pteval);
> > > - copy_user_highpage(page, src_page, address, vma);
> > > if (!PageCompound(src_page))
> > > release_pte_page(src_page);
> > > - /*
> > > - * ptl mostly unnecessary, but preempt has to
> > > - * be disabled to update the per-cpu stats
> > > - * inside page_remove_rmap().
> > > - */
> > > - spin_lock(ptl);
> > > - ptep_clear(vma->vm_mm, address, _pte);
> > > - page_remove_rmap(src_page, false);
> > > - spin_unlock(ptl);
> > > - free_page_and_swap_cache(src_page);
> > > +
> > > + if (copy_succeeded) {
> > > + /*
> > > + * ptl mostly unnecessary, but preempt has to
> > > + * be disabled to update the per-cpu stats
> > > + * inside page_remove_rmap().
> > > + */
> > > + spin_lock(pte_ptl);
> > > + pte_clear(vma->vm_mm, _address, _pte);
> > > + page_remove_rmap(src_page, false);
> > > + spin_unlock(pte_ptl);
> > > + free_page_and_swap_cache(src_page);
> > > + }
> > > }
> > > }
> > >
> > > @@ -784,6 +837,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > > list_del(&src_page->lru);
> > > release_pte_page(src_page);
> > > }
> >
> > If copy failed you need to unpin the isolated pages and put them back
> > to LRU IIUC.
>
> No matter whether the copy failed or succeeded, we need a for-loop to
> unpin the isolated pages and put them back on the LRU (i.e.
> release_pte_page()). In the current code, this is a common operation
> in the above for-loop.
>
> >
> > > +
> > > + return copy_succeeded;
> > > }
> > >
> > > static void khugepaged_alloc_sleep(void)
> > > @@ -1066,6 +1121,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > struct vm_area_struct *vma;
> > > struct mmu_notifier_range range;
> > > gfp_t gfp;
> > > + bool copy_succeeded = false;
> > >
> > > VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > >
> > > @@ -1177,9 +1233,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > */
> > > anon_vma_unlock_write(vma->anon_vma);
> > >
> > > - __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
> > > - &compound_pagelist);
> > > + copy_succeeded = __collapse_huge_page_copy(pte, new_page, pmd, _pmd,
> > > + vma, address, pte_ptl, &compound_pagelist);
> > > pte_unmap(pte);
> > > + if (!copy_succeeded) {
> >
> > Shall the fail handling be moved before pte_unmap()?
>
> I don't think so. I think at this point we need to pte_unmap() no
> matter whether the copying succeeded or not. It is obviously needed if
> __collapse_huge_page_copy succeeded (the existing code already assumes
> __collapse_huge_page_copy always succeeds).
> For the failure case, my reasoning is based on __collapse_huge_page_isolate:
> if __collapse_huge_page_isolate fails, we need to pte_unmap() before the
> goto out_up_write, so I would assume that pte_unmap() is also required
> when __collapse_huge_page_copy fails.
Yeah, makes sense.
>
> >
> > > + result = SCAN_FAIL;
> >
> > I think a new result is preferred, for example, SCAN_COPY_MC, it would
> > be helpful for debug.
>
> Version 2 will add this new enum.
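Something like the following should be enough -- a sketch only, the
exact enum placement and the tracepoint wiring are assumptions:

	/* mm/khugepaged.c */
	enum scan_result {
		/* ... existing entries ... */
		SCAN_COPY_MC,	/* copy_mc_to_kernel() hit memory poison */
	};

	/* failure path in collapse_huge_page() */
	if (!copy_succeeded) {
		result = SCAN_COPY_MC;
		goto out_up_write;
	}

(The SCAN_STATUS mapping in include/trace/events/huge_memory.h would
need a matching entry so the tracepoint can print the new status.)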
>
>
> >
> > > + goto out_up_write;
> > > + }
> > > /*
> > > * spin_lock() below is not the equivalent of smp_wmb(), but
> > > * the smp_wmb() inside __SetPageUptodate() can be reused to
> > > @@ -1364,9 +1424,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > > pte_unmap_unlock(pte, ptl);
> > > if (ret) {
> > > node = khugepaged_find_target_node();
> > > - /* collapse_huge_page will return with the mmap_lock released */
> > > - collapse_huge_page(mm, address, hpage, node,
> > > - referenced, unmapped);
> > > + /*
> > > + * collapse_huge_page will return with the mmap_r+w_lock released.
> > > + * It is uncertain whether *hpage is NULL when collapse_huge_page
> > > + * returns, so keep ret=1 to jump to breakouterloop_mmap_lock
> > > + * in khugepaged_scan_mm_slot; *hpage will then be freed
> > > + * if the collapse failed.
> >
> > It may not be true for the !NUMA case: the huge page may be reused, so
> > you need to make sure the huge page is cleared before returning. I had
> > a patch that removes the special case for !NUMA
> > (https://lore.kernel.org/linux-mm/20210817202146.3218-1-shy828301@gmail.com/),
> > which could avoid clearing the huge page; I should resurrect it.
>
> Thanks for catching this! Clearing the hugepage before returning is for
> security reasons or to avoid memory corruption, right?
Yes, but it should be fine since users typically clear the page before
using it; for example, do_anonymous_page() calls
alloc_zeroed_user_highpage_movable() to allocate a page, which clears
the page before returning. So this should not block you, but it reminds
me to take another look at my patch.
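For reference, the path I mean is in do_anonymous_page() in mm/memory.c;
roughly (quoting from memory, so treat it as a sketch):

	/* allocate a zeroed page for the anonymous write fault */
	page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
	if (!page)
		goto oom;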
> If my understanding is correct, this will be fine once your patch is
> merged. Your patch seems to be a very simple one; do you expect it to
> be merged soon?
It still has some unresolved review comments, and I haven't had time to
work on it yet.
>
> >
> > > + */
> > > + collapse_huge_page(mm, address, hpage, node, referenced, unmapped);
> > > }
> > > out:
> > > trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> > > @@ -2168,6 +2233,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> > > khugepaged_scan_file(mm, file, pgoff, hpage);
> > > fput(file);
> > > } else {
> > > + /*
> > > + * mmap_read_lock is
> > > + * 1) released if both scan and collapse succeeded;
> > > + * 2) still held if either scan or collapse failed.
> > > + */
> > > ret = khugepaged_scan_pmd(mm, vma,
> > > khugepaged_scan.address,
> > > hpage);
> > > --
> > > 2.35.1.894.gb6a874cedc-goog
> > >
>
> Thanks for the quick and valuable feedback.
> I will send out the v2 RFC; please reply there. Thanks!
I may not be able to take a deep look at v2 soon; I will be off for the
following week. I will take a look at it once I'm back.