From: Yang Shi <shy828301@gmail.com>
To: Jiaqi Yan <jiaqiyan@google.com>
Cc: tongtiangen@huawei.com, "Linux MM" <linux-mm@kvack.org>,
	"Tony Luck" <tony.luck@intel.com>,
	"HORIGUCHI NAOYA(堀口 直也)" <naoya.horiguchi@nec.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	"Miaohe Lin" <linmiaohe@huawei.com>, "Jue Wang" <juew@google.com>
Subject: Re: [RFC v2 1/2] mm: khugepaged: recover from poisoned anonymous memory
Date: Fri, 29 Apr 2022 15:15:02 -0700
Message-ID: <CAHbLzkqM+nCyoRFV9oGjbnCFOYzD_jEU1Y_cEoVEww+6bTdXAQ@mail.gmail.com>
In-Reply-To: <20220429000947.2172219-2-jiaqiyan@google.com>

On Thu, Apr 28, 2022 at 5:09 PM Jiaqi Yan <jiaqiyan@google.com> wrote:
>
> Make __collapse_huge_page_copy return whether
> collapsing/copying anonymous pages succeeded,
> and make collapse_huge_page handle the return status.
>
> Break the existing PTE scan loop into two for-loops.
> The first loop copies source pages into the target huge page,
> and can fail gracefully when running into memory errors in
> source pages. When copying fails, the second loop rolls back
> the page table and page states:
> 1) re-establish the PTEs-to-PMD connection.
> 2) release pages back to their LRU list.

Could you please include a changelog next time? It is really helpful
for reviewers.

>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
>  include/linux/highmem.h |  19 ++++++
>  mm/khugepaged.c         | 138 ++++++++++++++++++++++++++++++----------
>  2 files changed, 124 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index 39bb9b47fa9cd..0ccb1e92c4b06 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -298,6 +298,25 @@ static inline void copy_highpage(struct page *to, struct page *from)
>
>  #endif
>
> +/*
> + * Machine-check-handling version of copy_highpage.
> + * Returns true if copying the page content failed; otherwise false.
> + * Note that handling #MC requires arch opt-in.
> + */
> +static inline bool copy_highpage_mc(struct page *to, struct page *from)
> +{
> +       char *vfrom, *vto;
> +       unsigned long ret;
> +
> +       vfrom = kmap_local_page(from);
> +       vto = kmap_local_page(to);
> +       ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> +       kunmap_local(vto);
> +       kunmap_local(vfrom);
> +
> +       return ret > 0;
> +}
> +
>  static inline void memcpy_page(struct page *dst_page, size_t dst_off,
>                                struct page *src_page, size_t src_off,
>                                size_t len)
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 131492fd1148b..8e69a0640e551 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -52,6 +52,7 @@ enum scan_result {
>         SCAN_CGROUP_CHARGE_FAIL,
>         SCAN_TRUNCATED,
>         SCAN_PAGE_HAS_PRIVATE,
> +       SCAN_COPY_MC,

You need to update the tracepoint in
include/trace/events/huge_memory.h to include the new result.
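
E.g. something like the below at the end of the SCAN_STATUS
definitions (untested; "copy_poisoned_page" is just my suggestion
for the string):

 	EM( SCAN_TRUNCATED,		"truncated")			\
-	EMe(SCAN_PAGE_HAS_PRIVATE,	"page_has_private")
+	EM( SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
+	EMe(SCAN_COPY_MC,		"copy_poisoned_page")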

>  };
>
>  #define CREATE_TRACE_POINTS
> @@ -739,44 +740,98 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>         return 0;
>  }
>
> -static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> -                                     struct vm_area_struct *vma,
> -                                     unsigned long address,
> -                                     spinlock_t *ptl,
> -                                     struct list_head *compound_pagelist)
> +/*

Better to use "/**" so that it can be picked up as kernel-doc.
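
E.g. roughly in this form (a sketch of the kernel-doc layout only,
not the exact wording):

/**
 * __collapse_huge_page_copy - copy contents of normal pages into a
 * newly allocated hugepage
 * @pte: start of the PTEs to copy from
 * @page: the new hugepage to copy contents to
 * ...
 *
 * Return: true if copying succeeded, otherwise false.
 */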

> + * __collapse_huge_page_copy - attempts to copy memory contents from normal
> + * pages to a hugepage. Clean up the normal pages if copying succeeds;
> + * otherwise restore the original pmd page table. Returns true if copying
> + * succeeds, otherwise returns false.
> + *
> + * @pte: start of the PTEs to copy from
> + * @page: the new hugepage to copy contents to
> + * @pmd: pointer to the new hugepage's PMD
> + * @rollback: the original normal PTEs' PMD

You may not need pmd and rollback; please see the comments below for
the reason.

> + * @address: starting address to copy
> + * @pte_ptl: lock on normal pages' PTEs
> + * @compound_pagelist: list that stores compound pages
> + */
> +static bool __collapse_huge_page_copy(pte_t *pte,
> +                               struct page *page,
> +                               pmd_t *pmd,
> +                               pmd_t rollback,
> +                               struct vm_area_struct *vma,
> +                               unsigned long address,
> +                               spinlock_t *pte_ptl,
> +                               struct list_head *compound_pagelist)
>  {
>         struct page *src_page, *tmp;
>         pte_t *_pte;
> -       for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> -                               _pte++, page++, address += PAGE_SIZE) {
> -               pte_t pteval = *_pte;
> +       pte_t pteval;
> +       unsigned long _address;
> +       spinlock_t *pmd_ptl;
> +       bool copy_succeeded = true;
>
> -               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +       /*
> +        * Copying pages' contents is subject to memory poison at any iteration.
> +        */
> +       for (_pte = pte, _address = address;
> +                       _pte < pte + HPAGE_PMD_NR;
> +                       _pte++, page++, _address += PAGE_SIZE) {
> +               pteval = *_pte;
> +
> +               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
> -                       clear_user_highpage(page, address);
> +                       clear_user_highpage(page, _address);
> -                       add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> -                       if (is_zero_pfn(pte_pfn(pteval))) {
> -                               /*
> -                                * ptl mostly unnecessary.
> -                                */
> -                               spin_lock(ptl);
> -                               ptep_clear(vma->vm_mm, address, _pte);
> -                               spin_unlock(ptl);
> +               else {
> +                       src_page = pte_page(pteval);
> +                       if (copy_highpage_mc(page, src_page)) {
> +                               copy_succeeded = false;
> +                               break;
> +                       }
> +               }
> +       }
> +
> +       if (!copy_succeeded) {
> +               /*
> +                * Copying failed: re-establish the regular PMD that
> +                * points to the regular page table. Since the PTEs are still
> +                * isolated and locked, acquiring the anon_vma_lock is unnecessary.
> +                */
> +               pmd_ptl = pmd_lock(vma->vm_mm, pmd);
> +               pmd_populate(vma->vm_mm, pmd, pmd_pgtable(rollback));
> +               spin_unlock(pmd_ptl);
> +       }

I think the above section could be moved out of
__collapse_huge_page_copy(), just as is done when
__collapse_huge_page_isolate() fails.

You don't have to restore the pmd here: khugepaged holds the write
mmap_lock, so no page fault can run in parallel. Hence you don't
need to add the pmd and rollback parameters to
__collapse_huge_page_copy().
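
I.e. something like the below in collapse_huge_page(), mirroring the
existing handling of an __collapse_huge_page_isolate() failure
(untested sketch):

	copied = __collapse_huge_page_copy(pte, new_page, vma, address,
			pte_ptl, &compound_pagelist);
	pte_unmap(pte);
	if (!copied) {
		spin_lock(pmd_ptl);
		BUG_ON(!pmd_none(*pmd));
		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
		spin_unlock(pmd_ptl);
		result = SCAN_COPY_MC;
		goto out_up_write;
	}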

> +
> +       for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> +                       _pte++, _address += PAGE_SIZE) {
> +               pteval = *_pte;
> +               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +                       if (copy_succeeded) {
> +                               add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> +                               if (is_zero_pfn(pte_pfn(pteval))) {
> +                                       /*
> +                                        * ptl mostly unnecessary.
> +                                        */
> +                                       spin_lock(pte_ptl);
> +                                       pte_clear(vma->vm_mm, _address, _pte);
> +                                       spin_unlock(pte_ptl);
> +                               }
>                         }
>                 } else {
>                         src_page = pte_page(pteval);
> -                       copy_user_highpage(page, src_page, address, vma);
>                         if (!PageCompound(src_page))
>                                 release_pte_page(src_page);
> -                       /*
> -                        * ptl mostly unnecessary, but preempt has to
> -                        * be disabled to update the per-cpu stats
> -                        * inside page_remove_rmap().
> -                        */
> -                       spin_lock(ptl);
> -                       ptep_clear(vma->vm_mm, address, _pte);
> -                       page_remove_rmap(src_page, false);
> -                       spin_unlock(ptl);
> -                       free_page_and_swap_cache(src_page);
> +
> +                       if (copy_succeeded) {
> +                               /*
> +                                * ptl mostly unnecessary, but preempt has to
> +                                * be disabled to update the per-cpu stats
> +                                * inside page_remove_rmap().
> +                                */
> +                               spin_lock(pte_ptl);
> +                               pte_clear(vma->vm_mm, _address, _pte);
> +                               page_remove_rmap(src_page, false);
> +                               spin_unlock(pte_ptl);
> +                               free_page_and_swap_cache(src_page);
> +                       }
>                 }
>         }
>
> @@ -784,6 +839,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>                 list_del(&src_page->lru);
>                 release_pte_page(src_page);
>         }
> +
> +       return copy_succeeded;
>  }
>
>  static void khugepaged_alloc_sleep(void)
> @@ -1066,6 +1123,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>         struct vm_area_struct *vma;
>         struct mmu_notifier_range range;
>         gfp_t gfp;
> +       bool copied = false;
>
>         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -1177,9 +1235,13 @@ static void collapse_huge_page(struct mm_struct *mm,
>          */
>         anon_vma_unlock_write(vma->anon_vma);
>
> -       __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
> -                       &compound_pagelist);
> +       copied = __collapse_huge_page_copy(pte, new_page, pmd, _pmd,
> +                       vma, address, pte_ptl, &compound_pagelist);
>         pte_unmap(pte);
> +       if (!copied) {
> +               result = SCAN_COPY_MC;
> +               goto out_up_write;
> +       }
>         /*
>          * spin_lock() below is not the equivalent of smp_wmb(), but
>          * the smp_wmb() inside __SetPageUptodate() can be reused to
> @@ -1364,9 +1426,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>         pte_unmap_unlock(pte, ptl);
>         if (ret) {
>                 node = khugepaged_find_target_node();
> -               /* collapse_huge_page will return with the mmap_lock released */
> -               collapse_huge_page(mm, address, hpage, node,
> -                               referenced, unmapped);
> +               /*
> +                * collapse_huge_page will return with the mmap read+write
> +                * locks released. It is uncertain whether *hpage is NULL
> +                * when collapse_huge_page returns, so keep ret=1 to jump to
> +                * breakouterloop_mmap_lock in khugepaged_scan_mm_slot;
> +                * *hpage will then be freed if the collapse failed.
> +                */
> +               collapse_huge_page(mm, address, hpage, node, referenced, unmapped);
>         }
>  out:
>         trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> @@ -2168,6 +2235,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
>                                 khugepaged_scan_file(mm, file, pgoff, hpage);
>                                 fput(file);
>                         } else {
> +                               /*
> +                                * mmap_read_lock is
> +                                * 1) released if both scan and collapse succeeded;
> +                                * 2) still held if either scan or collapse failed.

The #2 doesn't look correct. Even if the collapse fails, the
mmap_lock is released as long as the scan succeeds.

IIUC the collapse does:
    read unlock (passed in by scan)
    read lock
    read unlock
    write lock
    write unlock
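
I.e. roughly this flow (simplified from the current
collapse_huge_page() from memory, so double check the details):

	mmap_read_unlock(mm);	/* the lock taken by khugepaged_scan_pmd() */
	new_page = khugepaged_alloc_page(hpage, gfp, node);
	...
	mmap_read_lock(mm);	/* revalidate the vma */
	...
	mmap_read_unlock(mm);
	...
	mmap_write_lock(mm);	/* do the actual collapse */
	...
out_up_write:
	mmap_write_unlock(mm);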

> +                                */
>                                 ret = khugepaged_scan_pmd(mm, vma,
>                                                 khugepaged_scan.address,
>                                                 hpage);
> --
> 2.35.1.1178.g4f1659d476-goog
>

