From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yang Shi
Date: Mon, 23 May 2022 11:48:34 -0700
Subject: Re: [RFC v2 1/2] mm: khugepaged: recover from poisoned anonymous memory
To: Jiaqi Yan
Cc: Tong Tiangen, Linux MM, Tony Luck, HORIGUCHI NAOYA (堀口 直也),
 "Kirill A. Shutemov", Miaohe Lin, Jue Wang
References: <20220429000947.2172219-1-jiaqiyan@google.com>
 <20220429000947.2172219-2-jiaqiyan@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Fri, May 20, 2022 at 4:34 PM Jiaqi Yan wrote:
>
> On Fri, Apr 29, 2022 at 3:15 PM Yang Shi wrote:
> >
> > On Thu, Apr 28, 2022 at 5:09 PM Jiaqi Yan wrote:
> > >
> > > Make __collapse_huge_page_copy return whether
> > > collapsing/copying anonymous pages succeeded,
> > > and make collapse_huge_page handle the return status.
> > >
> > > Break existing PTE scan loop into two for-loops.
> > > The first loop copies source pages into target huge page,
> > > and can fail gracefully when running into memory errors in
> > > source pages. Roll back the page table and page states
> > > in the 2nd loop when copying failed:
> > > 1) re-establish the PTEs-to-PMD connection.
> > > 2) release pages back to their LRU list.
> >
> > Could you please include a changelog next time? It is really helpful
> > for reviewers.
>
> Of course, I will include change logs for both v2 + v3 changes.
>
> > >
> > > Signed-off-by: Jiaqi Yan
> > > ---
> > >  include/linux/highmem.h |  19 ++++++
> > >  mm/khugepaged.c         | 138 ++++++++++++++++++++++++++++++----------
> > >  2 files changed, 124 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> > > index 39bb9b47fa9cd..0ccb1e92c4b06 100644
> > > --- a/include/linux/highmem.h
> > > +++ b/include/linux/highmem.h
> > > @@ -298,6 +298,25 @@ static inline void copy_highpage(struct page *to, struct page *from)
> > >
> > >  #endif
> > >
> > > +/*
> > > + * Machine check exception handled version of copy_highpage.
> > > + * Return true if copying page content failed; otherwise false.
> > > + * Note handling #MC requires arch opt-in.
> > > + */
> > > +static inline bool copy_highpage_mc(struct page *to, struct page *from)
> > > +{
> > > +        char *vfrom, *vto;
> > > +        unsigned long ret;
> > > +
> > > +        vfrom = kmap_local_page(from);
> > > +        vto = kmap_local_page(to);
> > > +        ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> > > +        kunmap_local(vto);
> > > +        kunmap_local(vfrom);
> > > +
> > > +        return ret > 0;
> > > +}
> > > +
> > >  static inline void memcpy_page(struct page *dst_page, size_t dst_off,
> > >                                 struct page *src_page, size_t src_off,
> > >                                 size_t len)
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 131492fd1148b..8e69a0640e551 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -52,6 +52,7 @@ enum scan_result {
> > >          SCAN_CGROUP_CHARGE_FAIL,
> > >          SCAN_TRUNCATED,
> > >          SCAN_PAGE_HAS_PRIVATE,
> > > +        SCAN_COPY_MC,
> >
> > You need to update the tracepoint in
> > include/trace/events/huge_memory.h to include the new result.
>
> Will add tracepoint in v3.
>
> >
> > >  };
> > >
> > >  #define CREATE_TRACE_POINTS
> > > @@ -739,44 +740,98 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > >          return 0;
> > >  }
> > >
> > > -static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > > -                                      struct vm_area_struct *vma,
> > > -                                      unsigned long address,
> > > -                                      spinlock_t *ptl,
> > > -                                      struct list_head *compound_pagelist)
> > > +/*
> >
> > Better to use "/**" so that it could be converted to kernel doc.
>
> Will update in v3.
>
> >
> > > + * __collapse_huge_page_copy - attempts to copy memory contents from normal
> > > + * pages to a hugepage. Cleanup the normal pages if copying succeeds;
> > > + * otherwise restore the original pmd page table. Returns true if copying
> > > + * succeeds, otherwise returns false.
> > > + *
> > > + * @pte: starting of the PTEs to copy from
> > > + * @page: the new hugepage to copy contents to
> > > + * @pmd: pointer to the new hugepage's PMD
> > > + * @rollback: the original normal PTEs' PMD
> >
> > You may not need pmd and rollback; please see the below comments for the reason.
> >
> > > + * @address: starting address to copy
> > > + * @pte_ptl: lock on normal pages' PTEs
> > > + * @compound_pagelist: list that stores compound pages
> > > + */
> > > +static bool __collapse_huge_page_copy(pte_t *pte,
> > > +                                      struct page *page,
> > > +                                      pmd_t *pmd,
> > > +                                      pmd_t rollback,
> > > +                                      struct vm_area_struct *vma,
> > > +                                      unsigned long address,
> > > +                                      spinlock_t *pte_ptl,
> > > +                                      struct list_head *compound_pagelist)
> > >  {
> > >          struct page *src_page, *tmp;
> > >          pte_t *_pte;
> > > -        for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > > -             _pte++, page++, address += PAGE_SIZE) {
> > > -                pte_t pteval = *_pte;
> > > +        pte_t pteval;
> > > +        unsigned long _address;
> > > +        spinlock_t *pmd_ptl;
> > > +        bool copy_succeeded = true;
> > >
> > > -                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > +        /*
> > > +         * Copying pages' contents is subject to memory poison at any iteration.
> > > +         */
> > > +        for (_pte = pte, _address = address;
> > > +             _pte < pte + HPAGE_PMD_NR;
> > > +             _pte++, page++, _address += PAGE_SIZE) {
> > > +                pteval = *_pte;
> > > +
> > > +                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
> > >                          clear_user_highpage(page, address);
> > > -                        add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > > -                        if (is_zero_pfn(pte_pfn(pteval))) {
> > > -                                /*
> > > -                                 * ptl mostly unnecessary.
> > > -                                 */
> > > -                                spin_lock(ptl);
> > > -                                ptep_clear(vma->vm_mm, address, _pte);
> > > -                                spin_unlock(ptl);
> > > +                else {
> > > +                        src_page = pte_page(pteval);
> > > +                        if (copy_highpage_mc(page, src_page)) {
> > > +                                copy_succeeded = false;
> > > +                                break;
> > > +                        }
> > > +                }
> > > +        }
> > > +
> > > +        if (!copy_succeeded) {
> > > +                /*
> > > +                 * Copying failed, re-establish the regular PMD that
> > > +                 * points to regular page table. Since PTEs are still
> > > +                 * isolated and locked, acquiring anon_vma_lock is unnecessary.
> > > +                 */
> > > +                pmd_ptl = pmd_lock(vma->vm_mm, pmd);
> > > +                pmd_populate(vma->vm_mm, pmd, pmd_pgtable(rollback));
> > > +                spin_unlock(pmd_ptl);
> > > +        }
> >
> > I think the above section could be moved out of
> > __collapse_huge_page_copy(), just like what is done after
> > __collapse_huge_page_isolate() fails.
> >
> > You don't have to restore the pmd here since khugepaged holds write
> > mmap_lock, there can't be page fault running in parallel. Hence you
> > don't have to add pmd and rollback parameters to
> > __collapse_huge_page_copy.
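
For reference, the existing rollback that runs when
__collapse_huge_page_isolate() fails lives in collapse_huge_page() and
looks roughly like the below (quoted from memory from the current
mm/khugepaged.c, so it may not match your tree exactly); the suggestion
above is to handle a copy failure the same way, in the caller:

        spin_lock(pte_ptl);
        isolated = __collapse_huge_page_isolate(vma, address, pte,
                                                &compound_pagelist);
        spin_unlock(pte_ptl);

        if (unlikely(!isolated)) {
                pte_unmap(pte);
                spin_lock(pmd_ptl);
                BUG_ON(!pmd_none(*pmd));
                /*
                 * We can only use set_pmd_at when establishing
                 * hugepmds and never for establishing regular pmds that
                 * points to regular pagetables. Use pmd_populate for that
                 */
                pmd_populate(mm, pmd, pmd_pgtable(_pmd));
                spin_unlock(pmd_ptl);
                anon_vma_unlock_write(vma->anon_vma);
                result = SCAN_FAIL;
                goto out_up_write;
        }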
> >
>
> The reason I re-establish the pmd table here is to avoid anon_vma_lock_write,
> since here PTEs are still isolated and locked. Are you implying that we should
> do the PMD re-establishment after the PTEs are released? And more importantly,
> with or without holding the anon_vma_lock_write?

I think you mean "pages" are still isolated and locked. I see the point.
If the pages are unisolated before the PMD is restored, a parallel
try_to_unmap()/migrate() may fail prematurely.

>
> > > +
> > > +        for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> > > +             _pte++, _address += PAGE_SIZE) {
> > > +                pteval = *_pte;
> > > +                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > +                        if (copy_succeeded) {
> > > +                                add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > > +                                if (is_zero_pfn(pte_pfn(pteval))) {
> > > +                                        /*
> > > +                                         * ptl mostly unnecessary.
> > > +                                         */
> > > +                                        spin_lock(pte_ptl);
> > > +                                        pte_clear(vma->vm_mm, _address, _pte);
> > > +                                        spin_unlock(pte_ptl);
> > > +                                }
> > >                          }
> > >                  } else {
> > >                          src_page = pte_page(pteval);
> > > -                        copy_user_highpage(page, src_page, address, vma);
> > >                          if (!PageCompound(src_page))
> > >                                  release_pte_page(src_page);
> > > -                        /*
> > > -                         * ptl mostly unnecessary, but preempt has to
> > > -                         * be disabled to update the per-cpu stats
> > > -                         * inside page_remove_rmap().
> > > -                         */
> > > -                        spin_lock(ptl);
> > > -                        ptep_clear(vma->vm_mm, address, _pte);
> > > -                        page_remove_rmap(src_page, false);
> > > -                        spin_unlock(ptl);
> > > -                        free_page_and_swap_cache(src_page);
> > > +
> > > +                        if (copy_succeeded) {
> > > +                                /*
> > > +                                 * ptl mostly unnecessary, but preempt has to
> > > +                                 * be disabled to update the per-cpu stats
> > > +                                 * inside page_remove_rmap().
> > > +                                 */
> > > +                                spin_lock(pte_ptl);
> > > +                                pte_clear(vma->vm_mm, _address, _pte);
> > > +                                page_remove_rmap(src_page, false);
> > > +                                spin_unlock(pte_ptl);
> > > +                                free_page_and_swap_cache(src_page);
> > > +                        }
> > >                  }
> > >          }
> > >
> > > @@ -784,6 +839,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > >                  list_del(&src_page->lru);
> > >                  release_pte_page(src_page);
> > >          }
> > > +
> > > +        return copy_succeeded;
> > >  }
> > >
> > >  static void khugepaged_alloc_sleep(void)
> > > @@ -1066,6 +1123,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > >          struct vm_area_struct *vma;
> > >          struct mmu_notifier_range range;
> > >          gfp_t gfp;
> > > +        bool copied = false;
> > >
> > >          VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > >
> > > @@ -1177,9 +1235,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> > >           */
> > >          anon_vma_unlock_write(vma->anon_vma);
> > >
> > > -        __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
> > > -                                  &compound_pagelist);
> > > +        copied = __collapse_huge_page_copy(pte, new_page, pmd, _pmd,
> > > +                        vma, address, pte_ptl, &compound_pagelist);
> > >          pte_unmap(pte);
> > > +        if (!copied) {
> > > +                result = SCAN_COPY_MC;
> > > +                goto out_up_write;
> > > +        }
> > >          /*
> > >           * spin_lock() below is not the equivalent of smp_wmb(), but
> > >           * the smp_wmb() inside __SetPageUptodate() can be reused to
> > > @@ -1364,9 +1426,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > >          pte_unmap_unlock(pte, ptl);
> > >          if (ret) {
> > >                  node = khugepaged_find_target_node();
> > > -                /* collapse_huge_page will return with the mmap_lock released */
> > > -                collapse_huge_page(mm, address, hpage, node,
> > > -                                referenced, unmapped);
> > > +                /*
> > > +                 * collapse_huge_page will return with the mmap_r+w_lock released.
> > > +                 * It is uncertain if *hpage is NULL or not when collapse_huge_page
> > > +                 * returns, so keep ret=1 to jump to breakouterloop_mmap_lock
> > > +                 * in khugepaged_scan_mm_slot, then *hpage will be freed
> > > +                 * if collapse failed.
> > > +                 */
> > > +                collapse_huge_page(mm, address, hpage, node, referenced, unmapped);
> > >          }
> > >  out:
> > >          trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> > > @@ -2168,6 +2235,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> > >                          khugepaged_scan_file(mm, file, pgoff, hpage);
> > >                          fput(file);
> > >                  } else {
> > > +                        /*
> > > +                         * mmap_read_lock is
> > > +                         * 1) released if both scan and collapse succeeded;
> > > +                         * 2) still held if either scan or collapse failed.
> >
> > The #2 doesn't look correct. Even if collapse failed, the mmap_lock is
> > released as long as the scan succeeded.
> >
> > IIUC the collapse does:
> > read unlock (passed in by scan)
> > read lock
> > read unlock
> > write lock
> > write unlock
> >
> You are right! I will update this comment block in v3.
>
> > > +                         */
> > >                          ret = khugepaged_scan_pmd(mm, vma,
> > >                                          khugepaged_scan.address,
> > >                                          hpage);
> > > --
> > > 2.35.1.1178.g4f1659d476-goog
> > >
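
The read-unlock / re-lock sequence listed above maps onto the 5.18-era
collapse_huge_page() roughly as follows. This is a heavily abridged
sketch (error paths and most of the body omitted), not the exact code:

static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
                               struct page **hpage, int node,
                               int referenced, int unmapped)
{
        ...
        /*
         * Drop the mmap read lock that khugepaged_scan_pmd() passed in,
         * so the potentially slow huge page allocation runs unlocked.
         */
        mmap_read_unlock(mm);
        new_page = khugepaged_alloc_page(hpage, gfp, node);
        ...
        /* Re-take the read lock to revalidate the VMA and swap in pages. */
        mmap_read_lock(mm);
        result = hugepage_vma_revalidate(mm, address, &vma);
        ...
        mmap_read_unlock(mm);

        /* The collapse itself is done under the write lock. */
        mmap_write_lock(mm);
        ...
out_up_write:
        mmap_write_unlock(mm);  /* always released before returning */
}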