From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yang Shi
Date: Mon, 23 May 2022 11:48:34 -0700
Subject: Re: [RFC v2 1/2] mm: khugepaged: recover from poisoned anonymous memory
To: Jiaqi Yan
Cc: Tong Tiangen, Linux MM, Tony Luck, HORIGUCHI NAOYA (堀口 直也),
 "Kirill A. Shutemov", Miaohe Lin, Jue Wang
References: <20220429000947.2172219-1-jiaqiyan@google.com>
 <20220429000947.2172219-2-jiaqiyan@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Fri, May 20, 2022 at 4:34 PM Jiaqi Yan wrote:
>
> On Fri, Apr 29, 2022 at 3:15 PM Yang Shi wrote:
> >
> > On Thu, Apr 28, 2022 at 5:09 PM Jiaqi Yan wrote:
> > >
> > > Make __collapse_huge_page_copy return whether
> > > collapsing/copying anonymous pages succeeded,
> > > and make collapse_huge_page handle the return status.
> > >
> > > Break existing PTE scan loop into two for-loops.
> > > The first loop copies source pages into target huge page,
> > > and can fail gracefully when running into memory errors in
> > > source pages. Roll back the page table and page states
> > > in the 2nd loop when copying failed:
> > > 1) re-establish the PTEs-to-PMD connection.
> > > 2) release pages back to their LRU list.
> >
> > Could you please include a changelog next time? It is really helpful
> > for reviewers.
>
> Of course, I will include change logs for both v2 + v3 changes.
>
> > >
> > > Signed-off-by: Jiaqi Yan
> > > ---
> > >  include/linux/highmem.h |  19 ++++++
> > >  mm/khugepaged.c         | 138 ++++++++++++++++++++++++++++++----------
> > >  2 files changed, 124 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> > > index 39bb9b47fa9cd..0ccb1e92c4b06 100644
> > > --- a/include/linux/highmem.h
> > > +++ b/include/linux/highmem.h
> > > @@ -298,6 +298,25 @@ static inline void copy_highpage(struct page *to, struct page *from)
> > >
> > >  #endif
> > >
> > > +/*
> > > + * Machine check exception handled version of copy_highpage.
> > > + * Return true if copying page content failed; otherwise false.
> > > + * Note handling #MC requires arch opt-in.
> > > + */
> > > +static inline bool copy_highpage_mc(struct page *to, struct page *from)
> > > +{
> > > +        char *vfrom, *vto;
> > > +        unsigned long ret;
> > > +
> > > +        vfrom = kmap_local_page(from);
> > > +        vto = kmap_local_page(to);
> > > +        ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> > > +        kunmap_local(vto);
> > > +        kunmap_local(vfrom);
> > > +
> > > +        return ret > 0;
> > > +}
> > > +
> > >  static inline void memcpy_page(struct page *dst_page, size_t dst_off,
> > >                                 struct page *src_page, size_t src_off,
> > >                                 size_t len)
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 131492fd1148b..8e69a0640e551 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -52,6 +52,7 @@ enum scan_result {
> > >          SCAN_CGROUP_CHARGE_FAIL,
> > >          SCAN_TRUNCATED,
> > >          SCAN_PAGE_HAS_PRIVATE,
> > > +        SCAN_COPY_MC,
> >
> > You need to update the tracepoint in
> > include/trace/events/huge_memory.h to include the new result.
>
> Will add tracepoint in v3.
>
> >
> > >  };
> > >
> > >  #define CREATE_TRACE_POINTS
> > > @@ -739,44 +740,98 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > >          return 0;
> > >  }
> > >
> > > -static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > > -                                      struct vm_area_struct *vma,
> > > -                                      unsigned long address,
> > > -                                      spinlock_t *ptl,
> > > -                                      struct list_head *compound_pagelist)
> > > +/*
> >
> > Better to use "/**" so that it could be converted to kernel doc.
>
> Will update in v3.
>
> >
> > > + * __collapse_huge_page_copy - attempts to copy memory contents from normal
> > > + * pages to a hugepage. Cleanup the normal pages if copying succeeds;
> > > + * otherwise restore the original pmd page table. Returns true if copying
> > > + * succeeds, otherwise returns false.
> > > + *
> > > + * @pte: starting of the PTEs to copy from
> > > + * @page: the new hugepage to copy contents to
> > > + * @pmd: pointer to the new hugepage's PMD
> > > + * @rollback: the original normal PTEs' PMD
> >
> > You may not need pmd and rollback; please see the below comments for the reason.
> >
> > > + * @address: starting address to copy
> > > + * @pte_ptl: lock on normal pages' PTEs
> > > + * @compound_pagelist: list that stores compound pages
> > > + */
> > > +static bool __collapse_huge_page_copy(pte_t *pte,
> > > +                                      struct page *page,
> > > +                                      pmd_t *pmd,
> > > +                                      pmd_t rollback,
> > > +                                      struct vm_area_struct *vma,
> > > +                                      unsigned long address,
> > > +                                      spinlock_t *pte_ptl,
> > > +                                      struct list_head *compound_pagelist)
> > >  {
> > >          struct page *src_page, *tmp;
> > >          pte_t *_pte;
> > > -        for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > > -             _pte++, page++, address += PAGE_SIZE) {
> > > -                pte_t pteval = *_pte;
> > > +        pte_t pteval;
> > > +        unsigned long _address;
> > > +        spinlock_t *pmd_ptl;
> > > +        bool copy_succeeded = true;
> > >
> > > -                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > +        /*
> > > +         * Copying pages' contents is subject to memory poison at any iteration.
> > > +         */
> > > +        for (_pte = pte, _address = address;
> > > +             _pte < pte + HPAGE_PMD_NR;
> > > +             _pte++, page++, _address += PAGE_SIZE) {
> > > +                pteval = *_pte;
> > > +
> > > +                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
> > >                          clear_user_highpage(page, address);
> > > -                        add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > > -                        if (is_zero_pfn(pte_pfn(pteval))) {
> > > -                                /*
> > > -                                 * ptl mostly unnecessary.
> > > -                                 */
> > > -                                spin_lock(ptl);
> > > -                                ptep_clear(vma->vm_mm, address, _pte);
> > > -                                spin_unlock(ptl);
> > > +                else {
> > > +                        src_page = pte_page(pteval);
> > > +                        if (copy_highpage_mc(page, src_page)) {
> > > +                                copy_succeeded = false;
> > > +                                break;
> > > +                        }
> > > +                }
> > > +        }
> > > +
> > > +        if (!copy_succeeded) {
> > > +                /*
> > > +                 * Copying failed, re-establish the regular PMD that
> > > +                 * points to regular page table. Since PTEs are still
> > > +                 * isolated and locked, acquiring anon_vma_lock is unnecessary.
> > > +                 */
> > > +                pmd_ptl = pmd_lock(vma->vm_mm, pmd);
> > > +                pmd_populate(vma->vm_mm, pmd, pmd_pgtable(rollback));
> > > +                spin_unlock(pmd_ptl);
> > > +        }
> >
> > I think the above section could be moved out of
> > __collapse_huge_page_copy(), just like what is done after
> > __collapse_huge_page_isolate() fails.
> >
> > You don't have to restore the pmd here since khugepaged holds write
> > mmap_lock, there can't be page fault running in parallel. Hence you
> > don't have to add pmd and rollback parameters to
> > __collapse_huge_page_copy.
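
For reference, the existing rollback that runs when
__collapse_huge_page_isolate() fails lives in collapse_huge_page() and
looks roughly like the below (quoted from memory from the current
mm/khugepaged.c, so it may not match your tree exactly); the suggestion
above is to handle a copy failure the same way, in the caller:

        spin_lock(pte_ptl);
        isolated = __collapse_huge_page_isolate(vma, address, pte,
                                                &compound_pagelist);
        spin_unlock(pte_ptl);

        if (unlikely(!isolated)) {
                pte_unmap(pte);
                spin_lock(pmd_ptl);
                BUG_ON(!pmd_none(*pmd));
                /*
                 * We can only use set_pmd_at when establishing
                 * hugepmds and never for establishing regular pmds that
                 * points to regular pagetables. Use pmd_populate for that
                 */
                pmd_populate(mm, pmd, pmd_pgtable(_pmd));
                spin_unlock(pmd_ptl);
                anon_vma_unlock_write(vma->anon_vma);
                result = SCAN_FAIL;
                goto out_up_write;
        }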
> >
>
> The reason I re-establish the pmd table here is to avoid anon_vma_lock_write,
> since here PTEs are still isolated and locked. Are you implying that we should
> do the PMD re-establishment after the PTEs are released? And more importantly,
> with or without holding the anon_vma_lock_write?

I think you mean "pages" are still isolated and locked. I see the point.
If the pages are unisolated before the PMD is restored, a parallel
try_to_unmap()/migrate() may fail prematurely.

>
> > > +
> > > +        for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> > > +             _pte++, _address += PAGE_SIZE) {
> > > +                pteval = *_pte;
> > > +                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > +                        if (copy_succeeded) {
> > > +                                add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > > +                                if (is_zero_pfn(pte_pfn(pteval))) {
> > > +                                        /*
> > > +                                         * ptl mostly unnecessary.
> > > +                                         */
> > > +                                        spin_lock(pte_ptl);
> > > +                                        pte_clear(vma->vm_mm, _address, _pte);
> > > +                                        spin_unlock(pte_ptl);
> > > +                                }
> > >                          }
> > >                  } else {
> > >                          src_page = pte_page(pteval);
> > > -                        copy_user_highpage(page, src_page, address, vma);
> > >                          if (!PageCompound(src_page))
> > >                                  release_pte_page(src_page);
> > > -                        /*
> > > -                         * ptl mostly unnecessary, but preempt has to
> > > -                         * be disabled to update the per-cpu stats
> > > -                         * inside page_remove_rmap().
> > > -                         */
> > > -                        spin_lock(ptl);
> > > -                        ptep_clear(vma->vm_mm, address, _pte);
> > > -                        page_remove_rmap(src_page, false);
> > > -                        spin_unlock(ptl);
> > > -                        free_page_and_swap_cache(src_page);
> > > +
> > > +                        if (copy_succeeded) {
> > > +                                /*
> > > +                                 * ptl mostly unnecessary, but preempt has to
> > > +                                 * be disabled to update the per-cpu stats
> > > +                                 * inside page_remove_rmap().
> > > +                                 */
> > > +                                spin_lock(pte_ptl);
> > > +                                pte_clear(vma->vm_mm, _address, _pte);
> > > +                                page_remove_rmap(src_page, false);
> > > +                                spin_unlock(pte_ptl);
> > > +                                free_page_and_swap_cache(src_page);
> > > +                        }
> > >                  }
> > >          }
> > >
> > > @@ -784,6 +839,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > >                  list_del(&src_page->lru);
> > >                  release_pte_page(src_page);
> > >          }
> > > +
> > > +        return copy_succeeded;
> > >  }
> > >
> > >  static void khugepaged_alloc_sleep(void)
> > > @@ -1066,6 +1123,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > >          struct vm_area_struct *vma;
> > >          struct mmu_notifier_range range;
> > >          gfp_t gfp;
> > > +        bool copied = false;
> > >
> > >          VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > >
> > > @@ -1177,9 +1235,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> > >           */
> > >          anon_vma_unlock_write(vma->anon_vma);
> > >
> > > -        __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
> > > -                                  &compound_pagelist);
> > > +        copied = __collapse_huge_page_copy(pte, new_page, pmd, _pmd,
> > > +                        vma, address, pte_ptl, &compound_pagelist);
> > >          pte_unmap(pte);
> > > +        if (!copied) {
> > > +                result = SCAN_COPY_MC;
> > > +                goto out_up_write;
> > > +        }
> > >          /*
> > >           * spin_lock() below is not the equivalent of smp_wmb(), but
> > >           * the smp_wmb() inside __SetPageUptodate() can be reused to
> > > @@ -1364,9 +1426,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > >          pte_unmap_unlock(pte, ptl);
> > >          if (ret) {
> > >                  node = khugepaged_find_target_node();
> > > -                /* collapse_huge_page will return with the mmap_lock released */
> > > -                collapse_huge_page(mm, address, hpage, node,
> > > -                                referenced, unmapped);
> > > +                /*
> > > +                 * collapse_huge_page will return with the mmap_r+w_lock released.
> > > +                 * It is uncertain if *hpage is NULL or not when collapse_huge_page
> > > +                 * returns, so keep ret=1 to jump to breakouterloop_mmap_lock
> > > +                 * in khugepaged_scan_mm_slot, then *hpage will be freed
> > > +                 * if collapse failed.
> > > +                 */
> > > +                collapse_huge_page(mm, address, hpage, node, referenced, unmapped);
> > >          }
> > >  out:
> > >          trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> > > @@ -2168,6 +2235,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> > >                          khugepaged_scan_file(mm, file, pgoff, hpage);
> > >                          fput(file);
> > >                  } else {
> > > +                        /*
> > > +                         * mmap_read_lock is
> > > +                         * 1) released if both scan and collapse succeeded;
> > > +                         * 2) still held if either scan or collapse failed.
> >
> > The #2 doesn't look correct. Even if collapse failed, the mmap_lock is
> > released as long as the scan succeeded.
> >
> > IIUC the collapse does:
> > read unlock (passed in by scan)
> > read lock
> > read unlock
> > write lock
> > write unlock
> >
> You are right! I will update this comment block in v3.
>
> > > +                         */
> > >                          ret = khugepaged_scan_pmd(mm, vma,
> > >                                          khugepaged_scan.address,
> > >                                          hpage);
> > > --
> > > 2.35.1.1178.g4f1659d476-goog
> > >
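
The read-unlock / re-lock sequence listed above maps onto the 5.18-era
collapse_huge_page() roughly as follows. This is a heavily abridged
sketch (error paths and most of the body omitted), not the exact code:

static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
                               struct page **hpage, int node,
                               int referenced, int unmapped)
{
        ...
        /*
         * Drop the mmap read lock that khugepaged_scan_pmd() passed in,
         * so the potentially slow huge page allocation runs unlocked.
         */
        mmap_read_unlock(mm);
        new_page = khugepaged_alloc_page(hpage, gfp, node);
        ...
        /* Re-take the read lock to revalidate the VMA and swap in pages. */
        mmap_read_lock(mm);
        result = hugepage_vma_revalidate(mm, address, &vma);
        ...
        mmap_read_unlock(mm);

        /* The collapse itself is done under the write lock. */
        mmap_write_lock(mm);
        ...
out_up_write:
        mmap_write_unlock(mm);  /* always released before returning */
}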