From: Yang Shi
Date: Tue, 24 May 2022 11:41:22 -0700
Subject: Re: [PATCH v3 1/2] mm: khugepaged: recover from poisoned anonymous memory
To: Jiaqi Yan
Cc: tongtiangen@huawei.com, Tony Luck, HORIGUCHI NAOYA(堀口 直也),
	"Kirill A. Shutemov", Miaohe Lin, Jue Wang, Linux MM
In-Reply-To: <20220524025352.1381911-2-jiaqiyan@google.com>
References: <20220524025352.1381911-1-jiaqiyan@google.com> <20220524025352.1381911-2-jiaqiyan@google.com>
On Mon, May 23, 2022 at 7:54 PM Jiaqi Yan wrote:
>
> Make __collapse_huge_page_copy return whether
> collapsing/copying anonymous pages succeeded,
> and make collapse_huge_page handle the return status.
>
> Break existing PTE scan loop into two for-loops.
> The first loop copies source pages into target huge page,
> and can fail gracefully when running into memory errors in
> source pages. Roll back the page table and page states
> when copying failed:
> 1) re-establish the PTEs-to-PMD connection.
> 2) release pages back to their LRU list.
>
> Signed-off-by: Jiaqi Yan
> ---
>  include/linux/highmem.h            | 19 ++++
>  include/trace/events/huge_memory.h | 27 +++++-
>  mm/khugepaged.c                    | 146 ++++++++++++++++++++++-------
>  3 files changed, 158 insertions(+), 34 deletions(-)
>
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index 39bb9b47fa9cd..0ccb1e92c4b06 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -298,6 +298,25 @@ static inline void copy_highpage(struct page *to, struct page *from)
>
>  #endif
>
> +/*
> + * Machine check exception handled version of copy_highpage.
> + * Return true if copying page content failed; otherwise false.
> + * Note handling #MC requires arch opt-in.
> + */
> +static inline bool copy_highpage_mc(struct page *to, struct page *from)
> +{
> +        char *vfrom, *vto;
> +        unsigned long ret;
> +
> +        vfrom = kmap_local_page(from);
> +        vto = kmap_local_page(to);
> +        ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> +        kunmap_local(vto);
> +        kunmap_local(vfrom);
> +
> +        return ret > 0;
> +}
> +
>  static inline void memcpy_page(struct page *dst_page, size_t dst_off,
>                                 struct page *src_page, size_t src_off,
>                                 size_t len)
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 4fdb14a81108b..7e02c0aa2f728 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -34,7 +34,8 @@
>          EM( SCAN_ALLOC_HUGE_PAGE_FAIL,  "alloc_huge_page_failed")       \
>          EM( SCAN_CGROUP_CHARGE_FAIL,    "ccgroup_charge_failed")        \
>          EM( SCAN_TRUNCATED,             "truncated")                    \
> -        EMe(SCAN_PAGE_HAS_PRIVATE,      "page_has_private")             \
> +        EM( SCAN_PAGE_HAS_PRIVATE,      "page_has_private")             \
> +        EMe(SCAN_COPY_MC,               "copy_poisoned_page")           \
>
>  #undef EM
>  #undef EMe
> @@ -167,5 +168,29 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
>                  __entry->ret)
>  );
>
> +TRACE_EVENT(mm_collapse_huge_page_copy,

You may have misunderstood my point. I don't mean we need a new
tracepoint; IMHO trace_mm_collapse_huge_page() is good enough. When the
copy fails you already set "result = SCAN_COPY_MC", and that result is
consumed by trace_mm_collapse_huge_page(). You just need the trace code
to interpret the new result, as the SCAN_STATUS hunk above already does.
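IOW the reporting side could just ride on the existing tracepoint. A
rough sketch (untested, from my reading of the current code) of what
collapse_huge_page() ends up doing with your change applied:

        copied = __collapse_huge_page_copy(pte, new_page, pmd, _pmd,
                                           vma, address, pte_ptl,
                                           &compound_pagelist);
        pte_unmap(pte);
        if (!copied) {
                result = SCAN_COPY_MC;
                goto out_up_write;
        }
        ...
        /*
         * collapse_huge_page() already ends with the below, and
         * __print_symbolic() picks up "copy_poisoned_page" once
         * SCAN_COPY_MC is in the SCAN_STATUS table, so no extra
         * tracepoint is needed.
         */
        trace_mm_collapse_huge_page(mm, isolated, result);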
> +
> +        TP_PROTO(struct page *start, struct page *poisoned, int status),
> +
> +        TP_ARGS(start, poisoned, status),
> +
> +        TP_STRUCT__entry(
> +                __field(unsigned long, start_pfn)
> +                __field(unsigned long, poisoned_pfn)
> +                __field(int, status)
> +        ),
> +
> +        TP_fast_assign(
> +                __entry->start_pfn = start ? page_to_pfn(start) : -1;
> +                __entry->poisoned_pfn = poisoned ? page_to_pfn(poisoned) : -1;
> +                __entry->status = status;
> +        ),
> +
> +        TP_printk("start_pfn=0x%lx, poisoned_pfn=0x%lx, status=%s",
> +                __entry->start_pfn,
> +                __entry->poisoned_pfn,
> +                __print_symbolic(__entry->status, SCAN_STATUS))
> +);
> +
>  #endif /* __HUGE_MEMORY_H */
>  #include <trace/define_trace.h>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 131492fd1148b..1b08e31ba081a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -52,6 +52,7 @@ enum scan_result {
>          SCAN_CGROUP_CHARGE_FAIL,
>          SCAN_TRUNCATED,
>          SCAN_PAGE_HAS_PRIVATE,
> +        SCAN_COPY_MC,
>  };
>
>  #define CREATE_TRACE_POINTS
> @@ -739,44 +740,104 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>          return 0;
>  }
>
> -static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> -                                      struct vm_area_struct *vma,
> -                                      unsigned long address,
> -                                      spinlock_t *ptl,
> -                                      struct list_head *compound_pagelist)
> +/*
> + * __collapse_huge_page_copy - attempts to copy memory contents from normal
> + * pages to a hugepage. Cleanup the normal pages if copying succeeds;
> + * otherwise restore the original page table and release isolated normal pages.
> + * Returns true if copying succeeds, otherwise returns false.
> + *
> + * @pte: starting of the PTEs to copy from
> + * @page: the new hugepage to copy contents to
> + * @pmd: pointer to the new hugepage's PMD
> + * @rollback: the orginal normal pages' PMD

s/orginal/original

> + * @address: starting address to copy
> + * @pte_ptl: lock on normal pages' PTEs
> + * @compound_pagelist: list that stores compound pages
> + */
> +static bool __collapse_huge_page_copy(pte_t *pte,
> +                                      struct page *page,
> +                                      pmd_t *pmd,
> +                                      pmd_t rollback,
> +                                      struct vm_area_struct *vma,
> +                                      unsigned long address,
> +                                      spinlock_t *pte_ptl,
> +                                      struct list_head *compound_pagelist)
>  {
>          struct page *src_page, *tmp;
>          pte_t *_pte;
> -        for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> -             _pte++, page++, address += PAGE_SIZE) {
> -                pte_t pteval = *_pte;
> +        pte_t pteval;
> +        unsigned long _address;
> +        spinlock_t *pmd_ptl;
> +        bool copy_succeeded = true;
>
> -                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +        /*
> +         * Copying pages' contents is subject to memory poison at any iteration.
> +         */
> +        for (_pte = pte, _address = address;
> +             _pte < pte + HPAGE_PMD_NR;
> +             _pte++, page++, _address += PAGE_SIZE) {
> +                pteval = *_pte;
> +
> +                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
>                          clear_user_highpage(page, address);
> -                        add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> -                        if (is_zero_pfn(pte_pfn(pteval))) {
> -                                /*
> -                                 * ptl mostly unnecessary.
> -                                 */
> -                                spin_lock(ptl);
> -                                ptep_clear(vma->vm_mm, address, _pte);
> -                                spin_unlock(ptl);
> +                else {
> +                        src_page = pte_page(pteval);
> +                        if (copy_highpage_mc(page, src_page)) {
> +                                copy_succeeded = false;
> +                                trace_mm_collapse_huge_page_copy(pte_page(*pte),
> +                                                src_page, SCAN_COPY_MC);

As aforementioned I don't think we need a new trace point.
> +                                break;
> +                        }
> +                }
> +        }
> +
> +        if (copy_succeeded)
> +                trace_mm_collapse_huge_page_copy(pte_page(*pte),
> +                                NULL, SCAN_SUCCEED);
> +        else {
> +                /*
> +                 * Copying failed, re-establish the regular PMD that points to
> +                 * the regular page table. Restoring PMD needs to be done prior
> +                 * to releasing pages. Since pages are still isolated and locked
> +                 * here, acquiring anon_vma_lock_write is unnecessary.
> +                 */
> +                pmd_ptl = pmd_lock(vma->vm_mm, pmd);
> +                pmd_populate(vma->vm_mm, pmd, pmd_pgtable(rollback));
> +                spin_unlock(pmd_ptl);
> +        }
> +
> +        for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> +             _pte++, _address += PAGE_SIZE) {
> +                pteval = *_pte;
> +                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +                        if (copy_succeeded) {
> +                                add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> +                                if (is_zero_pfn(pte_pfn(pteval))) {
> +                                        /*
> +                                         * ptl mostly unnecessary.
> +                                         */
> +                                        spin_lock(pte_ptl);
> +                                        pte_clear(vma->vm_mm, _address, _pte);
> +                                        spin_unlock(pte_ptl);
> +                                }
> +                        }
>                  } else {
>                          src_page = pte_page(pteval);
> -                        copy_user_highpage(page, src_page, address, vma);
>                          if (!PageCompound(src_page))
>                                  release_pte_page(src_page);
> -                        /*
> -                         * ptl mostly unnecessary, but preempt has to
> -                         * be disabled to update the per-cpu stats
> -                         * inside page_remove_rmap().
> -                         */
> -                        spin_lock(ptl);
> -                        ptep_clear(vma->vm_mm, address, _pte);
> -                        page_remove_rmap(src_page, false);
> -                        spin_unlock(ptl);
> -                        free_page_and_swap_cache(src_page);
> +
> +                        if (copy_succeeded) {
> +                                /*
> +                                 * ptl mostly unnecessary, but preempt has to
> +                                 * be disabled to update the per-cpu stats
> +                                 * inside page_remove_rmap().
> +                                 */
> +                                spin_lock(pte_ptl);
> +                                pte_clear(vma->vm_mm, _address, _pte);
> +                                page_remove_rmap(src_page, false);
> +                                spin_unlock(pte_ptl);
> +                                free_page_and_swap_cache(src_page);
> +                        }
>                  }
>          }
>
> @@ -784,6 +845,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>                  list_del(&src_page->lru);
>                  release_pte_page(src_page);
>          }
> +
> +        return copy_succeeded;
>  }
>
>  static void khugepaged_alloc_sleep(void)
> @@ -1066,6 +1129,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>          struct vm_area_struct *vma;
>          struct mmu_notifier_range range;
>          gfp_t gfp;
> +        bool copied = false;
>
>          VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -1177,9 +1241,13 @@ static void collapse_huge_page(struct mm_struct *mm,
>           */
>          anon_vma_unlock_write(vma->anon_vma);
>
> -        __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
> -                                  &compound_pagelist);
> +        copied = __collapse_huge_page_copy(pte, new_page, pmd, _pmd,
> +                        vma, address, pte_ptl, &compound_pagelist);
>          pte_unmap(pte);
> +        if (!copied) {
> +                result = SCAN_COPY_MC;
> +                goto out_up_write;
> +        }
>          /*
>           * spin_lock() below is not the equivalent of smp_wmb(), but
>           * the smp_wmb() inside __SetPageUptodate() can be reused to
> @@ -1364,9 +1432,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>          pte_unmap_unlock(pte, ptl);
>          if (ret) {
>                  node = khugepaged_find_target_node();
> -                /* collapse_huge_page will return with the mmap_lock released */
> -                collapse_huge_page(mm, address, hpage, node,
> -                                referenced, unmapped);
> +                /*
> +                 * collapse_huge_page will return with the mmap_r+w_lock released.
> +                 * It is uncertain if *hpage is NULL or not when collapse_huge_page

Why is it uncertain?

> +                 * returns, so keep ret=1 to jump to breakouterloop_mmap_lock

"ret=1" means scan succeeded so mmap_lock is released. But this happens
in the caller.

> +                 * in khugepaged_scan_mm_slot, then *hpage will be freed
> +                 * if collapse failed.
It depends. With CONFIG_NUMA it will be freed, since the next scan may
need a page from a different node; with !CONFIG_NUMA it may not be
freed and may be reused, IIUC. We do intend to remove this optimization,
but that has not happened yet. TBH I don't think the comment is helpful
here, and it is actually confusing; I'd prefer to have it right before
the call to khugepaged_scan_pmd() if you would like to keep it.

> +                 */
> +                collapse_huge_page(mm, address, hpage, node, referenced, unmapped);
>          }
>  out:
>          trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> @@ -2168,6 +2241,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
>                                  khugepaged_scan_file(mm, file, pgoff, hpage);
>                                  fput(file);
>                          } else {
> +                                /*
> +                                 * mmap_read_lock is
> +                                 * 1) still held if scan failed;
> +                                 * 2) released if scan succeeded.
> +                                 * It is not affected by collapsing or copying
> +                                 * operations.
> +                                 */
>                                  ret = khugepaged_scan_pmd(mm, vma,
>                                                  khugepaged_scan.address,
>                                                  hpage);
> --
> 2.36.1.124.g0e6072fb45-goog
>
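Just to spell out the locking contract I mean: the caller side in
khugepaged_scan_mm_slot() is roughly the below (untested, from my
reading of the current code), which is why ret=1 has to keep meaning
"mmap_lock was released":

        ret = khugepaged_scan_pmd(mm, vma,
                                  khugepaged_scan.address, hpage);
        /* move to next address */
        khugepaged_scan.address += HPAGE_PMD_SIZE;
        progress += HPAGE_PMD_NR;
        if (ret)
                /* we released mmap_lock so break loop */
                goto breakouterloop_mmap_lock;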