Date: Wed, 12 Oct 2022 02:59:18 +0300
From: "Kirill A. Shutemov"
To: Jiaqi Yan
Cc: shy828301@gmail.com, tongtiangen@huawei.com, tony.luck@intel.com,
	naoya.horiguchi@nec.com, kirill.shutemov@linux.intel.com,
	linmiaohe@huawei.com, linux-mm@kvack.org, akpm@linux-foundation.org
Subject: Re: [PATCH v5 1/2] mm/khugepaged: recover from poisoned anonymous memory
Message-ID: <20221011235918.hvefriya4m3qdhr2@box.shutemov.name>
References: <20221010160142.1087120-1-jiaqiyan@google.com>
 <20221010160142.1087120-2-jiaqiyan@google.com>
In-Reply-To: <20221010160142.1087120-2-jiaqiyan@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
On Mon, Oct 10, 2022 at 09:01:41AM -0700, Jiaqi Yan wrote:
> Make __collapse_huge_page_copy return whether
> collapsing/copying anonymous pages succeeded,
> and make collapse_huge_page handle the return status.
> 
> Break existing PTE scan loop into two for-loops.
> The first loop copies source pages into target huge page,
> and can fail gracefully when running into memory errors in
> source pages. If copying all pages succeeds, the second loop
> releases and clears up these normal pages.
> Otherwise, the second loop does the following to
> roll back the page table and page states:
> 1) re-establish the original PTEs-to-PMD connection.
> 2) release source pages back to their LRU list.
> 
> Signed-off-by: Jiaqi Yan
> ---
>  include/linux/highmem.h            |  19 +++++
>  include/trace/events/huge_memory.h |   3 +-
>  mm/khugepaged.c                    | 130 ++++++++++++++++++++++-------
>  3 files changed, 121 insertions(+), 31 deletions(-)
> 
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index 25679035ca283..91a65bdabcb33 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -332,6 +332,25 @@ static inline void copy_highpage(struct page *to, struct page *from)
> 
>  #endif
> 
> +/*
> + * Machine check exception handled version of copy_highpage.
> + * Return true if copying page content failed; otherwise false.
> + * Note handling #MC requires arch opt-in.
> + */
> +static inline bool copy_highpage_mc(struct page *to, struct page *from)
> +{
> +	char *vfrom, *vto;
> +	unsigned long ret;
> +
> +	vfrom = kmap_local_page(from);
> +	vto = kmap_local_page(to);
> +	ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> +	kunmap_local(vto);
> +	kunmap_local(vfrom);
> +
> +	return ret > 0;
> +}
> +
>  static inline void memcpy_page(struct page *dst_page, size_t dst_off,
>  			       struct page *src_page, size_t src_off,
>  			       size_t len)
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index d651f3437367d..756e991366639 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -33,7 +33,8 @@
>  	EM( SCAN_ALLOC_HUGE_PAGE_FAIL,	"alloc_huge_page_failed")	\
>  	EM( SCAN_CGROUP_CHARGE_FAIL,	"ccgroup_charge_failed")	\
>  	EM( SCAN_TRUNCATED,		"truncated")			\
> -	EMe(SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
> +	EM( SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
> +	EMe(SCAN_COPY_MC,		"copy_poisoned_page")		\
> 
>  #undef EM
>  #undef EMe
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 70b7ac66411c0..552e7cb4c8b42 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -51,6 +51,7 @@ enum scan_result {
>  	SCAN_CGROUP_CHARGE_FAIL,
>  	SCAN_TRUNCATED,
>  	SCAN_PAGE_HAS_PRIVATE,
> +	SCAN_COPY_MC,
>  };
> 
>  #define CREATE_TRACE_POINTS
> @@ -673,44 +674,99 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	return 0;
>  }
> 
> -static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> -				      struct vm_area_struct *vma,
> -				      unsigned long address,
> -				      spinlock_t *ptl,
> -				      struct list_head *compound_pagelist)
> +/*
> + * __collapse_huge_page_copy - attempts to copy memory contents from normal
> + * pages to a hugepage. Cleanup the normal pages if copying succeeds;
> + * otherwise restore the original page table and release isolated normal pages.
> + * Returns true if copying succeeds, otherwise returns false.
> + *
> + * @pte: starting of the PTEs to copy from
> + * @page: the new hugepage to copy contents to
> + * @pmd: pointer to the new hugepage's PMD
> + * @rollback: the original normal pages' PMD
> + * @vma: the original normal pages' virtual memory area
> + * @address: starting address to copy
> + * @pte_ptl: lock on normal pages' PTEs
> + * @compound_pagelist: list that stores compound pages
> + */
> +static bool __collapse_huge_page_copy(pte_t *pte,
> +				      struct page *page,
> +				      pmd_t *pmd,
> +				      pmd_t rollback,
> +				      struct vm_area_struct *vma,
> +				      unsigned long address,
> +				      spinlock_t *pte_ptl,
> +				      struct list_head *compound_pagelist)
>  {
>  	struct page *src_page, *tmp;
>  	pte_t *_pte;
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, page++, address += PAGE_SIZE) {
> -		pte_t pteval = *_pte;
> +	pte_t pteval;
> +	unsigned long _address;
> +	spinlock_t *pmd_ptl;
> +	bool copy_succeeded = true;
> 
> +	/*
> +	 * Copying pages' contents is subject to memory poison at any iteration.
> +	 */
> +	for (_pte = pte, _address = address;
> +	     _pte < pte + HPAGE_PMD_NR;
> +	     _pte++, page++, _address += PAGE_SIZE) {
> +		pteval = *_pte;
> +
> +		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
> +			clear_user_highpage(page, _address);
> +		else {
> +			src_page = pte_page(pteval);
> +			if (copy_highpage_mc(page, src_page)) {
> +				copy_succeeded = false;
> +				break;
> +			}
> +		}
> +	}
> +
> +	if (!copy_succeeded) {
> +		/*
> +		 * Copying failed, re-establish the regular PMD that points to
> +		 * the regular page table. Restoring PMD needs to be done prior
> +		 * to releasing pages. Since pages are still isolated and locked
> +		 * here, acquiring anon_vma_lock_write is unnecessary.
> +		 */
> +		pmd_ptl = pmd_lock(vma->vm_mm, pmd);
> +		pmd_populate(vma->vm_mm, pmd, pmd_pgtable(rollback));
> +		spin_unlock(pmd_ptl);
> +	}

Initially I expected return here, but you handle copy_succeeded below. Hm.
> +
> +	for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> +	     _pte++, _address += PAGE_SIZE) {
> +		pteval = *_pte;
>  		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> -			clear_user_highpage(page, address);
> -			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> -			if (is_zero_pfn(pte_pfn(pteval))) {
> -				/*
> -				 * ptl mostly unnecessary.
> -				 */
> -				spin_lock(ptl);
> -				ptep_clear(vma->vm_mm, address, _pte);
> -				spin_unlock(ptl);
> +			if (copy_succeeded) {
> +				add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> +				if (is_zero_pfn(pte_pfn(pteval))) {
> +					/*
> +					 * pte_ptl mostly unnecessary.
> +					 */
> +					spin_lock(pte_ptl);
> +					pte_clear(vma->vm_mm, _address, _pte);
> +					spin_unlock(pte_ptl);
> +				}
>  			}

So this branch is a NOP if !copy_succeeded, right?

>  		} else {
>  			src_page = pte_page(pteval);
> -			copy_user_highpage(page, src_page, address, vma);
>  			if (!PageCompound(src_page))
>  				release_pte_page(src_page);

And this branch only calls release_pte_page() (which I believe would screw up
the statistics). Looks very broken. Or just hard to follow. Or both.

Please consider reworking the code further to streamline handling of the
!copy_succeeded situation.

> -			/*
> -			 * ptl mostly unnecessary, but preempt has to
> -			 * be disabled to update the per-cpu stats
> -			 * inside page_remove_rmap().
> -			 */
> -			spin_lock(ptl);
> -			ptep_clear(vma->vm_mm, address, _pte);
> -			page_remove_rmap(src_page, vma, false);
> -			spin_unlock(ptl);
> -			free_page_and_swap_cache(src_page);
> +			if (copy_succeeded) {
> +				/*
> +				 * pte_ptl mostly unnecessary, but preempt
> +				 * has to be disabled to update the per-cpu
> +				 * stats inside page_remove_rmap().
> +				 */
> +				spin_lock(pte_ptl);
> +				ptep_clear(vma->vm_mm, _address, _pte);
> +				page_remove_rmap(src_page, vma, false);
> +				spin_unlock(pte_ptl);
> +				free_page_and_swap_cache(src_page);
> +			}
>  		}
>  	}
> 
> @@ -723,6 +779,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>  		free_swap_cache(src_page);
>  		putback_lru_page(src_page);
>  	}
> +
> +	return copy_succeeded;
>  }
> 
>  static void khugepaged_alloc_sleep(void)
> @@ -1009,6 +1067,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
>  	gfp_t gfp;
> +	bool copied = false;
> 
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> 
> @@ -1121,9 +1180,13 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	 */
>  	anon_vma_unlock_write(vma->anon_vma);
> 
> -	__collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
> -				  &compound_pagelist);
> +	copied = __collapse_huge_page_copy(pte, new_page, pmd, _pmd,
> +			vma, address, pte_ptl, &compound_pagelist);
>  	pte_unmap(pte);
> +	if (!copied) {
> +		result = SCAN_COPY_MC;
> +		goto out_up_write;
> +	}
>  	/*
>  	 * spin_lock() below is not the equivalent of smp_wmb(), but
>  	 * the smp_wmb() inside __SetPageUptodate() can be reused to
> @@ -2129,6 +2192,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
>  			khugepaged_scan_file(mm, file, pgoff, hpage);
>  			fput(file);
>  		} else {
> +			/*
> +			 * mmap_read_lock is
> +			 * 1) still held if scan failed;
> +			 * 2) released if scan succeeded.
> +			 * It is not affected by collapsing or copying
> +			 * operations.
> +			 */
>  			ret = khugepaged_scan_pmd(mm, vma,
>  					khugepaged_scan.address,
>  					hpage);
> -- 
> 2.38.0.rc1.362.ged0d419d3c-goog
> 

-- 
Kiryl Shutsemau / Kirill A. Shutemov