From: Yang Shi <shy828301@gmail.com>
Date: Tue, 28 Mar 2023 08:58:45 -0700
Subject: Re: [PATCH v11 1/3] mm/khugepaged: recover from poisoned anonymous memory
To: Jiaqi Yan
Cc: kirill.shutemov@linux.intel.com, kirill@shutemov.name, tongtiangen@huawei.com,
 tony.luck@intel.com, naoya.horiguchi@nec.com, linmiaohe@huawei.com,
 linux-mm@kvack.org, akpm@linux-foundation.org, osalvador@suse.de,
 wangkefeng.wang@huawei.com, stevensd@chromium.org, hughd@google.com
In-Reply-To: <20230327211548.462509-2-jiaqiyan@google.com>
References: <20230327211548.462509-1-jiaqiyan@google.com> <20230327211548.462509-2-jiaqiyan@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
On Mon, Mar 27, 2023 at 2:15 PM Jiaqi Yan wrote:
>
> Make __collapse_huge_page_copy return whether copying anonymous pages
> succeeded, and make collapse_huge_page handle the return status.
>
> Break existing PTE scan loop into two for-loops. The first loop copies
> source pages into target huge page, and can fail gracefully when running
> into memory errors in source pages.
> If copying all pages succeeds, the
> second loop releases and clears up these normal pages. Otherwise, the
> second loop rolls back the page table and page states by:
> - re-establishing the original PTEs-to-PMD connection.
> - releasing source pages back to their LRU list.
>
> Tested manually:
> 0. Enable khugepaged on system under test.
> 1. Start a two-thread application. Each thread allocates a chunk of
>    non-huge anonymous memory buffer.
> 2. Pick 4 random buffer locations (2 in each thread) and inject
>    uncorrectable memory errors at corresponding physical addresses.
> 3. Signal both threads to make their memory buffer collapsible, i.e.
>    calling madvise(MADV_HUGEPAGE).
> 4. Wait and check kernel log: khugepaged is able to recover from poisoned
>    pages and skips collapsing them.
> 5. Signal both threads to inspect their buffer contents and make sure no
>    data corruption.
>
> Signed-off-by: Jiaqi Yan

Reviewed-by: Yang Shi

Just a nit below:

> ---
>  include/trace/events/huge_memory.h |   3 +-
>  mm/khugepaged.c                    | 114 +++++++++++++++++++++++++----
>  2 files changed, 103 insertions(+), 14 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 3e6fb05852f9a..46cce509957ba 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -36,7 +36,8 @@
>         EM( SCAN_ALLOC_HUGE_PAGE_FAIL,  "alloc_huge_page_failed")       \
>         EM( SCAN_CGROUP_CHARGE_FAIL,    "ccgroup_charge_failed")        \
>         EM( SCAN_TRUNCATED,             "truncated")                    \
> -       EMe(SCAN_PAGE_HAS_PRIVATE,      "page_has_private")             \
> +       EM( SCAN_PAGE_HAS_PRIVATE,      "page_has_private")             \
> +       EMe(SCAN_COPY_MC,               "copy_poisoned_page")           \
>
>  #undef EM
>  #undef EMe
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index bee7fd7db380a..bef68286345c8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -55,6 +55,7 @@ enum scan_result {
>         SCAN_CGROUP_CHARGE_FAIL,
>         SCAN_TRUNCATED,
>         SCAN_PAGE_HAS_PRIVATE,
> +       SCAN_COPY_MC,
>  };
>
>  #define CREATE_TRACE_POINTS
>
> @@ -681,20 +682,22 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>         return result;
>  }
>
> -static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> -                                     struct vm_area_struct *vma,
> -                                     unsigned long address,
> -                                     spinlock_t *ptl,
> -                                     struct list_head *compound_pagelist)
> +static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> +                                               pmd_t *pmd,
> +                                               struct vm_area_struct *vma,
> +                                               unsigned long address,
> +                                               spinlock_t *ptl,
> +                                               struct list_head *compound_pagelist)
>  {
> -       struct page *src_page, *tmp;
> +       struct page *src_page;
> +       struct page *tmp;
>         pte_t *_pte;
> -       for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> -            _pte++, page++, address += PAGE_SIZE) {
> -               pte_t pteval = *_pte;
> +       pte_t pteval;
>
> +       for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +            _pte++, address += PAGE_SIZE) {
> +               pteval = *_pte;
>                 if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> -                       clear_user_highpage(page, address);
>                         add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>                         if (is_zero_pfn(pte_pfn(pteval))) {
>                                 /*
> @@ -706,7 +709,6 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>                         }
>                 } else {
>                         src_page = pte_page(pteval);
> -                       copy_user_highpage(page, src_page, address, vma);
>                         if (!PageCompound(src_page))
>                                 release_pte_page(src_page);
>                         /*
> @@ -733,6 +735,88 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>         }
>  }
>
> +static void __collapse_huge_page_copy_failed(pte_t *pte,
> +                                            pmd_t *pmd,
> +                                            pmd_t orig_pmd,
> +                                            struct vm_area_struct *vma,
> +                                            unsigned long address,

It looks like "address" is not used at all. It could be removed.

> +                                            struct list_head *compound_pagelist)
> +{
> +       spinlock_t *pmd_ptl;
> +
> +       /*
> +        * Re-establish the PMD to point to the original page table
> +        * entry. Restoring PMD needs to be done prior to releasing
> +        * pages. Since pages are still isolated and locked here,
> +        * acquiring anon_vma_lock_write is unnecessary.
> +        */
> +       pmd_ptl = pmd_lock(vma->vm_mm, pmd);
> +       pmd_populate(vma->vm_mm, pmd, pmd_pgtable(orig_pmd));
> +       spin_unlock(pmd_ptl);
> +       /*
> +        * Release both raw and compound pages isolated
> +        * in __collapse_huge_page_isolate.
> +        */
> +       release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> +}
> +
> +/*
> + * __collapse_huge_page_copy - attempts to copy memory contents from raw
> + * pages to a hugepage. Cleans up the raw pages if copying succeeds;
> + * otherwise restores the original page table and releases isolated raw pages.
> + * Returns SCAN_SUCCEED if copying succeeds, otherwise returns SCAN_COPY_MC.
> + *
> + * @pte: starting of the PTEs to copy from
> + * @page: the new hugepage to copy contents to
> + * @pmd: pointer to the new hugepage's PMD
> + * @orig_pmd: the original raw pages' PMD
> + * @vma: the original raw pages' virtual memory area
> + * @address: starting address to copy
> + * @pte_ptl: lock on raw pages' PTEs
> + * @compound_pagelist: list that stores compound pages
> + */
> +static int __collapse_huge_page_copy(pte_t *pte,
> +                                    struct page *page,
> +                                    pmd_t *pmd,
> +                                    pmd_t orig_pmd,
> +                                    struct vm_area_struct *vma,
> +                                    unsigned long address,
> +                                    spinlock_t *pte_ptl,
> +                                    struct list_head *compound_pagelist)
> +{
> +       struct page *src_page;
> +       pte_t *_pte;
> +       pte_t pteval;
> +       unsigned long _address;
> +       int result = SCAN_SUCCEED;
> +
> +       /*
> +        * Copying pages' contents is subject to memory poison at any iteration.
> +        */
> +       for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> +            _pte++, page++, _address += PAGE_SIZE) {
> +               pteval = *_pte;
> +               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +                       clear_user_highpage(page, _address);
> +                       continue;
> +               }
> +               src_page = pte_page(pteval);
> +               if (copy_mc_user_highpage(page, src_page, _address, vma) > 0) {
> +                       result = SCAN_COPY_MC;
> +                       break;
> +               }
> +       }
> +
> +       if (likely(result == SCAN_SUCCEED))
> +               __collapse_huge_page_copy_succeeded(pte, pmd, vma, address,
> +                                                   pte_ptl, compound_pagelist);
> +       else
> +               __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> +                                                address, compound_pagelist);
> +
> +       return result;
> +}
> +
>  static void khugepaged_alloc_sleep(void)
>  {
>         DEFINE_WAIT(wait);
> @@ -1106,9 +1190,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>          */
>         anon_vma_unlock_write(vma->anon_vma);
>
> -       __collapse_huge_page_copy(pte, hpage, vma, address, pte_ptl,
> -                                 &compound_pagelist);
> +       result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd,
> +                                          vma, address, pte_ptl,
> +                                          &compound_pagelist);
>         pte_unmap(pte);
> +       if (unlikely(result != SCAN_SUCCEED))
> +               goto out_up_write;
> +
>         /*
>          * spin_lock() below is not the equivalent of smp_wmb(), but
>          * the smp_wmb() inside __SetPageUptodate() can be reused to
> --
> 2.40.0.348.gf938b09366-goog
>