From: Qiujun Huang <hqjagain@gmail.com>
Date: Sat, 21 Sep 2019 08:20:56 +0800
Subject: Re: [PATCH 3/3] mm:fix gup_pud_range
To: John Hubbard
Cc: akpm@linux-foundation.org, ira.weiny@intel.com, jgg@ziepe.ca, dan.j.williams@intel.com, rppt@linux.ibm.com, "Aneesh Kumar K.V", keith.busch@intel.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <1a162778-41b9-4428-1058-82aaf82314b1@nvidia.com>
References: <1568994684-1425-1-git-send-email-hqjagain@gmail.com> <1a162778-41b9-4428-1058-82aaf82314b1@nvidia.com>

>On 9/20/19 8:51 AM, Qiujun Huang wrote:
>> __get_user_pages_fast try to walk the page table but the
>> hugepage pte is replace by hwpoison swap entry by mca path.
>
>I expect you mean MCE (machine check exception), rather than mca?

Yeah.

>
>> ...
>> [15798.177437] mce: Uncorrected hardware memory error in
>>                user-access at 224f1761c0
>> [15798.180171] MCE 0x224f176: Killing pal_main:6784 due to
>>                hardware memory corruption
>> [15798.180176] MCE 0x224f176: Killing qemu-system-x86:167336
>>                due to hardware memory corruption
>> ...
>> [15798.180206] BUG: unable to handle kernel
>> [15798.180226] paging request at ffff891200003000
>> [15798.180236] IP: [<ffffffff8106edae>] gup_pud_range+
>>                0x13e/0x1e0
>> ...
>>
>> We need to skip the hwpoison entry in gup_pud_range.
>
>It would be nice if this spelled out a little more clearly what's
>wrong. I think you and Aneesh are saying that the entry is really
>a swap entry, created by the MCE response to a bad page?

The call chain

    do_machine_check ->
        do_memory_failure ->
            memory_failure ->
                hwpoison_user_mappings

replaces the PUD-level page table entry with a hwpoison swap entry:

static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
		     unsigned long address, void *arg)
{
...
	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
		pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
		if (PageHuge(page)) {
			int nr = 1 << compound_order(page);
			hugetlb_count_sub(nr, mm);
			set_huge_swap_pte_at(mm, address,
					     pvmw.pte, pteval,
					     vma_mmu_pagesize(vma));
		} else {
			dec_mm_counter(mm, mm_counter(page));
			set_pte_at(mm, address, pvmw.pte, pteval);
		}
...

gup_pud_range then dereferences that pud entry:

gup_pud_range -> gup_pmd_range:

static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
		int write, struct page **pages, int *nr)
{
	unsigned long next;
	pmd_t *pmdp;

	pmdp = pmd_offset(&pud, addr);
	do {
		pmd_t pmd = *pmdp;  <-- pmdp was derived from the hwpoison swap
		                        entry (ffff891200003000); dereferencing
		                        it causes the crash above
...
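For illustration only (this helper is neither in the kernel tree nor in the
patch; the name pte_is_hwpoison() is made up here), a minimal sketch of how a
walker could recognise such a poisoned entry before following it, using the
<linux/swapops.h> helpers. The same reasoning applies one level up at the
pud/pmd, which is what the hugepage case hits:

/*
 * Sketch: a hwpoison swap entry is a non-present entry whose swap type is
 * the hwpoison type, so its bits must never be treated as a page-table
 * pointer.
 */
static bool pte_is_hwpoison(pte_t pte)
{
	swp_entry_t entry;

	if (pte_none(pte) || pte_present(pte))
		return false;			/* empty, or a real mapping */

	entry = pte_to_swp_entry(pte);
	return is_hwpoison_entry(entry);	/* poisoned page: do not follow */
}
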
>
>>
>> Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
>> ---
>>  mm/gup.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 98f13ab..6157ed9 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2230,6 +2230,8 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>>               next = pud_addr_end(addr, end);
>>               if (pud_none(pud))
>>                       return 0;
>> +             if (unlikely(!pud_present(pud)))
>> +                     return 0;
>
>If the MCE hwpoison behavior puts in swap entries, then it seems like all
>page table walkers would need to check for p*d_present(), and maybe at all
>levels too, right?

I think so.

>
>thanks,
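As an aside on the "all levels" point, a hedged sketch (not part of the
submitted patch; the name gup_pmd_range_guarded() and the surrounding
structure are assumptions) of what the analogous guard might look like in
the pmd-level walk, mirroring the gup_pud_range() hunk quoted above:

static int gup_pmd_range_guarded(pud_t pud, unsigned long addr,
				 unsigned long end, int write,
				 struct page **pages, int *nr)
{
	unsigned long next;
	pmd_t *pmdp;

	pmdp = pmd_offset(&pud, addr);
	do {
		pmd_t pmd = READ_ONCE(*pmdp);

		next = pmd_addr_end(addr, end);
		if (pmd_none(pmd))
			return 0;
		if (unlikely(!pmd_present(pmd)))
			return 0;	/* swap/hwpoison entry: bail out so the
					   slow GUP path can deal with it */
		/* ... existing huge/normal pmd handling would follow here ... */
	} while (pmdp++, addr = next, addr != end);

	return 1;
}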