From: Naoya Horiguchi
To: Wanpeng Li
CC: Mike Kravetz, Michael Ellerman, Andrew Morton, "Punit Agrawal",
 "linux-mm@kvack.org", Michal Hocko, "Aneesh Kumar K.V", Anshuman Khandual,
 "linux-kernel@vger.kernel.org", Benjamin Herrenschmidt,
 "linuxppc-dev@lists.ozlabs.org", kvm, Paolo Bonzini, Xiao Guangrong,
 "lidongchen@tencent.com", "yongkaiwu@tencent.com", Mel Gorman,
 "Kirill A. Shutemov", "Hansen, Dave", Hugh Dickins
Subject: Re: ##freemail## Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
Date: Wed, 21 Aug 2019 05:39:04 +0000
Message-ID: <20190821053904.GA23349@hori.linux.bs1.fc.nec.co.jp>

On Tue, Aug 20, 2019 at 03:03:55PM +0800, Wanpeng Li wrote:
> Cc Mel Gorman, Kirill, Dave Hansen,
> On Tue, 11 Jun 2019 at 07:51, Naoya Horiguchi wrote:
> >
> > On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> > > On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > > > Cc Paolo,
> > > > Hi all,
> > > > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz wrote:
> > > >>
> > > >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> > > >>> Andrew Morton writes:
> > > >>>
> > > >>>> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal wrote:
> > > >>>>
> > > >>>>>>
> > > >>>>>> So I don't think that the above test result means that errors are properly
> > > >>>>>> handled, and the proposed patch should help for arm64.
> > > >>>>>
> > > >>>>> Although, the deviation of pud_huge() avoids a kernel crash the code
> > > >>>>> would be easier to maintain and reason about if arm64 helpers are
> > > >>>>> consistent with expectations by core code.
> > > >>>>>
> > > >>>>> I'll look to update the arm64 helpers once this patch gets merged. But
> > > >>>>> it would be helpful if there was a clear expression of semantics for
> > > >>>>> pud_huge() for various cases. Is there any version that can be used as
> > > >>>>> reference?
> > > >>>>
> > > >>>> Is that an ack or tested-by?
> > > >>>>
> > > >>>> Mike keeps plaintively asking the powerpc developers to take a look,
> > > >>>> but they remain steadfastly in hiding.
> > > >>>
> > > >>> Cc'ing linuxppc-dev is always a good idea :)
> > > >>>
> > > >>
> > > >> Thanks Michael,
> > > >>
> > > >> I was mostly concerned about use cases for soft/hard offline of huge pages
> > > >> larger than PMD_SIZE on powerpc. I know that powerpc supports PGD_SIZE
> > > >> huge pages, and soft/hard offline support was specifically added for this.
> > > >> See, 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB pages
> > > >> at PGD level"
> > > >>
> > > >> This patch will disable that functionality. So, at a minimum this is a
> > > >> 'heads up'. If there are actual use cases that depend on this, then more
> > > >> work/discussions will need to happen. From the e-mail thread on PGD_SIZE
> > > >> support, I can not tell if there is a real use case or this is just a
> > > >> 'nice to have'.
> > > >
> > > > 1GB hugetlbfs pages are used by DPDK and VMs in cloud deployment, we
> > > > encounter gup_pud_range() panic several times in product environment.
> > > > Is there any plan to reenable and fix arch codes?
> > >
> > > I too am aware of slightly more interest in 1G huge pages. Suspect that as
> > > Intel MMU capacity increases to handle more TLB entries there will be more
> > > and more interest.
> > >
> > > Personally, I am not looking at this issue. Perhaps Naoya will comment as
> > > he know most about this code.
> >
> > Thanks for forwarding this to me, I'm feeling that memory error handling
> > on 1GB hugepage is demanded as real use case.
> >
> > >
> > > > In addition, https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > > > The memory in guest can be 1GB/2MB/4K, though the host-backed memory
> > > > are 1GB hugetlbfs pages, after above PUD panic is fixed,
> > > > try_to_unmap() which is called in MCA recovery path will mark the PUD
> > > > hwpoison entry. The guest will vmexit and retry endlessly when
> > > > accessing any memory in the guest which is backed by this 1GB poisoned
> > > > hugetlbfs page. We have a plan to split this 1GB hugetblfs page by 2MB
> > > > hugetlbfs pages/4KB pages, maybe file remap to a virtual address range
> > > > which is 2MB/4KB page granularity, also split the KVM MMU 1GB SPTE
> > > > into 2MB/4KB and mark the offensive SPTE w/ a hwpoison flag, a sigbus
> > > > will be delivered to VM at page fault next time for the offensive
> > > > SPTE. Is this proposal acceptable?
> > >
> > > I am not sure of the error handling design, but this does sound reasonable.
> >
> > I agree that that's better.
> >
> > > That block of code which potentially dissolves a huge page on memory error
> > > is hard to understand and I'm not sure if that is even the 'normal'
> > > functionality. Certainly, we would hate to waste/poison an entire 1G page
> > > for an error on a small subsection.
> >
> > Yes, that's not practical, so we need at first establish the code base for
> > 2GB hugetlb splitting and then extending it to 1GB next.
>
> I found it is not easy to split. There is a unique hugetlb page size
> that is associated with a mounted hugetlbfs filesystem, file remap to
> 2MB/4KB will break this.
> How about hard offline 1GB hugetlb page as
> what has already done in soft offline, replace the corrupted 1GB page
> by new 1GB page through page migration, the offending/corrupted area
> in the original 1GB page doesn't need to be copied into the new page,
> the offending/corrupted area in new page can keep full zero just as it
> is clear during hugetlb page fault, other sub-pages of the original
> 1GB page can be freed to buddy system. The sigbus signal is sent to
> userspace w/ offending/corrupted virtual address, and signal code,
> userspace should take care this.

Splitting hugetlb is simply hard, IMHO. THP splitting took years of
effort by many great kernel developers, and I don't think doing similar
development on hugetlb is a good idea. I thought of converting hugetlb
into THP, but maybe that's not an easy task either.

The "hard offlining via soft offlining" approach sounds new and promising
to me. I guess we don't need a large patchset to do this. So, thanks for
the idea!

- Naoya